[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901968#comment-16901968
 ] 

Antoine Pitrou commented on ARROW-6132:
---

My expectation is that {{ValidateArray}} is O(1), not O(n) in array size.

Perhaps we need a separate {{ValidateArrayData}} that digs deeper...

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in python, there is no 
> validation of the offsets that it starts with 0 and ends with the length of 
> the array (but is that required? the docs seem to indicate that: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` method don't seem to explicitly do this. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901966#comment-16901966
 ] 

Joris Van den Bossche commented on ARROW-6132:
--

I was actually just planning to open an issue for that: should 
{{ValidateArray}} check the indices of a DictionaryArray? 

Not knowingly in detail how {{ValidateArray}} is used internally and what its 
purpose is, I would expect that from the name of that function, but from your 
response it might not?

{{ValidateArray}} for a ListArray also does walk all offsets and check they are 
consistent.

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in python, there is no 
> validation of the offsets that it starts with 0 and ends with the length of 
> the array (but is that required? the docs seem to indicate that: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` method don't seem to explicitly do this. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901954#comment-16901954
 ] 

Antoine Pitrou commented on ARROW-6132:
---

Well, {{DictionaryArray.from_arrays(safe=True)}} is really expensive: it will 
walk all indices and check they are within bounds. {{ValidateArray}} doesn't do 
such a thing.

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in python, there is no 
> validation of the offsets that it starts with 0 and ends with the length of 
> the array (but is that required? the docs seem to indicate that: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` method don't seem to explicitly do this. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901945#comment-16901945
 ] 

Joris Van den Bossche commented on ARROW-6132:
--

{{DictionaryArray.from_arrays}} has a {{safe=True/False}} argument (with 
{{safe=True}} as default) to disable the validity checking. 
Although it is not exactly the same under the hood (DictionaryArray does not 
use the {{ValidateArray}} method), for users it is similar functionality, so I 
could also add such a keyword to {{ListArray.from_arrays}} as well (didn't do 
that yet in the PR https://github.com/apache/arrow/pull/5029).



> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in python, there is no 
> validation of the offsets that it starts with 0 and ends with the length of 
> the array (but is that required? the docs seem to indicate that: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` method don't seem to explicitly do this. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-05 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900260#comment-16900260
 ] 

Uwe L. Korn commented on ARROW-6132:


+1, not getting segfaults or delayed errors on Python APIs is essential.

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in python, there is no 
> validation of the offsets that it starts with 0 and ends with the length of 
> the array (but is that required? the docs seem to indicate that: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` method don't seem to explicitly do this. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-05 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900079#comment-16900079
 ] 

Wes McKinney commented on ARROW-6132:
-

I agree, I think adding validation here is OK

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in python, there is no 
> validation of the offsets that it starts with 0 and ends with the length of 
> the array (but is that required? the docs seem to indicate that: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` method don't seem to explicitly do this. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-05 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900029#comment-16900029
 ] 

Antoine Pitrou commented on ARROW-6132:
---

[~wesmckinn] [~xhochy] What do you think? I think there is value in spending a 
few CPU cycles validating inputs in the Python APIs.

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in python, there is no 
> validation of the offsets that it starts with 0 and ends with the length of 
> the array (but is that required? the docs seem to indicate that: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` method don't seem to explicitly do this. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)