[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901968#comment-16901968 ] Antoine Pitrou commented on ARROW-6132: --- My expectation is that {{ValidateArray}} is O(1), not O(n) in array size. Perhaps we need a separate {{ValidateArrayData}} that digs deeper... > [Python] ListArray.from_arrays does not check validity of input arrays > -- > > Key: ARROW-6132 > URL: https://issues.apache.org/jira/browse/ARROW-6132 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > From https://github.com/apache/arrow/pull/4979#issuecomment-517593918. > When creating a ListArray from offsets and values in python, there is no > validation of the offsets that it starts with 0 and ends with the length of > the array (but is that required? the docs seem to indicate that: > https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type > ("The first value in the offsets array is 0, and the last element is the > length of the values array."). > The array you get "seems" ok (the repr), but on conversion to python or > flattened arrays, things go wrong: > {code} > In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) > In [62]: a > Out[62]: > > [ > [ > 1, > 2 > ], > [ > 3, > 4 > ] > ] > In [63]: a.flatten() > Out[63]: > > [ > 0, # <--- includes the 0 > 1, > 2, > 3, > 4 > ] > In [64]: a.to_pylist() > Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]] # <--includes > more elements as garbage > {code} > Calling {{validate}} manually correctly raises: > {code} > In [65]: a.validate() > ... > ArrowInvalid: Final offset invariant not equal to values length: 10!=5 > {code} > In C++ the main constructors are not safe, and as the caller you need to > ensure that the data is correct or call a safe (slower) constructor. But do > we want to use the unsafe / fast constructors without validation in Python as > default as well? Or should we do a call to {{validate}} here? > A quick search seems to indicate that `pa.Array.from_buffers` does > validation, but other `from_arrays` method don't seem to explicitly do this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901966#comment-16901966 ] Joris Van den Bossche commented on ARROW-6132: -- I was actually just planning to open an issue for that: should {{ValidateArray}} check the indices of a DictionaryArray? Not knowingly in detail how {{ValidateArray}} is used internally and what its purpose is, I would expect that from the name of that function, but from your response it might not? {{ValidateArray}} for a ListArray also does walk all offsets and check they are consistent. > [Python] ListArray.from_arrays does not check validity of input arrays > -- > > Key: ARROW-6132 > URL: https://issues.apache.org/jira/browse/ARROW-6132 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > From https://github.com/apache/arrow/pull/4979#issuecomment-517593918. > When creating a ListArray from offsets and values in python, there is no > validation of the offsets that it starts with 0 and ends with the length of > the array (but is that required? the docs seem to indicate that: > https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type > ("The first value in the offsets array is 0, and the last element is the > length of the values array."). > The array you get "seems" ok (the repr), but on conversion to python or > flattened arrays, things go wrong: > {code} > In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) > In [62]: a > Out[62]: > > [ > [ > 1, > 2 > ], > [ > 3, > 4 > ] > ] > In [63]: a.flatten() > Out[63]: > > [ > 0, # <--- includes the 0 > 1, > 2, > 3, > 4 > ] > In [64]: a.to_pylist() > Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]] # <--includes > more elements as garbage > {code} > Calling {{validate}} manually correctly raises: > {code} > In [65]: a.validate() > ... > ArrowInvalid: Final offset invariant not equal to values length: 10!=5 > {code} > In C++ the main constructors are not safe, and as the caller you need to > ensure that the data is correct or call a safe (slower) constructor. But do > we want to use the unsafe / fast constructors without validation in Python as > default as well? Or should we do a call to {{validate}} here? > A quick search seems to indicate that `pa.Array.from_buffers` does > validation, but other `from_arrays` method don't seem to explicitly do this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901954#comment-16901954 ] Antoine Pitrou commented on ARROW-6132: --- Well, {{DictionaryArray.from_arrays(safe=True)}} is really expensive: it will walk all indices and check they are within bounds. {{ValidateArray}} doesn't do such a thing. > [Python] ListArray.from_arrays does not check validity of input arrays > -- > > Key: ARROW-6132 > URL: https://issues.apache.org/jira/browse/ARROW-6132 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > From https://github.com/apache/arrow/pull/4979#issuecomment-517593918. > When creating a ListArray from offsets and values in python, there is no > validation of the offsets that it starts with 0 and ends with the length of > the array (but is that required? the docs seem to indicate that: > https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type > ("The first value in the offsets array is 0, and the last element is the > length of the values array."). > The array you get "seems" ok (the repr), but on conversion to python or > flattened arrays, things go wrong: > {code} > In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) > In [62]: a > Out[62]: > > [ > [ > 1, > 2 > ], > [ > 3, > 4 > ] > ] > In [63]: a.flatten() > Out[63]: > > [ > 0, # <--- includes the 0 > 1, > 2, > 3, > 4 > ] > In [64]: a.to_pylist() > Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]] # <--includes > more elements as garbage > {code} > Calling {{validate}} manually correctly raises: > {code} > In [65]: a.validate() > ... > ArrowInvalid: Final offset invariant not equal to values length: 10!=5 > {code} > In C++ the main constructors are not safe, and as the caller you need to > ensure that the data is correct or call a safe (slower) constructor. But do > we want to use the unsafe / fast constructors without validation in Python as > default as well? Or should we do a call to {{validate}} here? > A quick search seems to indicate that `pa.Array.from_buffers` does > validation, but other `from_arrays` method don't seem to explicitly do this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901945#comment-16901945 ] Joris Van den Bossche commented on ARROW-6132: -- {{DictionaryArray.from_arrays}} has a {{safe=True/False}} argument (with {{safe=True}} as default) to disable the validity checking. Although it is not exactly the same under the hood (DictionaryArray does not use the {{ValidateArray}} method), for users it is similar functionality, so I could also add such a keyword to {{ListArray.from_arrays}} as well (didn't do that yet in the PR https://github.com/apache/arrow/pull/5029). > [Python] ListArray.from_arrays does not check validity of input arrays > -- > > Key: ARROW-6132 > URL: https://issues.apache.org/jira/browse/ARROW-6132 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > From https://github.com/apache/arrow/pull/4979#issuecomment-517593918. > When creating a ListArray from offsets and values in python, there is no > validation of the offsets that it starts with 0 and ends with the length of > the array (but is that required? the docs seem to indicate that: > https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type > ("The first value in the offsets array is 0, and the last element is the > length of the values array."). > The array you get "seems" ok (the repr), but on conversion to python or > flattened arrays, things go wrong: > {code} > In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) > In [62]: a > Out[62]: > > [ > [ > 1, > 2 > ], > [ > 3, > 4 > ] > ] > In [63]: a.flatten() > Out[63]: > > [ > 0, # <--- includes the 0 > 1, > 2, > 3, > 4 > ] > In [64]: a.to_pylist() > Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]] # <--includes > more elements as garbage > {code} > Calling {{validate}} manually correctly raises: > {code} > In [65]: a.validate() > ... > ArrowInvalid: Final offset invariant not equal to values length: 10!=5 > {code} > In C++ the main constructors are not safe, and as the caller you need to > ensure that the data is correct or call a safe (slower) constructor. But do > we want to use the unsafe / fast constructors without validation in Python as > default as well? Or should we do a call to {{validate}} here? > A quick search seems to indicate that `pa.Array.from_buffers` does > validation, but other `from_arrays` method don't seem to explicitly do this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900260#comment-16900260 ] Uwe L. Korn commented on ARROW-6132: +1, not getting segfaults or delayed errors on Python APIs is essential. > [Python] ListArray.from_arrays does not check validity of input arrays > -- > > Key: ARROW-6132 > URL: https://issues.apache.org/jira/browse/ARROW-6132 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > > From https://github.com/apache/arrow/pull/4979#issuecomment-517593918. > When creating a ListArray from offsets and values in python, there is no > validation of the offsets that it starts with 0 and ends with the length of > the array (but is that required? the docs seem to indicate that: > https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type > ("The first value in the offsets array is 0, and the last element is the > length of the values array."). > The array you get "seems" ok (the repr), but on conversion to python or > flattened arrays, things go wrong: > {code} > In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) > In [62]: a > Out[62]: > > [ > [ > 1, > 2 > ], > [ > 3, > 4 > ] > ] > In [63]: a.flatten() > Out[63]: > > [ > 0, # <--- includes the 0 > 1, > 2, > 3, > 4 > ] > In [64]: a.to_pylist() > Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]] # <--includes > more elements as garbage > {code} > Calling {{validate}} manually correctly raises: > {code} > In [65]: a.validate() > ... > ArrowInvalid: Final offset invariant not equal to values length: 10!=5 > {code} > In C++ the main constructors are not safe, and as the caller you need to > ensure that the data is correct or call a safe (slower) constructor. But do > we want to use the unsafe / fast constructors without validation in Python as > default as well? Or should we do a call to {{validate}} here? > A quick search seems to indicate that `pa.Array.from_buffers` does > validation, but other `from_arrays` method don't seem to explicitly do this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900079#comment-16900079 ] Wes McKinney commented on ARROW-6132: - I agree, I think adding validation here is OK > [Python] ListArray.from_arrays does not check validity of input arrays > -- > > Key: ARROW-6132 > URL: https://issues.apache.org/jira/browse/ARROW-6132 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > > From https://github.com/apache/arrow/pull/4979#issuecomment-517593918. > When creating a ListArray from offsets and values in python, there is no > validation of the offsets that it starts with 0 and ends with the length of > the array (but is that required? the docs seem to indicate that: > https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type > ("The first value in the offsets array is 0, and the last element is the > length of the values array."). > The array you get "seems" ok (the repr), but on conversion to python or > flattened arrays, things go wrong: > {code} > In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) > In [62]: a > Out[62]: > > [ > [ > 1, > 2 > ], > [ > 3, > 4 > ] > ] > In [63]: a.flatten() > Out[63]: > > [ > 0, # <--- includes the 0 > 1, > 2, > 3, > 4 > ] > In [64]: a.to_pylist() > Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]] # <--includes > more elements as garbage > {code} > Calling {{validate}} manually correctly raises: > {code} > In [65]: a.validate() > ... > ArrowInvalid: Final offset invariant not equal to values length: 10!=5 > {code} > In C++ the main constructors are not safe, and as the caller you need to > ensure that the data is correct or call a safe (slower) constructor. But do > we want to use the unsafe / fast constructors without validation in Python as > default as well? Or should we do a call to {{validate}} here? > A quick search seems to indicate that `pa.Array.from_buffers` does > validation, but other `from_arrays` method don't seem to explicitly do this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900029#comment-16900029 ] Antoine Pitrou commented on ARROW-6132: --- [~wesmckinn] [~xhochy] What do you think? I think there is value in spending a few CPU cycles validating inputs in the Python APIs. > [Python] ListArray.from_arrays does not check validity of input arrays > -- > > Key: ARROW-6132 > URL: https://issues.apache.org/jira/browse/ARROW-6132 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > > From https://github.com/apache/arrow/pull/4979#issuecomment-517593918. > When creating a ListArray from offsets and values in python, there is no > validation of the offsets that it starts with 0 and ends with the length of > the array (but is that required? the docs seem to indicate that: > https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type > ("The first value in the offsets array is 0, and the last element is the > length of the values array."). > The array you get "seems" ok (the repr), but on conversion to python or > flattened arrays, things go wrong: > {code} > In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) > In [62]: a > Out[62]: > > [ > [ > 1, > 2 > ], > [ > 3, > 4 > ] > ] > In [63]: a.flatten() > Out[63]: > > [ > 0, # <--- includes the 0 > 1, > 2, > 3, > 4 > ] > In [64]: a.to_pylist() > Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]] # <--includes > more elements as garbage > {code} > Calling {{validate}} manually correctly raises: > {code} > In [65]: a.validate() > ... > ArrowInvalid: Final offset invariant not equal to values length: 10!=5 > {code} > In C++ the main constructors are not safe, and as the caller you need to > ensure that the data is correct or call a safe (slower) constructor. But do > we want to use the unsafe / fast constructors without validation in Python as > default as well? Or should we do a call to {{validate}} here? > A quick search seems to indicate that `pa.Array.from_buffers` does > validation, but other `from_arrays` method don't seem to explicitly do this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)