[ https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reassigned ARROW-6132:
--------------------------------------------

    Assignee: Joris Van den Bossche

> [Python] ListArray.from_arrays does not check validity of input arrays
> ----------------------------------------------------------------------
>
>                 Key: ARROW-6132
>                 URL: https://issues.apache.org/jira/browse/ARROW-6132
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, the offsets are
> not validated: nothing checks that they start with 0 and end with the length
> of the values array. (Is that required? The docs seem to indicate so:
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
> says "The first value in the offsets array is 0, and the last element is the
> length of the values array.")
> The array you get "seems" OK (the repr), but on conversion to Python or
> to a flattened array, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5))
> In [62]: a
> Out[62]:
> <pyarrow.lib.ListArray object at 0x7fdd9c468678>
> [
>   [
>     1,
>     2
>   ],
>   [
>     3,
>     4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]:
> <pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe: as the caller, you need to
> ensure that the data is correct or call a safe (slower) constructor. But do
> we want to use the unsafe / fast constructors without validation as the
> default in Python as well? Or should we call {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does
> validation, but the other `from_arrays` methods don't seem to do this
> explicitly.
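
For illustration only, here is a minimal sketch of the kind of up-front check discussed above, written in plain Python so it does not depend on any pyarrow internals. The helper name `check_list_offsets` is hypothetical and not part of pyarrow. It enforces the invariant quoted from Layout.rst (first offset is 0, last offset equals the length of the values array); the non-decreasing check is an extra assumption on my part, and whether Arrow should require any of this at construction time is exactly the open question in this issue.

```python
def check_list_offsets(offsets, values_length):
    """Raise ValueError if `offsets` violates the list-layout invariant.

    Hypothetical helper for illustration; not a pyarrow API.
    """
    offsets = list(offsets)
    if not offsets:
        raise ValueError("offsets must contain at least one element")
    # Invariant quoted from Layout.rst: the first offset is 0.
    if offsets[0] != 0:
        raise ValueError(f"first offset must be 0, got {offsets[0]}")
    # Assumption (not stated in the issue): offsets never decrease.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError(f"offsets must be non-decreasing: {prev} > {cur}")
    # Invariant quoted from Layout.rst: the last offset is the values length.
    if offsets[-1] != values_length:
        raise ValueError(
            f"final offset {offsets[-1]} != values length {values_length}"
        )
```

With the inputs from the report, `check_list_offsets([1, 3, 10], 5)` raises before any array is built. The alternative shown in the report is to construct the array first and then call `a.validate()`, which costs a pass over the offsets but reuses the existing C++ check.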