[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901968#comment-16901968
 ] 

Antoine Pitrou commented on ARROW-6132:
---------------------------------------

My expectation is that {{ValidateArray}} is O(1), not O(n) in array size.

Perhaps we need a separate {{ValidateArrayData}} that digs deeper...

> [Python] ListArray.from_arrays does not check validity of input arrays
> ----------------------------------------------------------------------
>
>                 Key: ARROW-6132
>                 URL: https://issues.apache.org/jira/browse/ARROW-6132
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in python, there is no 
> validation of the offsets that it starts with 0 and ends with the length of 
> the array (but is that required? the docs seem to indicate that: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> <pyarrow.lib.ListArray object at 0x7fdd9c468678>
> [
>   [
>     1,
>     2
>   ],
>   [
>     3,
>     4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> <pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` method don't seem to explicitly do this. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to