[ 
https://issues.apache.org/jira/browse/ARROW-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347226#comment-17347226
 ] 

Sergey Mozharov commented on ARROW-12609:
-----------------------------------------

[~jorisvandenbossche] My original reasoning (which I am now questioning) rested 
on the following assumptions:
 * The properties of a {{ListScalar}} and its "validity" (null or not null) are 
independent. That is, the {{ListScalar}} API could guarantee that every 
instance has a length, is iterable, etc., with some imperfect but reasonable 
behavior assigned to null scalars.
 * The benefit of this approach is that users can write simpler pyarrow code: 
when the presence or absence of null scalars does not matter for what they are 
trying to achieve, they do not have to handle exceptions raised by a null 
{{ListScalar}}.
 * When the validity of scalars does matter, the {{is_valid}} property can be 
used to tell an empty {{ListScalar}} from a null {{ListScalar}}.

??you could also argue that a missing list scalar has "no defined length"??

Agreed. This argument makes perfect sense if we think about Null scalars as 
undefined values (it can be anything, we just don't know). In this case a 
dedicated error may be needed to communicate this. I think {{AttributeError}} 
would be confusing here because {{hasattr(null_scalar, '__len__')}} returns 
True. {{TypeError}} definitely does not seem right.

I think pyarrow API consistency is probably the most important criterion. 
Assigning length 0 to a null {{ListScalar}} would make the API inconsistent 
with the behavior of the pyarrow compute kernels. Raising the right kind of 
error seems like a reasonable solution because the root cause is that Python 
does not support undefined values.

If the Arrow developers prefer this direction, then I hope the issue can be 
resolved in pandas. My use case is the integration of list-like and struct-like 
Arrow arrays with the pandas ExtensionArray API. I believe this is a very 
powerful integration that deserves some attention. On the pandas side, the 
problem seems to be that pandas attempts to analyze the internal structure of 
scalars, and the non-standard behavior of Arrow null scalars breaks some 
important assumptions. I created a [pandas issue 
41377|https://github.com/pandas-dev/pandas/issues/41377] related to this with a 
concrete example.

> [Python] TypeError when accessing length of an invalid ListScalar
> -----------------------------------------------------------------
>
>                 Key: ARROW-12609
>                 URL: https://issues.apache.org/jira/browse/ARROW-12609
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0, 4.0.0
>         Environment: Windows 10
> python=3.9.2
> pyarrow=4.0.0 (3.0.0 has the same behavior)
>            Reporter: Sergey Mozharov
>            Priority: Major
>
> For list-like data types, the scalar corresponding to a missing value has a 
> '__len__' attribute, but a TypeError is raised when it is accessed:
> {code:java}
> import pyarrow as pa
> data_type = pa.list_(pa.struct([
>     ('a', pa.int64()),
>     ('b', pa.bool_())
> ]))
> data = [[{'a': 1, 'b': False}, {'a': 2, 'b': True}], None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.ListScalar: None>
> assert hasattr(missing_scalar, '__len__')
> # Raises TypeError: object of type 'NoneType' has no len()
> assert len(missing_scalar) == 0
> {code}
> Expected behavior: length is expected to be 0.
> This issue causes several pandas unit tests to fail when an ExtensionArray 
> backed by an Arrow array with this data type is built.
> This behavior is also inconsistent with a similar example where the data type 
> is a struct:
> {code:java}
> import pyarrow as pa
> data_type = pa.struct([
>     ('a', pa.int64()),
>     ('b', pa.bool_())
> ])
> data = [{'a': 1, 'b': False}, None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.StructScalar: None>
> assert hasattr(missing_scalar, '__len__')
> assert len(missing_scalar) == 0  # Ok
> {code}
>  In this second example the TypeError is not raised.



