[ https://issues.apache.org/jira/browse/ARROW-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347226#comment-17347226 ]
Sergey Mozharov commented on ARROW-12609:
-----------------------------------------

[~jorisvandenbossche] My original thinking (which I am now questioning) was based on the following assumptions:

* All properties of a {{ListScalar}} are independent of its validity (null or not null). That is, the {{ListScalar}} API could guarantee that every instance has a length, is iterable, and so on, with some imperfect but reasonable behavior assigned to null scalars.
* The benefit of this approach is that users can write simpler pyarrow code: they would not need to handle exceptions when a null {{ListScalar}} is encountered in cases where the presence or absence of null scalars does not matter for what they are trying to achieve.
* When the validity of scalars does matter, the {{is_valid}} property can be used to tell an empty {{ListScalar}} apart from a null {{ListScalar}}.

??you could also argue that a missing list scalar has "no defined length"??

Agreed. This argument makes perfect sense if we think of null scalars as undefined values (the value could be anything, we just don't know it). In that case a dedicated error may be needed to communicate this. I think {{AttributeError}} would be confusing here because {{hasattr(null_scalar, '__len__')}} returns True. {{TypeError}} definitely does not seem right.

I think consistency of the pyarrow API is probably the most important criterion. Assigning length 0 to a null {{ListScalar}} would make the API inconsistent with the behavior of the pyarrow compute kernels. Raising the right kind of error seems like a reasonable solution, because the root cause is that Python does not support undefined values.

If the Arrow developers prefer this direction, then I hope the issue can be resolved in pandas. My use case is the integration of list-like and struct-like Arrow arrays with the pandas Extension Arrays API. I believe this is a very powerful integration that deserves some attention.
On the pandas side, the problem seems to be that pandas attempts to analyze the internal structure of scalars, and the non-standard behavior of Arrow null scalars breaks some important assumptions. I created [pandas issue 41377|https://github.com/pandas-dev/pandas/issues/41377] with a concrete example.

> [Python] TypeError when accessing length of an invalid ListScalar
> -----------------------------------------------------------------
>
>                 Key: ARROW-12609
>                 URL: https://issues.apache.org/jira/browse/ARROW-12609
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0, 4.0.0
>         Environment: Windows 10
> python=3.9.2
> pyarrow=4.0.0 (3.0.0 has the same behavior)
>            Reporter: Sergey Mozharov
>            Priority: Major
>
> For List-like data types, the scalar corresponding to a missing value has a
> '__len__' attribute, but TypeError is raised when it is accessed
> {code:java}
> import pyarrow as pa
> data_type = pa.list_(pa.struct([
>     ('a', pa.int64()),
>     ('b', pa.bool_())
> ]))
> data = [[{'a': 1, 'b': False}, {'a': 2, 'b': True}], None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.ListScalar: None>
> assert hasattr(missing_scalar, '__len__')
> assert len(missing_scalar) == 0  # --> TypeError: object of type 'NoneType' has no len()
> {code}
> Expected behavior: length is expected to be 0.
> This issue causes several pandas unit tests to fail when an ExtensionArray
> backed by an arrow array with this data type is built.
> This behavior is also inconsistent with a similar example where the data type
> is a struct:
> {code:java}
> import pyarrow as pa
> data_type = pa.struct([
>     ('a', pa.int64()),
>     ('b', pa.bool_())
> ])
> data = [{'a': 1, 'b': False}, None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.StructScalar: None>
> assert hasattr(missing_scalar, '__len__')
> assert len(missing_scalar) == 0  # Ok
> {code}
> In this second example the TypeError is not raised.
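
Until the behavior is settled, the {{is_valid}} check mentioned in the comment above can serve as a user-side workaround. The following is only a minimal sketch: {{list_scalar_length}} is a hypothetical helper name, not part of pyarrow, and it assumes only that the scalar exposes the {{is_valid}} property and supports {{len()}} when it is valid. Returning None for a null scalar is one possible convention, chosen here to reflect the "no defined length" reading.

```python
from typing import Optional


def list_scalar_length(scalar) -> Optional[int]:
    """Return the length of a list-like scalar, or None if the scalar is null.

    Hypothetical helper (not a pyarrow API). It is duck-typed: it relies only
    on the `is_valid` property and on `len()` working for valid scalars.
    """
    if not scalar.is_valid:
        # A null scalar has no defined length, so report None instead of 0.
        # This keeps an empty list (length 0) distinguishable from a null.
        return None
    return len(scalar)
```

With the {{arr}} from the issue description, {{list_scalar_length(arr[0])}} gives 2 and {{list_scalar_length(arr[1])}} gives None instead of raising TypeError.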
-- This message was sent by Atlassian Jira (v8.3.4#803005)