[ https://issues.apache.org/jira/browse/ARROW-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gil Forsyth updated ARROW-17582:
--------------------------------
Description:
In [ibis|https://github.com/ibis-project/ibis] we're interested in offering query results as a record batch. Some of the data we're starting with comes back from a {{sqlalchemy.cursor}}; the rows _look_ like {{tuple}}s and {{dict}}s but are actually {{sqlalchemy.engine.row.LegacyRow}} and {{sqlalchemy.engine.row.RowMapping}}, respectively.

The type checks in {{python_to_arrow.cc}} are strict enough that these can't be passed directly to {{pa.array}} without first calling, e.g., {{tuple}} on each row of the results:

{code:java}
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]

In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())])

In [170]: pa.array(batch[:5], type=pa_schema)
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in <cell line: 1>()
----> 1 pa.array(batch[:5], type=pa_schema)
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array()
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, static_cast<int64_t>(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  converter->Extend(seq, size)
{code}
vs
{code:java}
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]:
<pyarrow.lib.StructArray object at 0x7fd4fb52d660>
-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
    1
  ]
-- child 1 type: int32
  [
    2173,
    943,
    892,
    30,
    337
  ]
{code}
To avoid the overhead of this extra conversion, perhaps there are checks that don't require exact Python types that we could rely on instead?
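As a minimal sketch of the interim workaround described above, lazily converting row-like objects to plain tuples before handing them to {{pa.array}}: note that {{FakeRow}} below is a hypothetical stand-in for {{sqlalchemy.engine.row.LegacyRow}}, so the example runs without sqlalchemy or pyarrow installed, and the {{Sequence}} check at the end only illustrates what a protocol-based (rather than exact-type) check could key on.

```python
from collections.abc import Sequence

class FakeRow:
    """Hypothetical stand-in for sqlalchemy.engine.row.LegacyRow:
    indexable, iterable, and sized, but NOT a tuple subclass."""

    def __init__(self, *values):
        self._values = values

    def __getitem__(self, index):
        return self._values[index]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)

# sqlalchemy's row types behave as sequences; registering the stand-in
# with the Sequence ABC mimics that so the protocol check below passes.
Sequence.register(FakeRow)

def rows_as_tuples(rows):
    """Lazily convert row-like objects to plain tuples, one at a time,
    so feeding pa.array never materializes a second copy of the batch."""
    return map(tuple, rows)

batch = [FakeRow(1, 2173), FakeRow(1, 943), FakeRow(1, 892)]
converted = list(rows_as_tuples(batch))
assert converted == [(1, 2173), (1, 943), (1, 892)]

# The strict check pyarrow effectively performs vs. a protocol-based one:
assert not isinstance(batch[0], tuple)  # exact-type check rejects the row
assert isinstance(batch[0], Sequence)   # duck-typed check would accept it
```

Because {{map}} yields one tuple at a time, {{pa.array}} can consume the stream without a second fully materialized list, though it still pays the per-row conversion cost that this issue asks to eliminate.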
> Relax / extend type checking for pyarrow array creation
> -------------------------------------------------------
>
>                 Key: ARROW-17582
>                 URL: https://issues.apache.org/jira/browse/ARROW-17582
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Gil Forsyth
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.10#820010)