[ 
https://issues.apache.org/jira/browse/ARROW-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Forsyth updated ARROW-17582:
--------------------------------
    Description: 
In [ibis|https://github.com/ibis-project/ibis] we're interested in offering 
query results as a record batch. Some of the data we're starting with comes 
back from a {{sqlalchemy.cursor}}, and the rows _look_ like {{tuple}}s and 
{{dict}}s but are actually {{sqlalchemy.engine.row.LegacyRow}} and 
{{sqlalchemy.engine.row.RowMapping}}, respectively.
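
To make the mismatch concrete, here is a minimal sketch (assuming SQLAlchemy 1.4, an illustrative in-memory SQLite engine, and the column names from the example below rather than the actual ibis code path): the rows satisfy the abstract {{Sequence}}/{{Mapping}} protocols but fail exact {{tuple}}/{{dict}} checks.
{code:python}
from collections.abc import Mapping, Sequence

import sqlalchemy as sa

# Illustrative in-memory database; a SQLAlchemy 1.4 cursor result against any
# backend hands back the same row types.
engine = sa.create_engine("sqlite://")
with engine.connect() as conn:
    row = conn.execute(sa.text("SELECT 1 AS l_orderkey, 2173 AS l_partkey")).fetchone()

print(type(row))                                                          # sqlalchemy.engine.row.LegacyRow
print(isinstance(row, Sequence), isinstance(row, tuple))                  # True, False
print(isinstance(row._mapping, Mapping), isinstance(row._mapping, dict))  # True, False
{code}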

 

The checks in {{python_to_arrow.cc}} are strict enough that these objects can't 
be passed straight to {{pa.array}} without first calling e.g. {{tuple}} on each 
row of the result.

 

 
{code:python}
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]
In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())])
In [170]: pa.array(batch[:5], type=pa_schema)
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in <cell line: 1>()
----> 1 pa.array(batch[:5], type=pa_schema)
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317,
 in pyarrow.lib.array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39,
 in pyarrow.lib._sequence_to_array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144,
 in pyarrow.lib.pyarrow_internal_check_status()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123,
 in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, 
value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  
GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  
InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, 
static_cast<int64_t>(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  
converter->Extend(seq, size)
{code}
vs. the same call with an explicit {{tuple}} conversion:
{code:python}
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]: 
<pyarrow.lib.StructArray object at 0x7fd4fb52d660>
-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
    1
  ]
-- child 1 type: int32
  [
    2173,
    943,
    892,
    30,
    337
  ]{code}
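The same extra copy is needed on the mapping side. A hypothetical sketch ({{rows}} below is a list of plain dicts standing in for {{RowMapping}} objects; with real cursor results one would pass something like {{dict(row._mapping)}}):
{code:python}
import pyarrow as pa

pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())])

# Plain dicts standing in for sqlalchemy.engine.row.RowMapping objects.
rows = [{"l_orderkey": 1, "l_partkey": 2173}, {"l_orderkey": 1, "l_partkey": 943}]

# pa.array accepts real dicts for a struct type, so converting each mapping to a
# dict works, at the cost of the same per-row copy as map(tuple, ...) above.
pa.array([dict(r) for r in rows], type=pa_schema)
{code}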
To avoid the overhead of this extra per-row conversion, maybe there are some 
checks that aren't explicit Python type checks (e.g. protocol or structural 
checks) that we can rely on?
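For illustration only (the real checks live in the C++ converter in {{python_to_arrow.cc}}, not in Python, and {{classify_struct_value}} is just a made-up helper name), the relaxed check being suggested would look roughly like classifying values by protocol rather than by exact type:
{code:python}
from collections.abc import Mapping, Sequence


def classify_struct_value(obj):
    """Classify a candidate struct value by protocol rather than exact type."""
    if isinstance(obj, Mapping):  # covers dict and RowMapping alike
        return "mapping"
    if isinstance(obj, Sequence) and not isinstance(obj, (str, bytes)):
        return "sequence"         # covers tuple, list, and LegacyRow alike
    return None
{code}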

> Relax / extend type checking for pyarrow array creation
> -------------------------------------------------------
>
>                 Key: ARROW-17582
>                 URL: https://issues.apache.org/jira/browse/ARROW-17582
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Gil Forsyth
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
