Joris Van den Bossche created ARROW-5655:
--------------------------------------------

             Summary: [Python] Table.from_pydict/from_arrays not using types in 
specified schema correctly 
                 Key: ARROW-5655
                 URL: https://issues.apache.org/jira/browse/ARROW-5655
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Joris Van den Bossche


Example with {{from_pydict}} (from 
https://github.com/apache/arrow/pull/4601#issuecomment-503676534):

{code:python}
In [15]: table = pa.Table.from_pydict(
    ...:     {'a': [1, 2, 3], 'b': [3, 4, 5]},
    ...:     schema=pa.schema([('a', pa.int64()), ('c', pa.int32())]))

In [16]: table
Out[16]: 
pyarrow.Table
a: int64
c: int32

In [17]: table.to_pandas()
Out[17]: 
   a  c
0  1  3
1  2  0
2  3  4
{code}

Note that the specified schema 1) has different column names ('c' instead of 
'b') and 2) has a non-default type (int32 vs int64), which leads to corrupted 
values.
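
The corrupted values in column 'c' (3, 0, 4) are what one would see if the 
int64 data of 'b' were read as int32 without a cast. The sketch below 
reproduces that pattern with the standard library only; it assumes a 
little-endian layout and is an illustration of the suspected mechanism, not 
a trace of what pyarrow does internally.

```python
import struct

# Column 'b' as it would be stored for the default int64 type
# (assumption: little-endian, 8 bytes per value).
raw = struct.pack('<3q', 3, 4, 5)

# Read the same bytes as int32 values without any casting.
as_int32 = struct.unpack('<6i', raw)

# A 3-row int32 column only looks at the first three 4-byte slots.
print(as_int32[:3])  # (3, 0, 4)
```

Each int64 value occupies two int32 slots, so the first three slots hold the 
low and high halves of 3, followed by the low half of 4.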

This is partly because {{Table.from_pydict}} does not use the type information 
in the schema when converting the dictionary items to pyarrow arrays. In 
addition, {{Table.from_arrays}} does not correctly cast the arrays to another 
dtype when the schema specifies one.

An additional question for {{Table.from_pydict}} is whether it should actually 
override the 'b' key from the dictionary as column 'c' as defined in the schema 
(this behaviour depends on the order of the dictionary, which is not guaranteed 
below Python 3.6).




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
