[ https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs reassigned ARROW-3766:
--------------------------------------

    Assignee: Krisztian Szucs

> [Python] pa.Table.from_pandas doesn't use schema ordering
> ---------------------------------------------------------
>
>                 Key: ARROW-3766
>                 URL: https://issues.apache.org/jira/browse/ARROW-3766
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Christian Thiel
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.12.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Pyarrow is sensitive to the order of the columns upon load of partitioned files.
> With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a schema to a DataFrame. I noticed that the returned {{pa.Table}} object uses the ordering of the pandas columns rather than the schema columns. Furthermore, it is possible to have columns in the schema but not in the DataFrame (and hence not in the resulting {{pa.Table}}).
> This behaviour requires a lot of fiddling with the pandas DataFrame in the first place if we want to write compatible partitioned files. Hence I argue that for {{pa.Table.from_pandas}}, and any other comparable function, the schema should be the principal source for the Table structure, not the columns and their ordering in the pandas DataFrame. If I specify a schema, I simply expect the resulting Table to actually have this schema.
> Here is a little example.
> If you remove the reordering of df2, everything works fine:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import os
> import numpy as np
> import shutil
>
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
>
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name = 'DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
>
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>                        ('partition_column', pa.int32()),
>                        ('arrays', pa.list_(pa.int32())),
>                        ('strings', pa.string()),
>                        ('new_column', pa.string())])
>
> df1 = df[df.partition_column == 0]
> df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
>
> table1 = pa.Table.from_pandas(df1, schema=my_schema)
> table2 = pa.Table.from_pandas(df2, schema=my_schema)
>
> pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
> pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
>
> pd.read_parquet(PATH_PYARROW_MANUAL)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
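Until the fix lands, one possible workaround for the reordering part of the problem is to conform the DataFrame to the schema's column order before calling {{pa.Table.from_pandas}}. The sketch below is an assumption, not part of the reporter's snippet: it uses plain pandas {{DataFrame.reindex}}, which both reorders columns and adds any schema columns missing from the frame (filled with NaN). Column names mirror the example above; the `df2_like` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Column order as declared by the schema in the reproducer above
# (index column DPRD_ID omitted, since it is not a regular column).
schema_names = ['partition_column', 'arrays', 'strings', 'new_column']

# A frame with shuffled columns and one schema column missing,
# analogous to df2 in the reproducer.
df2_like = pd.DataFrame({'strings': ['a', 'b'],
                         'partition_column': [1, 1],
                         'arrays': [np.nan, np.nan]})

# reindex puts the columns into schema order and creates the missing
# 'new_column' filled with NaN, so every partition has the same layout.
df2_conformed = df2_like.reindex(columns=schema_names)

print(list(df2_conformed.columns))
# ['partition_column', 'arrays', 'strings', 'new_column']
```

Passing `df2_conformed` (instead of `df2_like`) to {{pa.Table.from_pandas}} should then produce a Table whose column order matches the schema, which is what the partitioned read relies on.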