[jira] [Created] (ARROW-4538) pa.Table.from_pandas() with df.index.name != None breaks write_to_dataset()
Christian Thiel created ARROW-4538:
--
Summary: pa.Table.from_pandas() with df.index.name != None breaks write_to_dataset()
Key: ARROW-4538
URL: https://issues.apache.org/jira/browse/ARROW-4538
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.0
Reporter: Christian Thiel

When using {{pa.Table.from_pandas()}} with {{preserve_index=True}} and {{dataframe.index.name != None}}, the prefix {{__index_level_}} is not added to the respective schema name. This breaks {{write_to_dataset}} with active partition columns.

{code:python}
import os
import shutil

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df['arrays'] = pd.Series(arrays)
df.index.name = 'ID'

table = pa.Table.from_pandas(df, preserve_index=True)
print(table.schema.names)

pq.write_to_dataset(table,
                    root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'],
                    preserve_index=True)
{code}

Removing {{df.index.name = 'ID'}} works. Disabling {{partition_cols}} in {{write_to_dataset}} also works.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
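A pure-Python sketch (not pyarrow API) of the schema-name convention the report relies on: unnamed pandas index levels are serialized by {{from_pandas()}} under {{__index_level_<n>__}}, while a named index is stored under its own name, which is the case that trips up {{write_to_dataset()}} here. The helper name is hypothetical, for illustration only:

```python
def index_field_name(level, index_name=None):
    """Schema field name an index level ends up under (hypothetical helper).

    Mirrors the convention described in the report: unnamed levels get the
    "__index_level_<n>__" prefix; a named index keeps its own name.
    """
    if index_name is None:
        return '__index_level_{}__'.format(level)
    return index_name

print(index_field_name(0))        # __index_level_0__
print(index_field_name(0, 'ID'))  # ID
```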
[jira] [Created] (ARROW-3861) ParquetDataset().read columns argument always returns partition column
Christian Thiel created ARROW-3861:
--
Summary: ParquetDataset().read columns argument always returns partition column
Key: ARROW-3861
URL: https://issues.apache.org/jira/browse/ARROW-3861
Project: Apache Arrow
Issue Type: Bug
Reporter: Christian Thiel

I just noticed that no matter which columns are specified when loading a dataset, the partition column is always returned. This can lead to surprising behaviour, as the resulting dataframe has more columns than expected:

{code:python}
import os
import shutil

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

table = pa.Table.from_pandas(df, schema=my_schema)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'])

df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(
    columns=['DPRD_ID', 'strings']).to_pandas()
# pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'],
#                 engine='pyarrow')
df_pq
{code}

{{df_pq}} has column {{partition_column}}.
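A minimal pure-Python sketch of the post-filtering one can apply as a workaround: keep only the columns that were actually requested, dropping the partition column the read hands back unasked. The helper is hypothetical, not a pyarrow call:

```python
def select_requested(columns_returned, columns_requested):
    """Subset of returned columns that were actually requested, in request order."""
    returned = set(columns_returned)
    return [c for c in columns_requested if c in returned]

# The dataset returns the partition column even though it was not requested:
returned = ['DPRD_ID', 'strings', 'partition_column']
requested = ['DPRD_ID', 'strings']
print(select_requested(returned, requested))  # ['DPRD_ID', 'strings']
```

With a pandas DataFrame this amounts to `df_pq[select_requested(df_pq.columns, requested)]` after the read.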
[jira] [Created] (ARROW-3766) pa.Table.from_pandas doesn't use schema ordering
Christian Thiel created ARROW-3766:
--
Summary: pa.Table.from_pandas doesn't use schema ordering
Key: ARROW-3766
URL: https://issues.apache.org/jira/browse/ARROW-3766
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Christian Thiel

PyArrow is sensitive to the order of the columns when loading partitioned files. With {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} object uses the column ordering of the pandas DataFrame rather than that of the schema. Furthermore, it is possible to have columns in the schema which are not in the DataFrame (and hence missing from the resulting {{pa.Table}}). This behaviour requires a lot of fiddling with the pandas frame in the first place if we want to write compatible partitioned files.

Hence I argue that for {{pa.Table.from_pandas}}, and any other comparable function, the schema should be the principal source of the Table structure, not the columns and ordering of the pandas DataFrame. If I specify a schema, I simply expect the resulting Table to actually have this schema.

Here is a little example. If you remove the reordering of df2, everything works fine:

{code:python}
import os
import shutil

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column == 0]
df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]

table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)

pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))

pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
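A pure-Python sketch of the alignment the reporter is asking {{from_pandas}} to perform: re-key the column data into schema order, padding fields that are missing from the frame. The helper and its dict-of-lists representation are hypothetical stand-ins for the pandas/pyarrow objects:

```python
def align_to_schema(data, schema_names):
    """Return data re-keyed in schema field order; missing fields become None columns.

    `data` is a dict of column name -> list of values (a stand-in for a
    DataFrame); insertion order of the result follows the schema, not the frame.
    """
    n_rows = len(next(iter(data.values()))) if data else 0
    return {name: data.get(name, [None] * n_rows) for name in schema_names}

# df2's column order differs from the schema; alignment restores schema order:
rows = {'strings': ['a', 'b'], 'partition_column': [1, 1], 'arrays': [[3], [4]]}
schema_order = ['DPRD_ID', 'partition_column', 'arrays', 'strings', 'new_column']
aligned = align_to_schema(rows, schema_order)
print(list(aligned))  # schema order, with DPRD_ID and new_column padded
```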