[jira] [Created] (ARROW-4538) pa.Table.from_pandas() with df.index.name != None breaks write_to_dataset()

2019-02-12 Thread Christian Thiel (JIRA)
Christian Thiel created ARROW-4538:
--

 Summary: pa.Table.from_pandas() with df.index.name != None breaks 
write_to_dataset()
 Key: ARROW-4538
 URL: https://issues.apache.org/jira/browse/ARROW-4538
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.12.0
Reporter: Christian Thiel


When using {{pa.Table.from_pandas()}} with {{preserve_index=True}} and 
{{dataframe.index.name != None}}, the prefix {{__index_level_}} is not added to the 
respective schema name. This breaks {{write_to_dataset}} with active partition 
columns.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import os
import shutil
import pandas as pd
import numpy as np

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df['arrays'] = pd.Series(arrays)

df.index.name = 'ID'

table = pa.Table.from_pandas(df, preserve_index=True)
print(table.schema.names)

pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'],
                    preserve_index=True)
{code}
Removing {{df.index.name='ID'}} avoids the error, as does disabling 
{{partition_cols}} in {{write_to_dataset}}.
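A possible workaround, sketched below under the assumption that only the index name triggers the mismatch (my own sketch, not part of the report and not verified against 0.12.0): promote the named index to an ordinary column and then call {{from_pandas}} with {{preserve_index=False}}, so the {{__index_level_}} naming logic is never involved.

```python
import pandas as pd

# Sketch: reset the named index to a regular column before conversion,
# so pa.Table.from_pandas never has to name an index field.
df = pd.DataFrame({'partition_column': [0, 0, 1, 1]})
df.index.name = 'ID'

df_flat = df.reset_index()  # 'ID' becomes a regular column
assert 'ID' in df_flat.columns
assert df_flat.index.name is None
# then: pa.Table.from_pandas(df_flat, preserve_index=False)
```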



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3861) ParquetDataset().read columns argument always returns partition column

2018-11-23 Thread Christian Thiel (JIRA)
Christian Thiel created ARROW-3861:
--

 Summary: ParquetDataset().read columns argument always returns 
partition column
 Key: ARROW-3861
 URL: https://issues.apache.org/jira/browse/ARROW-3861
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Christian Thiel


I just noticed that no matter which columns are specified when reading a dataset, 
the partition column is always returned. This can lead to surprising behaviour, 
as the resulting dataframe has more columns than expected:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

table = pa.Table.from_pandas(df, schema=my_schema)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'])

df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(
    columns=['DPRD_ID', 'strings']).to_pandas()
# pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'],
#                 engine='pyarrow')
df_pq
{code}
{{df_pq}} has the column {{partition_column}} even though it was not requested.
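Until this is fixed, a simple post-read cleanup (my own sketch, not an official fix) is to keep only the columns that were actually requested:

```python
import pandas as pd

# Sketch: select only the requested columns after reading.
# df_pq below stands in for the frame returned by ParquetDataset.read().
requested = ['DPRD_ID', 'strings']
df_pq = pd.DataFrame({'DPRD_ID': [2, 3],
                      'strings': ['a', 'b'],
                      'partition_column': [1, 1]})

df_clean = df_pq[[c for c in requested if c in df_pq.columns]]
assert 'partition_column' not in df_clean.columns
```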





[jira] [Created] (ARROW-3766) pa.Table.from_pandas doesn't use schema ordering

2018-11-12 Thread Christian Thiel (JIRA)
Christian Thiel created ARROW-3766:
--

 Summary: pa.Table.from_pandas doesn't use schema ordering
 Key: ARROW-3766
 URL: https://issues.apache.org/jira/browse/ARROW-3766
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Christian Thiel


Pyarrow is sensitive to the order of the columns when loading partitioned files.
With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can 
apply a schema to a dataframe. I noticed that the returned {{pa.Table}} object 
uses the column ordering of the pandas DataFrame rather than that of the schema. 
Furthermore, it is possible for the schema to contain columns which are missing 
from the DataFrame (and hence from the resulting pa.Table).

This behaviour requires a lot of fiddling with the pandas DataFrame if we want 
to write compatible partitioned files. Hence I argue that for 
{{pa.Table.from_pandas}}, and any other comparable function, the schema should 
be the principal source of the Table structure, not the columns and ordering 
of the pandas DataFrame. If I specify a schema, I simply expect the resulting 
Table to actually have that schema.

Here is a little example. If you remove the reordering of df2, everything works 
fine:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column==0]
df2 = df[df.partition_column==1][['strings', 'partition_column', 'arrays']]


table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)

pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))

pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
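The fiddling described above can be sketched as follows (my own workaround sketch, not verified against this pyarrow version): align the DataFrame's columns with the schema order before calling {{from_pandas}}, inserting any schema columns missing from the frame as all-null.

```python
import pandas as pd

# Sketch: enforce a target column order on a DataFrame, adding
# missing columns as all-null, before pa.Table.from_pandas.
# schema_names mirrors the non-index fields of my_schema above.
schema_names = ['partition_column', 'arrays', 'strings', 'new_column']
df2 = pd.DataFrame({'strings': ['a', 'b'],
                    'partition_column': [1, 1],
                    'arrays': [None, None]})

for name in schema_names:
    if name not in df2.columns:
        df2[name] = None          # column in schema but not in the frame
df2_aligned = df2[schema_names]   # enforce the schema ordering

assert list(df2_aligned.columns) == schema_names
```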



