[jira] [Created] (ARROW-7617) [Python] Slices of Dataframes with Categorical columns are not respected in write_to_dataset

Vladimir (Jira) Mon, 20 Jan 2020 06:17:13 -0800

Vladimir created ARROW-7617:
-------------------------------

             Summary: [Python] Slices of Dataframes with Categorical columns 
are not respected in write_to_dataset
                 Key: ARROW-7617
                 URL: https://issues.apache.org/jira/browse/ARROW-7617
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
            Reporter: Vladimir



Hello,

It looks like slices of a DataFrame that select rows by a categorical column
are not properly respected by write_to_dataset.

For the following dummy dataframe:

 
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
{code}
The slice by Year is saved to partitioned parquet properly:
{code:python}
table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_a.parquet', partition_cols=['Year'],
                    use_dictionary=True, compression='snappy')
{code}
However, if we convert Year to pandas.Categorical, the whole original
dataframe is saved, not only the Year=1990 slice:
{code:python}
x['Year'] = x['Year'].astype('category')

table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_b.parquet', partition_cols=['Year'],
                    use_dictionary=True, compression='snappy')
{code}
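For triage, the sketch below may help separate the pandas side from the pyarrow side. It rebuilds the dummy dataframe above and checks the slice using pandas only: the slice holds just the 1990 rows, but its categorical dtype still lists every year of the original frame. Whether those leftover categories are what write_to_dataset trips over is an assumption, not something confirmed here. Calling remove_unused_categories() before pa.Table.from_pandas is likewise only a possible workaround to try, and the names sliced and trimmed are made up for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the dummy frame from the report, with Year as a categorical column.
d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
x['Year'] = x['Year'].astype('category')

# The pandas slice itself contains only the rows for 1990 ...
sliced = x[x.Year == 1990]
n_rows = len(sliced)

# ... but its categorical dtype still carries every year of the original frame.
n_categories_before = len(sliced['Year'].cat.categories)

# Possible workaround (an assumption, not verified against pyarrow 0.15.1):
# drop the unused categories before passing the slice to pa.Table.from_pandas.
trimmed = sliced.assign(Year=sliced['Year'].cat.remove_unused_categories())
n_categories_after = len(trimmed['Year'].cat.categories)

print(n_rows, n_categories_before, n_categories_after)
```

If trimmed writes only the Year=1990 partition while sliced does not, that would point at how unused categories are handled during partitioning rather than at the row selection itself.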
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
