[ https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reopened ARROW-7617:
------------------------------------------

> [Python] Slices of Dataframes with Categorical columns are not respected in
> write_to_dataset
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7617
>                 URL: https://issues.apache.org/jira/browse/ARROW-7617
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Vladimir
>            Priority: Major
>
> Hello,
> It looks like slices selected on a categorical column are not properly
> respected by write_to_dataset.
> For the following dummy dataframe:
>
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> d = pd.date_range('1990-01-01', freq='D', periods=10000)
> vals = np.random.randn(len(d), 4)
> x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
> x['Year'] = x.index.year
> {code}
> The slice by Year is saved to a partitioned parquet dataset properly:
> {code:java}
> # Year is a plain integer column here
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_a.parquet',
>                     partition_cols=['Year'])
> {code}
> However, if we convert Year to pandas.Categorical, the whole original
> dataframe is saved, not only the slice for Year == 1990:
> {code:java}
> # Same slice, but Year is now a categorical column
> x['Year'] = x['Year'].astype('category')
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_b.parquet',
>                     partition_cols=['Year'])
> {code}
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
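
One thing that may help when narrowing this down: a filtered pandas categorical column still carries every original category, even those with no remaining rows. The pandas-only sketch below illustrates that on a smaller frame than the one in the report. It does not call pyarrow, and whether write_to_dataset groups on the partition column in this way internally is an assumption here, not something the report confirms. The names `sliced`, `sliced_clean` and the `groups_*` variables are illustrative only.

```python
import numpy as np
import pandas as pd

# Smaller version of the frame from the report: 1000 days starting 1990-01-01.
d = pd.date_range("1990-01-01", freq="D", periods=1000)
x = pd.DataFrame(np.random.randn(len(d), 2), index=d, columns=["A", "B"])
x["Year"] = x.index.year
x["Year"] = x["Year"].astype("category")

sliced = x[x.Year == 1990]

# The rows of the slice only contain 1990 ...
years_in_rows = sorted(sliced["Year"].astype(int).unique())

# ... but the categorical dtype still lists every year of the original frame.
years_in_categories = list(sliced["Year"].cat.categories)

# Grouping on the categorical column with observed=False yields one group per
# category, including the years that have no rows in the slice.
groups_all = sliced.groupby("Year", observed=False).size()
groups_observed = sliced.groupby("Year", observed=True).size()

# Possible workaround (an assumption, not a confirmed fix): drop the unused
# categories before handing the frame to pyarrow.
sliced_clean = sliced.assign(Year=sliced["Year"].cat.remove_unused_categories())
years_after_cleanup = list(sliced_clean["Year"].cat.categories)

print(years_in_rows, years_in_categories, years_after_cleanup)
print(len(groups_all), len(groups_observed))
```

If the unused categories turn out to be involved, passing `sliced_clean` (or a frame with Year cast back to an integer column) to `pa.Table.from_pandas` would be one way to check whether only the Year=1990 partition is then written.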