Vladimir created ARROW-7617:
-------------------------------

             Summary: [Python] Slices of Dataframes with Categorical columns are not respected in write_to_dataset
                 Key: ARROW-7617
                 URL: https://issues.apache.org/jira/browse/ARROW-7617
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
            Reporter: Vladimir

Hello,

It looks like slices (views) of a DataFrame selected on a categorical column are not respected by write_to_dataset. For the following dummy dataframe:

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)  # np used directly; pd.np is deprecated
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
{code}

the slice by Year is saved to a partitioned parquet dataset properly:

{code:python}
table = pa.Table.from_pandas(x[x.Year == 1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_a.parquet', partition_cols=['Year'],
                    use_dictionary=True, compression='snappy')
{code}

However, if we convert Year to pandas.Categorical, the whole original dataframe is saved, not only the Year == 1990 slice:

{code:python}
x['Year'] = x['Year'].astype('category')
table = pa.Table.from_pandas(x[x.Year == 1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_b.parquet', partition_cols=['Year'],
                    use_dictionary=True, compression='snappy')
{code}
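
One detail that may be relevant (an assumption, not something confirmed in this report): after boolean slicing, a pandas categorical column keeps every category of the original column in its dtype, even though only the 1990 rows remain. The sketch below uses pandas only and shows that, together with a possible workaround of dropping unused categories before building the Arrow table. The names `sliced` and `cleaned` are illustrative and do not come from the original snippet.

```python
import numpy as np
import pandas as pd

# Same dummy frame as in the report, with Year already categorical.
d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
x['Year'] = x['Year'].astype('category')

# Boolean slice on the categorical column, as in the failing case.
sliced = x[x.Year == 1990]

# Only the 1990 rows are present in the slice...
n_rows = len(sliced)
# ...but the categorical dtype still lists every year of the original frame.
n_categories_before = len(sliced['Year'].cat.categories)

# Possible workaround (assumption, not a confirmed fix for this issue):
# drop the categories that no longer occur in the slice.
cleaned = sliced.copy()
cleaned['Year'] = cleaned['Year'].cat.remove_unused_categories()
n_categories_after = len(cleaned['Year'].cat.categories)

print(n_rows, n_categories_before, n_categories_after)
```

The `cleaned` frame could then be passed to `pa.Table.from_pandas(cleaned, preserve_index=False)` in place of the raw slice. Whether this changes what `write_to_dataset` writes on 0.15.1 is not verified here.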