[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)
[ https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs updated ARROW-7617:
    Fix Version/s: 9.0.0 (was: 8.0.0)

> [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)
>
>                 Key: ARROW-7617
>                 URL: https://issues.apache.org/jira/browse/ARROW-7617
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Vladimir
>            Priority: Major
>              Labels: dataset, dataset-parquet-write, parquet
>             Fix For: 9.0.0
>
> Hello,
> it looks like, views with selection along categorical column are not properly respected.
> For the following dummy dataframe:
>
> {code:java}
> d = pd.date_range('1990-01-01', freq='D', periods=1)
> vals = pd.np.random.randn(len(d), 4)
> x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
> x['Year'] = x.index.year
> {code}
> The slice by Year is saved to partitioned parquet properly:
> {code:java}
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_a.parquet',
>                     partition_cols=['Year']){code}
> However, if we convert Year to pandas.Categorical - it will save the whole original dataframe, not only slice of Year=1990:
> {code:java}
> x['Year'] = x['Year'].astype('category')
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_b.parquet',
>                     partition_cols=['Year'])
> {code}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
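The issue title points at non-observed categories as the trigger. The following is a minimal sketch of that mechanism on the pandas side only, assuming a small hypothetical three-row frame rather than the reporter's data: filtering a categorical column keeps every original category, and `remove_unused_categories()` trims them. Whether trimming fully avoids the extra partitions in a given pyarrow version is an assumption, not something the report confirms.

```python
import pandas as pd

# Hypothetical data spanning three years (not the reporter's frame).
x = pd.DataFrame({"A": [0.1, 0.2, 0.3], "Year": [1990, 1991, 1992]})
x["Year"] = x["Year"].astype("category")

# Selecting one year keeps a single row, but the categorical dtype
# still carries all three categories.
subset = x[x["Year"] == 1990]
categories_before = list(subset["Year"].cat.categories)

# Drop the categories that no longer occur in the selected rows.
trimmed = subset.assign(Year=subset["Year"].cat.remove_unused_categories())
categories_after = list(trimmed["Year"].cat.categories)

print(categories_before)
print(categories_after)
```

Passing `trimmed` instead of `subset` to `pa.Table.from_pandas` before calling `pq.write_to_dataset` is a possible workaround under that assumption, since the resulting dictionary column would then list only the observed value.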
[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)
[ https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Molina updated ARROW-7617:
    Fix Version/s: 8.0.0 (was: 7.0.0)

>             Fix For: 8.0.0
[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)
[ https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Molina updated ARROW-7617:
    Fix Version/s: 7.0.0 (was: 6.0.0)

>             Fix For: 7.0.0
[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)
[ https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-7617:
    Fix Version/s: 6.0.0 (was: 5.0.0)

>             Fix For: 6.0.0
[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)
[ https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-7617:
    Fix Version/s: 5.0.0 (was: 4.0.0)

>            Assignee: Andrew Wieteska
>             Fix For: 5.0.0
[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)
[ https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wieteska updated ARROW-7617:
    Fix Version/s: 4.0.0 (was: 3.0.0)

>            Assignee: Andrew Wieteska
>             Fix For: 4.0.0
[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)
[ https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wieteska updated ARROW-7617:
    Fix Version/s: 3.0.0

>            Assignee: Andrew Wieteska
>             Fix For: 3.0.0
[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)
[ https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-7617:
    Labels: dataset dataset-parquet-write parquet (was: dataset parquet)

>            Assignee: Andrew Wieteska
>              Labels: dataset, dataset-parquet-write, parquet