[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2659:
    Fix Version/s: 1.0.0

> [Python] More graceful reading of empty String columns in ParquetDataset
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.9.0
> Reporter: Uwe Korn
> Priority: Major
> Labels: dataset, dataset-parquet-read, parquet
> Fix For: 1.0.0
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
>
> Currently, when saving a {{ParquetDataset}} from Pandas, we don't get
> consistent schemas, even if the source was a single DataFrame. This is
> because object columns such as strings can become empty in some partitions,
> and the resulting Arrow schema then differs. In the central metadata, we
> store this column as {{pa.string}}, whereas in the partition file with the
> empty column, this column is stored as {{pa.null}}.
>
> The two schemas are still a valid match in terms of schema evolution and we
> should respect that in
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
>
> Instead of doing a {{pa.Schema.equals}} in
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
> we should introduce a new method {{pa.Schema.can_evolve_to}} that is more
> graceful and returns {{True}} if a dataset piece has a null column where the
> main metadata states a nullable column of any type.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
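The compatibility rule proposed in the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pyarrow implementation: {{pa.Schema.can_evolve_to}} is a proposal in this issue, and the sketch deliberately avoids pyarrow calls by modelling a schema as a plain dict. The names `Field`, `Schema`, and the standalone `can_evolve_to` function are hypothetical, type names are plain strings, and column order is ignored (an assumption the issue does not address).

```python
from typing import Dict, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for an Arrow field: a type name plus nullability."""
    type: str
    nullable: bool = True


# Assumption: a schema is modelled as a mapping from column name to Field.
Schema = Dict[str, Field]


def can_evolve_to(piece: Schema, dataset: Schema) -> bool:
    """Return True if the piece schema is compatible with the dataset schema.

    Both schemas must have the same column names. For each column the types
    must be equal, except that a "null" column in the piece is accepted
    wherever the dataset schema declares a nullable column of any type.
    """
    if piece.keys() != dataset.keys():
        return False
    for name, target in dataset.items():
        source = piece[name]
        if source.type == target.type:
            continue
        if source.type == "null" and target.nullable:
            continue
        return False
    return True


# Central metadata: "name" is a nullable string column.
dataset_schema = {"id": Field("int64"), "name": Field("string")}
# Partition where every "name" value was missing, so the column became null-typed.
empty_piece = {"id": Field("int64"), "name": Field("null")}
# Partition with a genuinely different type, which should still be rejected.
wrong_piece = {"id": Field("int64"), "name": Field("int32")}

print(can_evolve_to(empty_piece, dataset_schema))  # True
print(can_evolve_to(wrong_piece, dataset_schema))  # False
```

A strict equality check would reject `empty_piece` as well, which is the failure mode described above; the relaxed check only waives the type comparison for null-typed columns and leaves every other mismatch an error.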
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-2659:
    Labels: dataset dataset-parquet-read parquet (was: dataset parquet)

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.9.0
> Reporter: Uwe Korn
> Priority: Major
> Labels: dataset, dataset-parquet-read, parquet
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2659:
    Fix Version/s: (was: 0.16.0)

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.9.0
> Reporter: Uwe Korn
> Priority: Major
> Labels: dataset, parquet
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2659:
    Component/s: C++

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: dataset, parquet
> Fix For: 1.0.0
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2659:
    Fix Version/s: 0.15.0 (was: 0.14.0)

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: dataset, parquet
> Fix For: 0.15.0
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2659:
    Labels: dataset parquet (was: parquet)

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: dataset, parquet
> Fix For: 0.14.0
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2659:
    Fix Version/s: 0.14.0 (was: 0.13.0)

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: parquet
> Fix For: 0.14.0
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2659:
    Fix Version/s: 0.13.0 (was: 0.12.0)

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: parquet
> Fix For: 0.13.0
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2659:
    Labels: parquet (was: beginner)

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: parquet
> Fix For: 0.12.0
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2659:
    Fix Version/s: 0.12.0 (was: 0.11.0)

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: beginner
> Fix For: 0.12.0
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aldrin updated ARROW-2659:
    Attachment: read_parquet_dataset.error.read_table.txt
                read_parquet_dataset.error.read_table.novalidation.txt

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: beginner
> Fix For: 0.11.0
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
>              read_parquet_dataset.error.read_table.txt
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn updated ARROW-2659:
    Labels: beginner (was: )

> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: beginner
> Fix For: 0.11.0