[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2020-04-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Fix Version/s: 1.0.0

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.9.0
> Reporter: Uwe Korn
> Priority: Major
> Labels: dataset, dataset-parquet-read, parquet
> Fix For: 1.0.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt
>
>
> When saving a {{ParquetDataset}} from Pandas, we currently don't get 
> consistent schemas, even if the source was a single DataFrame. This is 
> because object columns such as strings can become empty in some partitions, 
> so the resulting Arrow schema differs. In the central metadata, we 
> store this column as {{pa.string}}, whereas in the partition file with the 
> empty column, it is stored as {{pa.null}}.
> The two schemas are still a valid match in terms of schema evolution, and we 
> should respect that in 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
> Instead of calling {{pa.Schema.equals}} in 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
> we should introduce a new method {{pa.Schema.can_evolve_to}} that is more 
> graceful and returns {{True}} if a dataset piece has a null column where the 
> main metadata states a nullable column of any type.
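
A rough sketch of the compatibility rule proposed above, for illustration only. It does not use pyarrow: Field, Schema, and this can_evolve_to are hypothetical stand-ins that model a schema as a mapping from column name to (type name, nullable), and the column-order check is an assumption of the sketch, not something the issue specifies.

```python
from typing import Dict, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for an Arrow field: a type name plus nullability."""
    type: str
    nullable: bool = True


# Column name -> field. A simplified model, not pa.Schema.
Schema = Dict[str, Field]


def can_evolve_to(piece: Schema, target: Schema) -> bool:
    """Return True if `piece` matches `target`, treating a "null"-typed
    column in `piece` as compatible with any nullable column in `target`."""
    if list(piece) != list(target):
        return False  # different column names or column order (assumed strict)
    for name, field in piece.items():
        expected = target[name]
        if field == expected:
            continue  # identical type and nullability
        if field.type == "null" and expected.nullable:
            continue  # empty column in the piece, nullable column in the metadata
        return False
    return True


metadata = {"id": Field("int64"), "name": Field("string")}
empty_piece = {"id": Field("int64"), "name": Field("null")}
wrong_piece = {"id": Field("int64"), "name": Field("double")}

print(empty_piece == metadata)               # strict equality rejects the piece
print(can_evolve_to(empty_piece, metadata))  # the relaxed check accepts it
print(can_evolve_to(wrong_piece, metadata))  # a real type mismatch is still rejected
```

The relaxation is deliberately one-directional: a null column in a piece may stand in for a nullable column in the main metadata, but not the other way around.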



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2020-03-12 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-2659:
Labels: dataset dataset-parquet-read parquet  (was: dataset parquet)

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.9.0
> Reporter: Uwe Korn
> Priority: Major
> Labels: dataset, dataset-parquet-read, parquet
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2020-01-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Fix Version/s: (was: 0.16.0)

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.9.0
> Reporter: Uwe Korn
> Priority: Major
> Labels: dataset, parquet
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Component/s: C++

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: dataset, parquet
> Fix For: 1.0.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2019-06-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Fix Version/s: 0.15.0  (was: 0.14.0)

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: dataset, parquet
> Fix For: 0.15.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2019-06-12 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Labels: dataset parquet  (was: parquet)

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: dataset, parquet
> Fix For: 0.14.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2019-03-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Fix Version/s: 0.14.0  (was: 0.13.0)

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: parquet
> Fix For: 0.14.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2019-01-09 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Fix Version/s: 0.13.0  (was: 0.12.0)

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: parquet
> Fix For: 0.13.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Labels: parquet  (was: beginner)

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: parquet
> Fix For: 0.12.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2018-09-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Fix Version/s: 0.12.0  (was: 0.11.0)

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: beginner
> Fix For: 0.12.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2018-06-01 Thread Aldrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldrin updated ARROW-2659:
Attachment: read_parquet_dataset.error.read_table.txt
read_parquet_dataset.error.read_table.novalidation.txt

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: beginner
> Fix For: 0.11.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2018-06-01 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-2659:
Labels: beginner  (was: )

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: beginner
> Fix For: 0.11.0