[ https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072797#comment-17072797 ]
Joris Van den Bossche commented on ARROW-2860:
----------------------------------------------

So several things here:

- The original root cause of this issue ({{write_to_dataset}} outputting parquet files with different schemas) has already been solved a while ago in ARROW-2891.
- Of course, you can still end up with parquet files where one of the files has a null type by other means, and we should be able to read those:
  - We have no plans to fix this in the current ParquetDataset python implementation, but
  - it already works fine in the new Dataset API, and this can be used in ParquetDataset after ARROW-8039 is merged.
- So in the new Dataset API, basic schema normalization/evolution already works: nullable vs non-nullable, null -> any type promotion, addition/removal of a column, reordering the columns. For other types mentioned above, there are some open JIRAs (eg ARROW-8282 for ints, ARROW-8284 for timestamps).

> [Python][Parquet][C++] Null values in a single partition of Parquet dataset, results in invalid schema on read
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2860
>                 URL: https://issues.apache.org/jira/browse/ARROW-2860
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Sam Oluwalana
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> from datetime import datetime, timedelta
>
>
> def generate_data(event_type, event_id, offset=0):
>     """Generate data."""
>     now = datetime.utcnow() + timedelta(seconds=offset)
>     obj = {
>         'event_type': event_type,
>         'event_id': event_id,
>         'event_date': now.date(),
>         'foo': None,
>         'bar': u'hello',
>     }
>     if event_type == 2:
>         obj['foo'] = 1
>         obj['bar'] = u'world'
>     if event_type == 3:
>         obj['different'] = u'data'
>         obj['bar'] = u'event type 3'
>     else:
>         obj['different'] = None
>     return obj
>
>
> data = [
>     generate_data(1, 1, 1),
>     generate_data(1, 1, 3600 * 72),
>     generate_data(2, 1, 1),
>     generate_data(2, 1, 3600 * 72),
>     generate_data(3, 1, 1),
>     generate_data(3, 1, 3600 * 72),
> ]
>
> df = pd.DataFrame.from_records(data, index='event_id')
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/tmp/events',
>                     partition_cols=['event_type', 'event_date'])
>
> dataset = pq.ParquetDataset('/tmp/events')
> table = dataset.read()
> print(table.num_rows)
> {code}
> Expected output:
> {code:python}
> 6
> {code}
> Actual:
> {code:python}
> python example_failure.py
> Traceback (most recent call last):
>   File "example_failure.py", line 43, in <module>
>     dataset = pq.ParquetDataset('/tmp/events')
>   File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 745, in __init__
>     self.validate_schemas()
>   File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 775, in validate_schemas
>     dataset_schema))
> ValueError: Schema in partition[event_type=2, event_date=0] /tmp/events/event_type=3/event_date=2018-07-16 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
> bar: string
> different: string
> foo: double
> event_id: int64
> metadata
> --------
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
>
> vs
>
> bar: string
> different: null
> foo: double
> event_id: int64
> metadata
> --------
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> {code}
> Apparently what is happening is that pyarrow is interpreting the schema of each partition individually. The partitions for `event_type=3/event_date=*` both have values for the column `different`, whereas the other partitions do not. The discrepancy causes the `None` values of the other partitions to be labeled with `pandas_type` `empty` instead of `unicode`.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)