[ https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072797#comment-17072797 ]
Joris Van den Bossche commented on ARROW-2860:
----------------------------------------------

So several things here:

- The original root cause of this issue ({{write_to_dataset}} outputting parquet files with different schemas) has already been solved a while ago in ARROW-2891.
- Of course, you can still end up with parquet files where one of the files has a null type by other means, and we should be able to read those:
  - We have no plans to fix this in the current ParquetDataset python implementation, but
  - it already works fine in the new Dataset API, and this can be used in ParquetDataset after ARROW-8039 is merged.
- So in the new Dataset API, basic schema normalization/evolution already works: nullable vs non-nullable, null -> any type promotion, addition/removal of a column, reordering the columns. For other types mentioned above, there are some open JIRAs (eg ARROW-8282 for ints, ARROW-8284 for timestamps).

> [Python][Parquet][C++] Null values in a single partition of Parquet dataset, results in invalid schema on read
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2860
>                 URL: https://issues.apache.org/jira/browse/ARROW-2860
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Sam Oluwalana
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> from datetime import datetime, timedelta
>
>
> def generate_data(event_type, event_id, offset=0):
>     """Generate data."""
>     now = datetime.utcnow() + timedelta(seconds=offset)
>     obj = {
>         'event_type': event_type,
>         'event_id': event_id,
>         'event_date': now.date(),
>         'foo': None,
>         'bar': u'hello',
>     }
>     if event_type == 2:
>         obj['foo'] = 1
>         obj['bar'] = u'world'
>     if event_type == 3:
>         obj['different'] = u'data'
>         obj['bar'] = u'event type 3'
>     else:
>         obj['different'] = None
>     return obj
>
>
> data = [
>     generate_data(1, 1, 1),
>     generate_data(1, 1, 3600 * 72),
>     generate_data(2, 1, 1),
>     generate_data(2, 1, 3600 * 72),
>     generate_data(3, 1, 1),
>     generate_data(3, 1, 3600 * 72),
> ]
>
> df = pd.DataFrame.from_records(data, index='event_id')
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/tmp/events',
>                     partition_cols=['event_type', 'event_date'])
>
> dataset = pq.ParquetDataset('/tmp/events')
> table = dataset.read()
> print(table.num_rows)
> {code}
> Expected output:
> {code:python}
> 6
> {code}
> Actual:
> {code:python}
> python example_failure.py
> Traceback (most recent call last):
>   File "example_failure.py", line 43, in <module>
>     dataset = pq.ParquetDataset('/tmp/events')
>   File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 745, in __init__
>     self.validate_schemas()
>   File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 775, in validate_schemas
>     dataset_schema))
> ValueError: Schema in partition[event_type=2, event_date=0] /tmp/events/event_type=3/event_date=2018-07-16 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
> bar: string
> different: string
> foo: double
> event_id: int64
> metadata
> --------
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
>
> vs
>
> bar: string
> different: null
> foo: double
> event_id: int64
> metadata
> --------
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> {code}
> Apparently what is happening is that pyarrow is interpreting the schema of each partition individually. The partitions for `event_type=3/event_date=*` both have values for the column `different`, whereas the other partitions do not. The discrepancy causes the `None` values of the other partitions to be labeled with `pandas_type` `empty` instead of `unicode`.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)