[ https://issues.apache.org/jira/browse/ARROW-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835519#comment-16835519 ]
Joris Van den Bossche edited comment on ARROW-3210 at 5/8/19 11:34 AM:
-----------------------------------------------------------------------

This has been fixed in ARROW-2891 (ensuring that the schema from the full table is used for each partition)

was (Author: jorisvandenbossche):
This has been fixed by ARROW-2891 (ensuring that the schema from the full table is used for each partition)

> [Python] Creating ParquetDataset creates partitioned ParquetFiles with
> mismatched Parquet schemas
> -------------------------------------------------------------------------------------------------
>
> Key: ARROW-3210
> URL: https://issues.apache.org/jira/browse/ARROW-3210
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Ubuntu 16.04 LTS, System76 Oryx Pro
> Reporter: Ying Wang
> Priority: Major
> Labels: parquet
> Fix For: 0.14.0
>
> Attachments: environment.yml, repro.csv, repro.py, repro_2.py
>
>
> STEPS TO REPRODUCE:
> 1. Create a conda environment reflecting [^environment.yml]
> 2. Execute script [^repro.py], replacing various config variables to create a
> ParquetDataset on S3 given [^repro.csv]
> 3. Create reference of ParquetDataset using script [^repro_2.py], again
> replacing various config variables.
>
> EXPECTED:
> Reference is created correctly.
> GOT:
> Mismatched Arrow schemas in validate_schemas() method:
>
> ```python
> *** ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1,
> Heading=1]
> s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC
> RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet
> was different.
> Record_ID: int64
> y: double
> TRACKID: string
> MMSI: int64
> IMO: int64
> AgeMinutes: double
> SoG: double
> Width: int64
> Length: int64
> Callsign: string
> Destination: string
> ETA: int64
> Status: string
> ExtraInfo: string
> TIMESTAMP: int64
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes":
> [{"na'
> b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
> b'type": "object", "metadata": \{"encoding": "UTF-8"}}], "columns":'
> b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
> b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
> b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
> b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
> b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
> b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
> b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
> b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
> b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
> b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
> b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
> b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
> b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
> b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
> b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
> b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
> b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
> b'data": null}, {"name": "Destination", "field_name": "Destination'
> b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
> b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
> b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
> b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
> b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
> b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"'
> b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST'
> b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":'
> b' null}, {"name": null, "field_name": "__index_level_0__", "panda'
> b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
> b'ndas_version": "0.21.0"}'}
> vs
> Record_ID: int64
> y: double
> TRACKID: string
> MMSI: int64
> IMO: int64
> AgeMinutes: double
> SoG: double
> Width: int64
> Length: int64
> Callsign: string
> Destination: string
> ETA: int64
> Status: string
> ExtraInfo: null
> TIMESTAMP: int64
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes":
> [{"na'
> b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
> b'type": "object", "metadata": \{"encoding": "UTF-8"}}], "columns":'
> b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
> b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
> b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
> b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
> b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
> b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
> b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
> b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
> b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
> b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
> b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
> b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
> b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
> b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
> b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
> b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
> b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
> b'data": null}, {"name": "Destination", "field_name": "Destination'
> b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
> b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
> b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
> b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
> b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
> b'": "ExtraInfo", "pandas_type": "empty", "numpy_type": "object", '
> b'"metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMESTAM'
> b'P", "pandas_type": "int64", "numpy_type": "int64", "metadata": n'
> b'ull}, {"name": null, "field_name": "__index_level_0__", "pandas_'
> b'type": "int64", "numpy_type": "int64", "metadata": null}], "pand'
> b'as_version": "0.21.0"}'}
> ```
> The issue is with column *ExtraInfo*, where *pandas_type* is *unicode* in a
> partitioned ParquetDatasetPiece referencing the 2nd Parquet file created,
> while the ParquetDataset schema referencing the 1st Parquet file created has
> *pandas_type* *empty* for that same column.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
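
The mismatch reported for ExtraInfo (null in one file, string in another) is what per-partition type inference produces when one partition holds only null values in a column. The sketch below models that in plain Python, without pyarrow. The helper names (`infer_type`, `infer_schema`, `partition_rows`), the simplified type names, and the sample rows are all made up for illustration; they are neither the reporter's data nor Arrow's implementation. It also shows the idea behind the fix referenced in ARROW-2891: derive the schema once from the full table and reuse it for every partition.

```python
from collections import defaultdict


def infer_type(values):
    """Infer a simplified column type from the non-null values only."""
    kinds = {type(v).__name__ for v in values if v is not None}
    if not kinds:
        return "null"  # nothing to infer from: every value is None
    if kinds == {"str"}:
        return "string"
    if kinds == {"int"}:
        return "int64"
    if kinds <= {"int", "float"}:
        return "double"
    raise TypeError(f"unsupported mix of value types: {sorted(kinds)}")


def infer_schema(rows, columns):
    """Map each column name to the type inferred from the given rows."""
    return {col: infer_type([row[col] for row in rows]) for col in columns}


def partition_rows(rows, key):
    """Group rows by the value of one partition column."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


# Made-up sample rows: VESSEL_A never has a value in ExtraInfo.
rows = [
    {"Name": "VESSEL_A", "Status": "underway", "ExtraInfo": None},
    {"Name": "VESSEL_A", "Status": "moored", "ExtraInfo": None},
    {"Name": "VESSEL_B", "Status": "moored", "ExtraInfo": "some note"},
]
data_columns = ["Status", "ExtraInfo"]
partitions = partition_rows(rows, "Name")

# Problem: inferring a schema separately for each partition.
per_partition = {
    name: infer_schema(part, data_columns) for name, part in partitions.items()
}

# Idea of the fix: infer once from the full table, reuse for every partition.
full_schema = infer_schema(rows, data_columns)
fixed = {name: dict(full_schema) for name in partitions}

for name in partitions:
    print(name, "per-partition:", per_partition[name], "fixed:", fixed[name])
```

With per-partition inference the two partitions disagree on ExtraInfo, which is the kind of difference a schema validation step would reject; with the shared schema they agree by construction.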