[ https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479790#comment-17479790 ]
nero commented on ARROW-14422:
------------------------------

Hi there,

I face a related issue when writing a Parquet file with PyArrow. Older versions of Hive only recognize timestamps stored as INT96, so I use table.write_to_data with the `use_deprecated_int96_timestamps=True` option to save the Parquet file. However, Hive SQL skips the timestamp conversion when the created_by field in the Parquet file metadata is not "parquet-mr".

[hive/ParquetRecordReaderBase.java at f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139]

So I have to save the timestamp columns with timezone info. But when pyarrow.parquet reads from a directory that contains Parquet files created by both PyArrow and parquet-mr, the resulting Arrow Table ignores the timezone info for the parquet-mr files.

Could PyArrow expose the created_by option? Or could it handle timezone-aware timestamps in files created by parquet-mr?

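To illustrate the two created_by checks that come up in this thread, here is a rough sketch in plain Python. It rests on assumptions: the pattern is a reconstruction of the format string printed in the VersionParseException quoted in the issue description, read as having literal parentheses around the build part, and `hive_skips_timestamp_conversion` is a hypothetical paraphrase of the linked Hive lines as described in the comment above, not the actual Java code. The function names and the "parquet-mr ..." sample string are made up for illustration.

```python
import re

# Assumed reconstruction of the format shown in the exception message:
# "(build ?(.*))" is read here as a literal "(build ...)" suffix.
CREATED_BY_FORMAT = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")


def parse_created_by(created_by):
    """Return (application, version, build), or None if the string does not fit."""
    match = CREATED_BY_FORMAT.fullmatch(created_by)
    if match is None:
        return None
    return match.group(1), match.group(3), match.group(4)


def hive_skips_timestamp_conversion(created_by):
    """Hypothetical paraphrase: conversion is skipped unless parquet-mr wrote the file."""
    return not (created_by or "").startswith("parquet-mr")


# created_by string taken from the stack trace in the issue description.
arrow_style = "parquet-cpp version 1.5.1-SNAPSHOT"
# Made-up string in the shape the pattern expects.
mr_style = "parquet-mr version 1.8.1 (build abc123)"

for value in (arrow_style, mr_style):
    print(value)
    print("  parsed:", parse_created_by(value))
    print("  skip timestamp conversion:", hive_skips_timestamp_conversion(value))
```

Under these assumptions the first string has no "(build ...)" suffix, so it does not fit the pattern, which would be consistent with the parse error in the trace. It also does not start with "parquet-mr", which is the situation the comment describes for INT96 timestamps.
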
> [Python] Allow parquet::WriterProperties::created_by to be set via pyarrow.ParquetWriter for compatibility with older parquet-mr
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14422
>                 URL: https://issues.apache.org/jira/browse/ARROW-14422
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Kevin
>            Priority: Major
>
> I have a couple of files and am using pyarrow.table (0.17)
> to save them as parquet on disk (parquet version 1.4).
> columns
> id : string
> val : string
> *table = pa.Table.from_pandas(df)*
> *pq.write_table(table, "df.parquet", version='1.0', flavor='spark', write_statistics=True, )*
> However, Hive and Spark do not recognize the parquet version:
> {{org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version ((.*) )?(build ?(.*))}}
> {{ at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
> {{ at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
> {{ at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
>
> +*It seems related to this issue:*+
> It appears you've encountered PARQUET-349 which was fixed in 2015 before
> Arrow was even started. The underlying C++ code does allow this
> {{created_by}} field to be customized
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> but the python wrapper does not expose this
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
>
>
> *+EDIT Add infos+*
> The current Python wrapper does NOT expose the created_by builder (when writing parquet on disk):
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361]
>
> But this is available in the C++ version:
> [https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320]
>
> This creates an issue when the Hadoop parquet reader reads this pyarrow parquet file:
>
>
> +*SO Question here:*+
>
> [https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)