[ https://issues.apache.org/jira/browse/PARQUET-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330923#comment-17330923 ]
ASF GitHub Bot commented on PARQUET-2027:
-----------------------------------------

shangxinli merged pull request #896:
URL: https://github.com/apache/parquet-mr/pull/896

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Merging parquet files created in 1.11.1 not possible using 1.12.0
> ------------------------------------------------------------------
>
>                 Key: PARQUET-2027
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2027
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Matthew M
>            Assignee: Gabor Szadovszky
>            Priority: Major
>
> I have parquet files created using 1.11.1. In the process I join two files (with the same schema) into one output file. I create the Hadoop writer:
> {code:scala}
> val hadoopWriter = new ParquetFileWriter(
>   HadoopOutputFile.fromPath(
>     new Path(outputPath.toString),
>     new Configuration()
>   ),
>   outputSchema,
>   Mode.OVERWRITE,
>   8 * 1024 * 1024,
>   2097152,
>   DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
>   DEFAULT_STATISTICS_TRUNCATE_LENGTH,
>   DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
> )
> hadoopWriter.start()
> {code}
> and try to append one file into the other:
> {code:scala}
> hadoopWriter.appendFile(HadoopInputFile.fromPath(new Path(file), new Configuration()))
> {code}
> Everything works on 1.11.1, but after switching to 1.12.0 it fails with this error:
> {code}
> STDERR: Exception in thread "main" java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data!
> Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
> 	at org.apache.parquet.format.Util.read(Util.java:365)
> 	at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
> 	at org.apache.parquet.format.Util.readPageHeader(Util.java:127)
> 	at org.apache.parquet.hadoop.Offsets.readDictionaryPageSize(Offsets.java:75)
> 	at org.apache.parquet.hadoop.Offsets.getOffsets(Offsets.java:58)
> 	at org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroup(ParquetFileWriter.java:998)
> 	at org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroups(ParquetFileWriter.java:918)
> 	at org.apache.parquet.hadoop.ParquetFileReader.appendTo(ParquetFileReader.java:888)
> 	at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:895)
> 	at [...]
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data!
> Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
> 	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108)
> 	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
> 	at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
> 	at org.apache.parquet.format.Util.read(Util.java:362)
> 	... 14 more
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
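For context, the reporter's snippet opens the writer and appends, but stops before the footer is written with `end()`. A minimal end-to-end sketch of the append-based merge described above, not a definitive implementation: `mergeParquetFiles`, `inputPaths`, `outputPath`, and `outputSchema` are hypothetical names for caller-supplied values, the size constants mirror the snippet, and the defaults are assumed to come from `ParquetProperties` (parquet-mr 1.12.x):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.column.ParquetProperties
import org.apache.parquet.hadoop.ParquetFileWriter
import org.apache.parquet.hadoop.ParquetFileWriter.Mode
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}
import org.apache.parquet.schema.MessageType
import scala.jdk.CollectionConverters._

// Hypothetical helper: merge several parquet files (sharing one schema)
// into a single output file by appending whole row groups.
def mergeParquetFiles(inputPaths: Seq[String],
                      outputPath: String,
                      outputSchema: MessageType): Unit = {
  val conf = new Configuration()
  val writer = new ParquetFileWriter(
    HadoopOutputFile.fromPath(new Path(outputPath), conf),
    outputSchema,
    Mode.OVERWRITE,
    8L * 1024 * 1024,  // row group size, as in the report
    2097152,           // max padding size, as in the report
    ParquetProperties.DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
    ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH,
    ParquetProperties.DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
  )
  writer.start()
  // appendFile copies each input's row groups without re-encoding pages;
  // this is the call that fails in 1.12.0 on files written by 1.11.1.
  inputPaths.foreach { p =>
    writer.appendFile(HadoopInputFile.fromPath(new Path(p), conf))
  }
  // end() writes the footer; without it the output file is unreadable.
  writer.end(Map.empty[String, String].asJava)
}
```

The appended row groups keep their original encoding, so this merge is cheap but requires every input to match `outputSchema` exactly.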