[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405183#comment-17405183 ]
Nemon Lou edited comment on PARQUET-2078 at 8/26/21, 12:13 PM: --------------------------------------------------------------- Root cause analysis: the file offset added in parquet 1.12.0 can go wrong under certain conditions.
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L580
When ParquetRecordReader is used with blocks passed in, the wrongly set file offset causes some blocks to be filtered out due to an offset mismatch.
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1264
When can the file offset go wrong? Here are writer debug logs to clarify it.
{noformat}
testing : currentChunkFirstDataPage offset4
testing : currentChunkFirstDataPage offset12243647
testing : currentChunkFirstDataPage offset42848491
testing : currentChunkDictionaryPageOffset offset54810911
testing : currentChunkFirstDataPage offset54868535
testing : currentChunkFirstDataPage offset57421932
testing : currentChunkDictionaryPageOffset offset69665577
testing : currentChunkFirstDataPage offset69694809
testing : currentChunkDictionaryPageOffset offset72063808
testing : currentChunkFirstDataPage offset72093040
testing : currentChunkDictionaryPageOffset offset74461441
testing : currentChunkFirstDataPage offset74461508
testing : currentChunkDictionaryPageOffset offset75041119
testing : currentChunkFirstDataPage offset75092758
testing : currentChunkDictionaryPageOffset offset77575161
testing : currentChunkFirstDataPage offset77626525
testing : currentChunkDictionaryPageOffset offset80116424
testing : currentChunkFirstDataPage offset80116456
testing : currentChunkDictionaryPageOffset offset80505206
testing : currentChunkFirstDataPage offset80505351
testing : currentChunkDictionaryPageOffset offset81581705
testing : currentChunkFirstDataPage offset81581772
testing : currentChunkDictionaryPageOffset offset82473442
testing : currentChunkFirstDataPage offset82473740
testing : currentChunkDictionaryPageOffset offset83918856
testing : currentChunkFirstDataPage offset83921564
testing : currentChunkDictionaryPageOffset offset85457651
testing : currentChunkFirstDataPage offset85457674
testing : currentChunkFirstDataPage offset85460523
testing : currentChunkDictionaryPageOffset offset132143159
testing : currentChunkFirstDataPage offset132146109
testing :block offset: 4
testing : currentChunkFirstDataPage offset133961161
testing : currentChunkFirstDataPage offset144992321
testing : currentChunkFirstDataPage offset172566390
testing : currentChunkDictionaryPageOffset offset183343431
testing : currentChunkFirstDataPage offset183401055
testing : currentChunkFirstDataPage offset185701717
testing : currentChunkDictionaryPageOffset offset196732869
testing : currentChunkFirstDataPage offset196762101
testing : currentChunkDictionaryPageOffset offset198896490
testing : currentChunkFirstDataPage offset198925722
testing : currentChunkDictionaryPageOffset offset201059822
testing : currentChunkFirstDataPage offset201059889
testing : currentChunkDictionaryPageOffset offset201582088
testing : currentChunkFirstDataPage offset201633695
testing : currentChunkDictionaryPageOffset offset203869258
testing : currentChunkFirstDataPage offset203920622
testing : currentChunkDictionaryPageOffset offset206163685
testing : currentChunkFirstDataPage offset206163718
testing : currentChunkDictionaryPageOffset offset206513919
testing : currentChunkFirstDataPage offset206514064
testing : currentChunkDictionaryPageOffset offset207484483
testing : currentChunkFirstDataPage offset207484550
testing : currentChunkDictionaryPageOffset offset208288402
testing : currentChunkFirstDataPage offset208288700
testing : currentChunkDictionaryPageOffset offset209591541
testing : currentChunkFirstDataPage offset209594249
testing : currentChunkDictionaryPageOffset offset210978198
testing : currentChunkFirstDataPage offset210978221
testing : currentChunkFirstDataPage offset210980774
testing : currentChunkDictionaryPageOffset offset253052539
testing : currentChunkFirstDataPage offset253055489
testing :block offset: 133961161
testing : set File_offset for rowgroup. with position: 4
testing : set File_offset for rowgroup. with position: 132143159
{noformat}
Notice that the second file offset, {color:red}132143159{color}, is wrong (133961161 is expected); it is the last column's dictionary page offset in the first row group.
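The mismatch described above can be sketched in a few lines. This is a hypothetical, self-contained illustration (the class and method names below do not exist in parquet-mr); it only reproduces the arithmetic of the failure using the offsets from the debug log: the second row group's first column chunk has no dictionary and its first data page starts at 133961161, but the writer recorded 132143159 (the previous row group's last dictionary page offset), so offset-based filtering keeps only the first block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the two behaviours described above; not parquet-mr source.
public class FileOffsetMismatchSketch {

    // A row group's file offset should come from its *own* first column chunk:
    // the dictionary page offset when a dictionary is present, otherwise the
    // offset of the first data page.
    static long rowGroupStartingOffset(Long dictionaryPageOffset, long firstDataPageOffset) {
        if (dictionaryPageOffset != null) {
            return Math.min(dictionaryPageOffset, firstDataPageOffset);
        }
        return firstDataPageOffset;
    }

    // Offset-based row-group filtering in the spirit of the check behind
    // ParquetMetadataConverter.java#L1264: keep only the blocks whose recorded
    // offset is one of the offsets the split expects.
    static List<Long> filterBlocksByOffset(List<Long> recordedOffsets, Set<Long> splitOffsets) {
        List<Long> kept = new ArrayList<>();
        for (long offset : recordedOffsets) {
            if (splitOffsets.contains(offset)) {
                kept.add(offset);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Second row group: no dictionary, first data page at 133961161.
        long expectedSecond = rowGroupStartingOffset(null, 133961161L);
        Set<Long> split = new HashSet<>(Arrays.asList(4L, expectedSecond));

        // The writer instead recorded 132143159 as the second file offset.
        List<Long> recorded = Arrays.asList(4L, 132143159L);

        // Only the first block survives the filter, which is what triggers
        // "All of the offsets in the split should be found in the file."
        List<Long> kept = filterBlocksByOffset(recorded, split);
        System.out.println("expected=" + split + " kept=" + kept);
    }
}
```

With these inputs the filter returns a single block at offset 4, matching the failure in the stack trace below, where the split expects [4, 133961161] but the second row group's metadata carries the wrong offset.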
> Failed to read parquet file after writing with the same parquet version
> -----------------------------------------------------------------------
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: Nemon Lou
> Priority: Critical
>
> Writing a parquet file with version 1.12.0 in Apache Hive, then reading that file, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split should be found in the file. expected: [4, 133961161] found: [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED [c_customer_sk] optional int64 c_customer_sk [PLAIN, RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id (STRING) [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED [c_current_cdemo_sk] optional int64 c_current_cdemo_sk [PLAIN, RLE, BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] optional int64 c_current_hdemo_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 c_current_addr_sk [PLAIN, RLE, BIT_PACKED], 57421932}, ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 c_first_shipto_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 c_first_sales_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary c_preferred_cust_flag (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 c_birth_day [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED [c_birth_year] optional int32 c_birth_year [RLE, PLAIN_DICTIONARY, BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] optional binary c_birth_country (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary c_login (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address (STRING) [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED [c_last_review_date_sk] optional int64 c_last_review_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
> at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172) ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
> at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:96) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_292]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_292]
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_292]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_292]
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:254) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:214) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:342) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:716) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175) ~[hadoop-mapreduce-client-core-3.1.4.jar:?]
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) ~[hadoop-mapreduce-client-core-3.1.4.jar:?]
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) ~[hadoop-mapreduce-client-core-3.1.4.jar:?]
> at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271) ~[hadoop-mapreduce-client-common-3.1.4.jar:?]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_292]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_292]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_292]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_292]
> at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
> {noformat}
> Reproduce scenario:
> TPC-DS table customer: any parquet file written by 1.12.0 that is larger than 128 MB (two row groups).
> {code:sql}
> create table if not exists customer(
> c_customer_sk bigint
> , c_customer_id char(16)
> , c_current_cdemo_sk bigint
> , c_current_hdemo_sk bigint
> , c_current_addr_sk bigint
> , c_first_shipto_date_sk bigint
> , c_first_sales_date_sk bigint
> , c_salutation char(10)
> , c_first_name char(20)
> , c_last_name char(30)
> , c_preferred_cust_flag char(1)
> , c_birth_day int
> , c_birth_month int
> , c_birth_year int
> , c_birth_country varchar(20)
> , c_login char(13)
> , c_email_address char(50)
> , c_last_review_date_sk bigint
> )
> stored as parquet location 'file:///home/username/data/customer';
> --after add file:
> select count(*) from customer;
> {code}
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)