[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405183#comment-17405183 ]
Nemon Lou edited comment on PARQUET-2078 at 8/26/21, 12:13 PM: --------------------------------------------------------------- Root cause analysis: the file offset added in parquet 1.12.0 can go wrong under certain conditions.
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L580
When ParquetRecordReader is used with blocks passed in, the wrongly set file offset causes some blocks to be filtered out due to an offset mismatch.
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1264
When can the file offset go wrong? Here are writer debug logs to clarify it.
{noformat}
testing : currentChunkFirstDataPage offset4
testing : currentChunkFirstDataPage offset12243647
testing : currentChunkFirstDataPage offset42848491
testing : currentChunkDictionaryPageOffset offset54810911
testing : currentChunkFirstDataPage offset54868535
testing : currentChunkFirstDataPage offset57421932
testing : currentChunkDictionaryPageOffset offset69665577
testing : currentChunkFirstDataPage offset69694809
testing : currentChunkDictionaryPageOffset offset72063808
testing : currentChunkFirstDataPage offset72093040
testing : currentChunkDictionaryPageOffset offset74461441
testing : currentChunkFirstDataPage offset74461508
testing : currentChunkDictionaryPageOffset offset75041119
testing : currentChunkFirstDataPage offset75092758
testing : currentChunkDictionaryPageOffset offset77575161
testing : currentChunkFirstDataPage offset77626525
testing : currentChunkDictionaryPageOffset offset80116424
testing : currentChunkFirstDataPage offset80116456
testing : currentChunkDictionaryPageOffset offset80505206
testing : currentChunkFirstDataPage offset80505351
testing : currentChunkDictionaryPageOffset offset81581705
testing : currentChunkFirstDataPage offset81581772
testing : currentChunkDictionaryPageOffset offset82473442
testing : currentChunkFirstDataPage offset82473740
testing : currentChunkDictionaryPageOffset offset83918856
testing : currentChunkFirstDataPage offset83921564
testing : currentChunkDictionaryPageOffset offset85457651
testing : currentChunkFirstDataPage offset85457674
testing : currentChunkFirstDataPage offset85460523
testing : currentChunkDictionaryPageOffset offset132143159
testing : currentChunkFirstDataPage offset132146109
testing :block offset: 4
testing : currentChunkFirstDataPage offset133961161
testing : currentChunkFirstDataPage offset144992321
testing : currentChunkFirstDataPage offset172566390
testing : currentChunkDictionaryPageOffset offset183343431
testing : currentChunkFirstDataPage offset183401055
testing : currentChunkFirstDataPage offset185701717
testing : currentChunkDictionaryPageOffset offset196732869
testing : currentChunkFirstDataPage offset196762101
testing : currentChunkDictionaryPageOffset offset198896490
testing : currentChunkFirstDataPage offset198925722
testing : currentChunkDictionaryPageOffset offset201059822
testing : currentChunkFirstDataPage offset201059889
testing : currentChunkDictionaryPageOffset offset201582088
testing : currentChunkFirstDataPage offset201633695
testing : currentChunkDictionaryPageOffset offset203869258
testing : currentChunkFirstDataPage offset203920622
testing : currentChunkDictionaryPageOffset offset206163685
testing : currentChunkFirstDataPage offset206163718
testing : currentChunkDictionaryPageOffset offset206513919
testing : currentChunkFirstDataPage offset206514064
testing : currentChunkDictionaryPageOffset offset207484483
testing : currentChunkFirstDataPage offset207484550
testing : currentChunkDictionaryPageOffset offset208288402
testing : currentChunkFirstDataPage offset208288700
testing : currentChunkDictionaryPageOffset offset209591541
testing : currentChunkFirstDataPage offset209594249
testing : currentChunkDictionaryPageOffset offset210978198
testing : currentChunkFirstDataPage offset210978221
testing : currentChunkFirstDataPage offset210980774
testing : currentChunkDictionaryPageOffset offset253052539
testing : currentChunkFirstDataPage offset253055489
testing :block offset: 133961161
testing : set File_offset for rowgroup. with position: 4
testing : set File_offset for rowgroup. with position: 132143159
{noformat}
Notice that the second file offset, {color:red}132143159{color}, is wrong (133961161 is expected); it is the last column's dictionary page offset in the first row group.
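The mismatch described above can be sketched in a few lines. This is a hypothetical, self-contained illustration (the class and method names below do not exist in parquet-mr); it only reproduces the arithmetic of the failure using the offsets from the debug log: the second row group's first column chunk has no dictionary and its first data page starts at 133961161, but the writer recorded 132143159 (the previous row group's last dictionary page offset), so offset-based filtering keeps only the first block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the two behaviours described above; not parquet-mr source.
public class FileOffsetMismatchSketch {

    // A row group's file offset should come from its *own* first column chunk:
    // the dictionary page offset when a dictionary is present, otherwise the
    // offset of the first data page.
    static long rowGroupStartingOffset(Long dictionaryPageOffset, long firstDataPageOffset) {
        if (dictionaryPageOffset != null) {
            return Math.min(dictionaryPageOffset, firstDataPageOffset);
        }
        return firstDataPageOffset;
    }

    // Offset-based row-group filtering in the spirit of the check behind
    // ParquetMetadataConverter.java#L1264: keep only the blocks whose recorded
    // offset is one of the offsets the split expects.
    static List<Long> filterBlocksByOffset(List<Long> recordedOffsets, Set<Long> splitOffsets) {
        List<Long> kept = new ArrayList<>();
        for (long offset : recordedOffsets) {
            if (splitOffsets.contains(offset)) {
                kept.add(offset);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Second row group: no dictionary, first data page at 133961161.
        long expectedSecond = rowGroupStartingOffset(null, 133961161L);
        Set<Long> split = new HashSet<>(Arrays.asList(4L, expectedSecond));

        // The writer instead recorded 132143159 as the second file offset.
        List<Long> recorded = Arrays.asList(4L, 132143159L);

        // Only the first block survives the filter, which is what triggers
        // "All of the offsets in the split should be found in the file."
        List<Long> kept = filterBlocksByOffset(recorded, split);
        System.out.println("expected=" + split + " kept=" + kept);
    }
}
```

With these inputs the filter returns a single block at offset 4, matching the failure in the stack trace below, where the split expects [4, 133961161] but the second row group's metadata carries the wrong offset.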
> Failed to read parquet file after writing with the same parquet version
> -----------------------------------------------------------------------
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: Nemon Lou
> Priority: Critical
>
> Writing a parquet file with version 1.12.0 in Apache Hive, then reading that file, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split should be found in the file. expected: [4, 133961161] found: [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED [c_customer_sk] optional int64 c_customer_sk [PLAIN, RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id (STRING) [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED [c_current_cdemo_sk] optional int64 c_current_cdemo_sk [PLAIN, RLE, BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] optional int64 c_current_hdemo_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 c_current_addr_sk [PLAIN, RLE, BIT_PACKED], 57421932}, ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 c_first_shipto_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 c_first_sales_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary c_preferred_cust_flag (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 c_birth_day [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED [c_birth_year] optional int32 c_birth_year [RLE, PLAIN_DICTIONARY, BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] optional binary c_birth_country (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary c_login (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address (STRING) [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED [c_last_review_date_sk] optional int64 c_last_review_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
> at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172) ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
> at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:96) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_292]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_292]
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_292]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_292]
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:254) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:214) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:342) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:716) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175) ~[hadoop-mapreduce-client-core-3.1.4.jar:?]
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) ~[hadoop-mapreduce-client-core-3.1.4.jar:?]
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) ~[hadoop-mapreduce-client-core-3.1.4.jar:?]
> at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271) ~[hadoop-mapreduce-client-common-3.1.4.jar:?]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_292]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_292]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_292]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_292]
> at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
> {noformat}
> Reproduce scenario:
> TPC-DS table customer: any parquet file written by 1.12.0 that is larger than 128 MB (two row groups).
> {code:sql}
> create table if not exists customer(
> c_customer_sk bigint
> , c_customer_id char(16)
> , c_current_cdemo_sk bigint
> , c_current_hdemo_sk bigint
> , c_current_addr_sk bigint
> , c_first_shipto_date_sk bigint
> , c_first_sales_date_sk bigint
> , c_salutation char(10)
> , c_first_name char(20)
> , c_last_name char(30)
> , c_preferred_cust_flag char(1)
> , c_birth_day int
> , c_birth_month int
> , c_birth_year int
> , c_birth_country varchar(20)
> , c_login char(13)
> , c_email_address char(50)
> , c_last_review_date_sk bigint
> )
> stored as parquet location 'file:///home/username/data/customer';
> --after add file:
> select count(*) from customer;
> {code}
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)