[ https://issues.apache.org/jira/browse/DRILL-3871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974617#comment-14974617 ]
ASF GitHub Bot commented on DRILL-3871:
---------------------------------------

GitHub user parthchandra opened a pull request:

    https://github.com/apache/drill/pull/219

DRILL-3871: Off by one error while reading binary fields with one terminal null in parquet.

Changes:
1) Rewrote the NullableColumnReader.processPages function to process runs of null values and runs of non-null values without needing to keep track of whether the previous iteration of the while loop encountered a null. A pair of loops now iterates over a run of nulls or a run of non-null values.
2) Removed some redundant code.
3) Renamed some variables. indexInOutputVector is now replaced by two local variables, readCount and writeCount, for clarity.
4) Added tracing.
5) Added unit tests for the edge cases of nulls occurring on page boundaries.

For all the unit tests and the TPC-H and TPC-DS test data sets, the state of the NullableColumnReader at the end of each iteration of processPages is identical to that of the old code. In addition, the boundary conditions are now handled.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/parthchandra/incubator-drill DRILL-3871

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/219.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #219

----

commit d23ceb2a4c32da9535f1e482c4c70fcc31b8b2b8
Author: Parth Chandra <par...@apache.org>
Date:   2015-10-05T17:25:56Z

    DRILL-3871: Off by one error while reading binary fields with one terminal null in parquet.
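The run-based approach described in change (1) can be sketched roughly as below. This is a simplified, hypothetical illustration, not the actual NullableColumnReader code: the real reader operates on Parquet definition levels, Drill value vectors, and page boundaries, and all names here (RunBasedNullReader, processRuns) are illustrative. The key idea is that an inner loop consumes an entire run of nulls and a second inner loop consumes an entire run of non-nulls, so no state about the previous iteration needs to be carried across the outer while loop.

```java
import java.util.Arrays;

/**
 * Sketch of run-based null processing. Definition levels drive the copy:
 * 0 = null, 1 = value present. Non-null values are stored densely in a
 * separate array, as in a Parquet data page.
 */
public class RunBasedNullReader {

    /**
     * Copies non-null values into {@code out} (nulls stay zero) and
     * returns the number of non-null values written.
     */
    static int processRuns(int[] defLevels, int[] values, int[] out) {
        int readCount = 0;   // position in defLevels / out
        int valueIdx = 0;    // position in the dense values array
        int writeCount = 0;  // non-null values written
        while (readCount < defLevels.length) {
            // Consume a run of nulls: nothing to copy, just advance.
            while (readCount < defLevels.length && defLevels[readCount] == 0) {
                readCount++;
            }
            // Consume a run of non-nulls: copy each value.
            while (readCount < defLevels.length && defLevels[readCount] == 1) {
                out[readCount++] = values[valueIdx++];
                writeCount++;
            }
        }
        return writeCount;
    }

    public static void main(String[] args) {
        // Edge case from the bug title: the last value is null. A reader
        // that tracks "previous value was null" across iterations can
        // miscount by one here; the run-based loops terminate cleanly.
        int[] defLevels = {1, 1, 0, 1, 0};
        int[] values = {10, 20, 30};  // dense, non-null values only
        int[] out = new int[defLevels.length];
        int written = processRuns(defLevels, values, out);
        System.out.println(written + " " + Arrays.toString(out));
    }
}
```

With the inputs in main, processRuns writes the three non-null values into positions 0, 1, and 3 and returns 3, leaving the terminal null untouched.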
----

> Exception on inner join when join predicate is int96 field generated by impala
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-3871
>                 URL: https://issues.apache.org/jira/browse/DRILL-3871
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types
>    Affects Versions: 1.2.0
>            Reporter: Victoria Markman
>            Assignee: Parth Chandra
>            Priority: Critical
>              Labels: int96
>             Fix For: 1.3.0
>
>         Attachments: tables.tar
>
>
> Both tables in the join were created by impala, with column c_timestamp being parquet int96.
> {code}
> 0: jdbc:drill:schema=dfs> select
> . . . . . . . . . . . . >     max(t1.c_timestamp),
> . . . . . . . . . . . . >     min(t1.c_timestamp),
> . . . . . . . . . . . . >     count(t1.c_timestamp)
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . >     imp_t1 t1
> . . . . . . . . . . . . > inner join
> . . . . . . . . . . . . >     imp_t2 t2
> . . . . . . . . . . . . > on (t1.c_timestamp = t2.c_timestamp)
> . . . . . . . . . . . . > ;
> java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR: TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
> Fragment 0:0
> [Error Id: eb6a5df8-fc59-409b-957a-59cb1079b5b8 on atsqa4-133.qa.lab:31010]
>         at sqlline.IncrementalRows.hasNext(IncrementalRows.java:73)
>         at sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
>         at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
>         at sqlline.SqlLine.print(SqlLine.java:1583)
>         at sqlline.Commands.execute(Commands.java:852)
>         at sqlline.Commands.sql(Commands.java:751)
>         at sqlline.SqlLine.dispatch(SqlLine.java:738)
>         at sqlline.SqlLine.begin(SqlLine.java:612)
>         at sqlline.SqlLine.start(SqlLine.java:366)
>         at sqlline.SqlLine.main(SqlLine.java:259)
> {code}
> drillbit.log
> {code}
> 2015-09-30 21:15:45,710 [29f3aefe-3209-a6e6-0418-500dac60a339:foreman] INFO o.a.d.exec.store.parquet.Metadata - Took 0 ms to get file statuses
> 2015-09-30 21:15:45,712 [29f3aefe-3209-a6e6-0418-500dac60a339:foreman] INFO o.a.d.exec.store.parquet.Metadata - Fetch parquet metadata: Executed 1 out of 1 using 1 threads. Time: 1ms total, 1.645381ms avg, 1ms max.
> 2015-09-30 21:15:45,712 [29f3aefe-3209-a6e6-0418-500dac60a339:foreman] INFO o.a.d.exec.store.parquet.Metadata - Fetch parquet metadata: Executed 1 out of 1 using 1 threads. Earliest start: 1.332000 μs, Latest start: 1.332000 μs, Average start: 1.332000 μs.
> 2015-09-30 21:15:45,830 [29f3aefe-3209-a6e6-0418-500dac60a339:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 29f3aefe-3209-a6e6-0418-500dac60a339:0:0: State change requested AWAITING_ALLOCATION --> RUNNING
> 2015-09-30 21:15:45,830 [29f3aefe-3209-a6e6-0418-500dac60a339:frag:0:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 29f3aefe-3209-a6e6-0418-500dac60a339:0:0: State to report: RUNNING
> 2015-09-30 21:15:45,925 [29f3aefe-3209-a6e6-0418-500dac60a339:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 29f3aefe-3209-a6e6-0418-500dac60a339:0:0: State change requested RUNNING --> FAILED
> 2015-09-30 21:15:45,930 [29f3aefe-3209-a6e6-0418-500dac60a339:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 29f3aefe-3209-a6e6-0418-500dac60a339:0:0: State change requested FAILED --> FINISHED
> 2015-09-30 21:15:45,931 [29f3aefe-3209-a6e6-0418-500dac60a339:frag:0:0] ERROR o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
> Fragment 0:0
> [Error Id: eb6a5df8-fc59-409b-957a-59cb1079b5b8 on atsqa4-133.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
> Fragment 0:0
> [Error Id: eb6a5df8-fc59-409b-957a-59cb1079b5b8 on atsqa4-133.qa.lab:31010]
>         at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534) ~[drill-common-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:323) [drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:178) [drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:292) [drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in parquet record reader.
> Message:
> Hadoop path: /drill/testdata/subqueries/imp_t2/bf4261140dac8d45-814d66b86bf960b8_853027779_data.0.parq
> Total records read: 10
> Mock records read: 0
> Records to read: 1
> Row group index: 0
> Records in row group: 10
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message schema {
>   optional binary c_varchar (UTF8);
>   optional int32 c_integer;
>   optional int64 c_bigint;
>   optional float c_float;
>   optional double c_double;
>   optional binary c_date (UTF8);
>   optional binary c_time (UTF8);
>   optional int96 c_timestamp;
>   optional boolean c_boolean;
>   optional double d9;
>   optional double d18;
>   optional double d28;
>   optional double d38;
> }
> , metadata: {}}, blocks: [BlockMetaData{10, 1507 [ColumnMetaData{SNAPPY [c_varchar] BINARY [PLAIN, PLAIN_DICTIONARY, RLE], 173}, ColumnMetaData{SNAPPY [c_integer] INT32 [PLAIN, PLAIN_DICTIONARY, RLE], 299}, ColumnMetaData{SNAPPY [c_bigint] INT64 [PLAIN, PLAIN_DICTIONARY, RLE], 453}, ColumnMetaData{SNAPPY [c_float] FLOAT [PLAIN, PLAIN_DICTIONARY, RLE], 581}, ColumnMetaData{SNAPPY [c_double] DOUBLE [PLAIN, PLAIN_DICTIONARY, RLE], 747}, ColumnMetaData{SNAPPY [c_date] BINARY [PLAIN, PLAIN_DICTIONARY, RLE], 900}, ColumnMetaData{SNAPPY [c_time] BINARY [PLAIN, PLAIN_DICTIONARY, RLE], 1045}, ColumnMetaData{SNAPPY [c_timestamp] INT96 [PLAIN, PLAIN_DICTIONARY, RLE], 1213}, ColumnMetaData{SNAPPY [c_boolean] BOOLEAN [PLAIN, PLAIN_DICTIONARY, RLE], 1293}, ColumnMetaData{SNAPPY [d9] DOUBLE [PLAIN, PLAIN_DICTIONARY, RLE], 1448}, ColumnMetaData{SNAPPY [d18] DOUBLE [PLAIN, PLAIN_DICTIONARY, RLE], 1609}, ColumnMetaData{SNAPPY [d28] DOUBLE [PLAIN, PLAIN_DICTIONARY, RLE], 1771}, ColumnMetaData{SNAPPY [d38] DOUBLE [PLAIN, PLAIN_DICTIONARY, RLE], 1933}]}]}
>         at org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise(ParquetRecordReader.java:346) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:448) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:183) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:104) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:94) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:129) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:147) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:104) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.join.HashJoinBatch.executeBuildPhase(HashJoinBatch.java:403) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext(HashJoinBatch.java:218) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:147) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:104) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:94) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:129) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:147) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:104) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:94) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext(StreamingAggBatch.java:136) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:147) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:104) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:94) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:129) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:147) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:83) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:80) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:73) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:258) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:252) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at java.security.AccessController.doPrivileged(Native Method) ~[na:1.7.0_71]
>         at javax.security.auth.Subject.doAs(Subject.java:415) ~[na:1.7.0_71]
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566) ~[hadoop-common-2.5.1-mapr-1503.jar:na]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:252) [drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         ... 4 common frames omitted
> Caused by: java.io.IOException: can not read class parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
>         at parquet.format.Util.read(Util.java:50) ~[parquet-format-2.1.1-drill-r1.jar:na]
>         at parquet.format.Util.readPageHeader(Util.java:26) ~[parquet-format-2.1.1-drill-r1.jar:na]
>         at org.apache.drill.exec.store.parquet.ColumnDataReader.readPageHeader(ColumnDataReader.java:46) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.store.parquet.columnreaders.PageReader.next(PageReader.java:191) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.store.parquet.columnreaders.NullableColumnReader.processPages(NullableColumnReader.java:76) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields(ParquetRecordReader.java:387) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         at org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:430) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:1.2.0-SNAPSHOT]
>         ... 43 common frames omitted
> Caused by: parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
>         at parquet.format.PageHeader.read(PageHeader.java:905) ~[parquet-format-2.1.1-drill-r1.jar:na]
>         at parquet.format.Util.read(Util.java:47) ~[parquet-format-2.1.1-drill-r1.jar:na]
>         ... 49 common frames omitted
> 2015-09-30 21:15:45,951 [BitServer-4] WARN o.a.drill.exec.work.foreman.Foreman - Dropping request to move to COMPLETED state as query is already at FAILED state (which is terminal).
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)