[ https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991608#comment-16991608 ]
László Bodor edited comment on HIVE-22579 at 12/9/19 2:01 PM:
--------------------------------------------------------------

Most of the failures have been known since... 2017!! https://issues.apache.org/jira/browse/HIVE-17436

Exceptions (I'm creating tickets for failures seen without the patch):
TestAccumuloCliDriver.testCliDriver[accumulo_queries] -> HIVE-22600
TestMiniLlapCliDriver.testCliDriver[table_nonprintable]
TestMiniLlapLocalCliDriver.testCliDriver[vectorized_parquet_types]
TestSparkCliDriver.testCliDriver[vectorization_input_format_excludes]
TestStorageBasedMetastoreAuthorizationReads.testReadTableFailure

> ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-22579
>                 URL: https://issues.apache.org/jira/browse/HIVE-22579
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>        Attachments: HIVE-22579.01.branch-2.patch, HIVE-22579.01.branch-2.patch
>
> There is a scenario in which different SplitGenerator instances try to cover the delta-only buckets (those with no base file) more than once, so multiple OrcSplit instances can be generated for the same delta file. Multiple tasks then read the same delta file, producing duplicate records in a simple SELECT * query.
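The covering logic at issue can be modeled with a minimal, hypothetical sketch (simplified names, not the real OrcInputFormat code): several generator passes share one covered[] array indexed by bucket, and whether a pass marks the delta-only buckets it emits decides whether later passes duplicate them.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the ETL split strategy's shared covered[]
// array; names and structure do not match the actual OrcInputFormat internals.
// Each base file gets one generator pass; delta-only buckets must be marked
// covered on the first pass, or every later pass re-emits a split for them.
public class CoveredDeltaModel {

    // One pass: emit a split for every delta-only bucket not yet covered.
    // markCovered=false reproduces the bug reported here.
    static List<Integer> generateDeltaSplits(boolean[] covered,
                                             int[] deltaBuckets,
                                             boolean markCovered) {
        List<Integer> splits = new ArrayList<>();
        for (int b : deltaBuckets) {
            if (!covered[b]) {
                splits.add(b);             // one split for this delta bucket
                if (markCovered) {
                    covered[b] = true;     // the fix: remember it is covered
                }
            }
        }
        return splits;
    }

    static int countSplits(boolean markCovered) {
        boolean[] covered = new boolean[256];     // 256-bucket table
        covered[12] = covered[140] = true;        // buckets covered by base files
        int[] deltaOnly = {43, 171};              // delta buckets with no base
        int total = 0;
        // two base files -> two generator passes sharing the same covered[]
        for (int pass = 0; pass < 2; pass++) {
            total += generateDeltaSplits(covered, deltaOnly, markCovered).size();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("without marking: " + countSplits(false)); // 4 splits
        System.out.println("with marking:    " + countSplits(true));  // 2 splits
    }
}
```

In this model the unmarked variant emits four delta splits for buckets 43 and 171 instead of two, mirroring the duplicate reads of bucket_00171 shown in the logs below.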
> File structure for a 256 bucket table:
> {code}
> drwxrwxrwx   - hive hadoop       0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013
> -rw-r--r--   3 hive hadoop     353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00012
> -rw-r--r--   3 hive hadoop    1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00140
> drwxrwxrwx   - hive hadoop       0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000
> -rwxrwxrwx   3 hive hadoop     348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop    1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00140
> drwxrwxrwx   - hive hadoop       0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000
> -rwxrwxrwx   3 hive hadoop     348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop    1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00140
> drwxrwxrwx   - hive hadoop       0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000
> -rwxrwxrwx   3 hive hadoop     348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00043
> -rwxrwxrwx   3 hive hadoop    1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
> {code}
> In this case, when the bucket_00171 file has a record and there is no base file for it, a SELECT * with the ETL split strategy can generate 2 splits for the same delta bucket.
>
> The scenario of the issue:
> 1. ETLSplitStrategy contains a [covered[] array|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L763], which is [shared between the SplitInfo instances|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L824] to be created
> 2. a SplitInfo instance is created for [every base file (2 in this case)|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L809]
> 3. for every SplitInfo, [a SplitGenerator is created|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L925-L926], and in its constructor, [the parent's getSplit is called|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1251], which tries to take care of the deltas
>
> I'm not sure at the moment what the intention of this is, but this way a duplicated delta split can be generated, which can cause a duplicated read later (note that both tasks read the same delta file: bucket_00171):
> {code}
> 2019-12-01T16:24:53,669 INFO [TezTR-127843_16_30_0_171_0 (1575040127843_0016_30_00_000171_0)] orc.ReaderImpl: Reading ORC rows from hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171 with {include: [true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807, schema: struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:53,672 INFO [TezTR-127843_16_30_0_171_0 (1575040127843_0016_30_00_000171_0)] lib.MRReaderMapred: Processing split: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1, start=171, length=0, isOriginal=false, fileLength=9223372036854775807, hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 16 stmtIds: [0] }]]
> 2019-12-01T16:24:55,807 INFO [TezTR-127843_16_30_0_425_0 (1575040127843_0016_30_00_000425_0)] orc.ReaderImpl: Reading ORC rows from hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171 with {include: [true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807, schema: struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:55,813 INFO [TezTR-127843_16_30_0_425_0 (1575040127843_0016_30_00_000425_0)] lib.MRReaderMapred: Processing split: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1, start=171, length=0, isOriginal=false, fileLength=9223372036854775807, hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 16 stmtIds: [0] }]]
> {code}
> It seems this issue doesn't affect ACID v2, as getSplits() returns an empty collection, or throws an exception in case of unexpected deltas (not the case here, as the deltas were not unexpected):
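That v2 behavior might be sketched as a fail-fast guard (a hypothetical simplification, not the actual getSplits() at the link below): with no base file, either nothing is emitted for the bucket, or unexpected deltas cause an immediate failure, so a delta-only bucket cannot be silently enumerated twice as in v1.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the ACID v2 guard described above; not the real
// OrcInputFormat code. Names and parameters are illustrative only.
public class AcidV2GuardSketch {
    static List<String> getSplits(boolean hasBase, boolean deltasExpected) {
        if (hasBase) {
            return Collections.singletonList("base-split");
        }
        if (!deltasExpected) {
            // unexpected deltas without a base: fail instead of guessing
            throw new IllegalStateException("deltas present without base file");
        }
        // expected deltas are handled elsewhere; emit nothing here,
        // so no duplicate delta-only splits are possible
        return Collections.emptyList();
    }
}
```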
> https://github.com/apache/hive/blob/8ee3497f87f81fa84ee1023e891dc54087c2cd5e/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1178-L1197

--
This message was sent by Atlassian Jira
(v8.3.4#803005)