[ https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated HIVE-22579:
--------------------------------
    Description:
There is a scenario where different SplitGenerator instances try to cover the delta-only buckets (those having no base file) more than once. Multiple OrcSplit instances can then be generated for the same delta file, so more than one task reads that file and a simple select star query returns duplicate records.

File structure for a 256 bucket table:
{code}
drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013
-rw-r--r--   3 hive hadoop        353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00012
-rw-r--r--   3 hive hadoop       1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00140
drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000
-rwxrwxrwx   3 hive hadoop        348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00012
-rwxrwxrwx   3 hive hadoop       1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00140
drwxrwxrwx   - hive hadoop          0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000
-rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00012
-rwxrwxrwx   3 hive hadoop       1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00140
drwxrwxrwx   - hive hadoop          0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000
-rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00043
-rwxrwxrwx   3 hive hadoop       1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
{code}
In this case, when the bucket_00171 file has a record and there is no base file for that bucket, a select (*) with the ETL split strategy can generate 2 splits for the
same delta bucket...

> ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-22579
>                 URL: https://issues.apache.org/jira/browse/HIVE-22579
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: HIVE-22579.01.branch-2.patch
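
The scenario in the description can be modelled with a small, self-contained sketch. This is not the code from the attached patch or from Hive's OrcInputFormat: the class `DeltaCoverageSketch`, the method `planSplits`, and the `markCovered` flag are hypothetical names chosen for illustration, and the model assumes one simulated generator per base file sharing a single `covered` array. It only shows why a delta-only bucket that is never marked as covered can be claimed by more than one generator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the bug; names are illustrative, not Hive's API.
class DeltaCoverageSketch {

    // Returns the splits produced when one simulated generator runs per base bucket.
    // markCovered = false models the reported behaviour, true models the proposed fix.
    static List<String> planSplits(int numBuckets, int[] baseBuckets,
                                   int[] deltaBuckets, boolean markCovered) {
        boolean[] covered = new boolean[numBuckets]; // shared by all generators
        for (int b : baseBuckets) {
            covered[b] = true; // buckets with a base file are read together with it
        }
        List<String> splits = new ArrayList<>();
        for (int base : baseBuckets) {              // one simulated generator per base file
            splits.add("base:" + base);
            for (int d : deltaBuckets) {
                if (!covered[d]) {                  // delta-only bucket not yet claimed
                    splits.add("delta-only:" + d);
                    if (markCovered) {
                        covered[d] = true;          // claim it so later generators skip it
                    }
                }
            }
        }
        return splits;
    }

    // Counts how many times a given split name was emitted.
    static long count(List<String> splits, String name) {
        return splits.stream().filter(name::equals).count();
    }
}
```

With the bucket ids from the listing (base buckets 12 and 140, delta buckets 12, 140, 43 and 171), this model emits `delta-only:171` twice when `markCovered` is false and once when it is true. The sketch runs the generators sequentially; code that runs them concurrently would also need to synchronize access to the shared array.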