Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 )
Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) ...................................................................... Patch Set 5: (6 comments) http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@9 PS4, Line 9: Hive ACID supports row-level DELETE and UPDATE operations on a table. > nit: spelling 'operatations' Done http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@11 PS4, Line 11: maintaining two sets of files in a table. The first set is in the > nit: spelling 'maintinaining' Done http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454 PS2, Line 1454: if (fd.getRelativePath().startsWith("delete_delta_")) { > I think for this patch it is fine to keep it this way. I saw in the doc th Yeah I agree. Will do that during perf analysis. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551 PS2, Line 1551: HdfsScanNode deltaScanNode = new HdfsScanNode(ctx_.getNextNodeId(), > Actually, the planner does this optimization as part of SanNode.applyCountS You are right, I was thinking about a different optimization in the ORC scanner which is also for count(*): https://github.com/apache/impala/blob/17fd15c6e4981499932c02d541c76757a5fdf87d/be/src/exec/hdfs-orc-scanner.cc#L531-L548 Anyway, now we have tests for count(*) + ACID so if someone implements the Parquet-kind count(*) optimization for ORC in the future, they'll know if they break stg. http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58 PS2, Line 58: | row-size=100B cardinality=143 > Regarding the conjuncts, wouldn't it be incorrect to even pass the conjunc Yeah I don't think we should pass the conjuncts to the delete deltas. Also it wouldn't have any effect since the delete delta files contain no real data, only the ACID fields (originaltransaction, rowid, etc.) which serve as a tombstone. To answer the question about cardinality, yes, COMPUTE STATS would show 90. It's because COMPUTE STATS is query-based, so it runs count(*) and count(distinct) queries on a table and stores the results of it. And the "DELETE TABLE" is not a separate table. The delete files are stored under the table's directory in so called "delete delta" directories next to the "delta" directories, e.g.: /test-warehouse/managed/full_acid_part/p=1/delete_delta_0000003_0000003_0000/bucket_00000 /test-warehouse/managed/full_acid_part/p=1/delta_0000001_0000001_0000/bucket_00000_0 Given that do you think we should rename "DELETE TABLE HASH JOIN" to "DELETE EVENTS HASH JOIN"? (The Hive docs usually refer to the rows in delete delta files as "delete events") http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286 PS4, Line 286: | | hash predicates: functional_orc_def.alltypes_deleted_rows.month = functional_orc_def.alltypes_deleted_rows-delete-delta.month, functional_orc_def.alltypes_deleted_rows.row__id.bucket = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction, functional_orc_def.alltypes_deleted_rows.row__id.rowid = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, functional_orc_def.alltypes_deleted_rows.year = functional_orc_def.alltypes_deleted_rows-delete-delta.year > This looks like it is adding all the other Slots to hash predicates, apart It only adds the hidden ACID columns + the partitioning columns (year, month). -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Mon, 22 Jun 2020 16:13:53 +0000 Gerrit-HasComments: Yes