[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Zoltan Borok-Nagy (Code Review) Mon, 22 Jun 2020 09:14:25 -0700

Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16082 )


Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified 
tables (primitive types)
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@9
PS4, Line 9: Hive ACID supports row-level DELETE and UPDATE operations on a 
table.
> nit: spelling 'operatations'
Done


http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@11
PS4, Line 11: maintaining two sets of files in a table. The first set is in the
> nit: spelling 'maintinaining'
Done


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454
PS2, Line 1454:         if (fd.getRelativePath().startsWith("delete_delta_")) {
> I think for this patch it is fine to keep it this way.  I saw in the doc th
Yeah I agree. Will do that during perf analysis.


http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551
PS2, Line 1551:     HdfsScanNode deltaScanNode = new 
HdfsScanNode(ctx_.getNextNodeId(),
> Actually, the planner does this optimization as part of SanNode.applyCountS
You are right, I was thinking about a different optimization in the ORC scanner 
which is also for count(*):

https://github.com/apache/impala/blob/17fd15c6e4981499932c02d541c76757a5fdf87d/be/src/exec/hdfs-orc-scanner.cc#L531-L548

Anyway, now we have tests for count(*) + ACID so if someone implements the 
Parquet-kind count(*) optimization for ORC in the future, they'll know if they 
break stg.


http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58
PS2, Line 58: |  row-size=100B cardinality=143
> Regarding the conjuncts, wouldn't it be incorrect to  even pass the conjunc
Yeah I don't think we should pass the conjuncts to the delete deltas. Also it 
wouldn't have any effect since the delete delta files contain no real data, 
only the ACID fields (originaltransaction, rowid, etc.) which serve as a 
tombstone.

To answer the question about cardinality, yes, COMPUTE STATS would show 90. 
It's because COMPUTE STATS is query-based, so it runs count(*) and 
count(distinct) queries on a table and stores the results of it. And the 
"DELETE TABLE" is not a separate table. The delete files are stored under the 
table's directory in so called "delete delta" directories next to the "delta" 
directories, e.g.:

  
/test-warehouse/managed/full_acid_part/p=1/delete_delta_0000003_0000003_0000/bucket_00000
  
/test-warehouse/managed/full_acid_part/p=1/delta_0000001_0000001_0000/bucket_00000_0

Given that do you think we should rename "DELETE TABLE HASH JOIN" to "DELETE 
EVENTS HASH JOIN"? (The Hive docs usually refer to the rows in delete delta 
files as "delete events")


http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286
PS4, Line 286: |  |  hash predicates: 
functional_orc_def.alltypes_deleted_rows.month = 
functional_orc_def.alltypes_deleted_rows-delete-delta.month, 
functional_orc_def.alltypes_deleted_rows.row__id.bucket = 
functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, 
functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = 
functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction,
 functional_orc_def.alltypes_deleted_rows.row__id.rowid = 
functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, 
functional_orc_def.alltypes_deleted_rows.year = 
functional_orc_def.alltypes_deleted_rows-delete-delta.year
> This looks like it is adding all the other Slots to hash predicates, apart
It only adds the hidden ACID columns + the partitioning columns (year, month).



--
To view, visit http://gerrit.cloudera.org:8080/16082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Gerrit-Change-Number: 16082
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jun 2020 16:13:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)

Reply via email to