Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16545 )

Change subject: IMPALA-10215: Implement INSERT INTO for non-partitioned Iceberg 
tables (Parquet)
......................................................................


Patch Set 2:

(16 comments)

Thanks for the comments. I'm going to rebase and add a stress test.

http://gerrit.cloudera.org:8080/#/c/16545/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16545/1//COMMIT_MSG@30
PS1, Line 30:
> It would be good to have some kind of parallel stress testing - the main qu
I'll add stress testing after rebase.

Iceberg let's you to atomically add a set of files to the data set via the 
AppendFiles API.


http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/output-partition.h
File be/src/exec/output-partition.h:

http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/output-partition.h@25
PS1, Line 25: Needed
> Needed?
Done


http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/parquet/hdfs-parquet-table-writer.h
File be/src/exec/parquet/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/parquet/hdfs-parquet-table-writer.h@225
PS1, Line 225: an Iceberg data file.
> nit: an vs files
Done


http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/parquet/hdfs-parquet-table-writer.h@226
PS1, Line 226: fill
> nit: fills
Done


http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/coordinator.cc@802
PS1, Line 802: rn return_status;
> remove log message
Done


http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/coordinator.cc@803
PS1, Line 803: }
             :
> I am wondering which is the better place to call this: here, or in CatalogO
Yeah, I was hesitating where to add the append files call. I think it's a good 
point that the table is locked in updateCatalog(), so it's safer doing it there.


http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/dml-exec-state.h
File be/src/runtime/dml-exec-state.h:

http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/dml-exec-state.h@33
PS1, Line 33: struc
> struct
Done


http://gerrit.cloudera.org:8080/#/c/16545/1/common/thrift/Descriptors.thrift
File common/thrift/Descriptors.thrift:

http://gerrit.cloudera.org:8080/#/c/16545/1/common/thrift/Descriptors.thrift@54
PS1, Line 54: struct TColumnDescriptor {
            :   1: required string name
            :
> Maybe some comment here? If this struct just contains fieldId, why not just
I've created a struct for extendability. But probably we won't need other 
fields, so I'm switching to i32.


http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4616
PS1, Line 4616:         // new parts are only created in case the table is 
partitioned.
> line too long (94 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4660
PS1, Line 4660:     // List of all insert events that we call HMS 
fireInsertEvent() on.
> nit: +4 indentation
Done


http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java@213
PS1, Line 213:         IcebergUtil.getBaseTable(feIcebergTable);
             :     AppendFiles append = nativeIcebergTable.newAppend();
> These two variables not been used in this method.
Done


http://gerrit.cloudera.org:8080/#/c/16545/2/impala-parent/pom.xml
File impala-parent/pom.xml:

http://gerrit.cloudera.org:8080/#/c/16545/2/impala-parent/pom.xml@66
PS2, Line 66: >0.8.0-incubating
(I changed it temporarily for my IDE) Revert it to use the environment variable.


http://gerrit.cloudera.org:8080/#/c/16545/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test:

http://gerrit.cloudera.org:8080/#/c/16545/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test@12
PS1, Line 12: date_col DATE,
> Should't this be DATE type instead to check coverage for that as well? 'str
Good idea, done.


http://gerrit.cloudera.org:8080/#/c/16545/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test@23
PS1, Line 23:
> You can convert this to date like CAST(date_string_col as date FORMAT 'MM/D
Done


http://gerrit.cloudera.org:8080/#/c/16545/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test@172
PS1, Line 172: 
row_regex:'$NAMENODE/test-warehouse/$DATABASE/hadoop_catalog_test/test/custom_db/int_table/data/.*.0.parq','.*',''
> As I see there is no such test where you insert multiple times into a table
I'm using VERIFY_IS_SUBSET here, which checks one-by-one if the rows appear in 
the result set. I'm also using row_regex because I don't know the actual file 
names. Therefore this test would also pass if I copied this row multiple times.

I could also get rid of VERIFY_IS_SUBSET, but then I would have to list all the 
files, even the ones that we might want to hide in the future. I opened 
IMPALA-10254 to load Iceberg files differently, I think we'll need such tests 
in the context of that Jira.


http://gerrit.cloudera.org:8080/#/c/16545/2/tests/stress/test_acid_stress.py
File tests/stress/test_acid_stress.py:

http://gerrit.cloudera.org:8080/#/c/16545/2/tests/stress/test_acid_stress.py@1
PS2, Line 1: # Licensed to the Apache Software Foundation (ASF) under one
Remove this file after rebase



--
To view, visit http://gerrit.cloudera.org:8080/16545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5690fb6c2cc51f0033fa26caf8597c80a11bcd8e
Gerrit-Change-Number: 16545
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: wangsheng <sky...@163.com>
Gerrit-Comment-Date: Mon, 19 Oct 2020 09:36:05 +0000
Gerrit-HasComments: Yes

Reply via email to