Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16545 )
Change subject: IMPALA-10215: Implement INSERT INTO for non-partitioned Iceberg tables (Parquet) ...................................................................... Patch Set 2: (16 comments) Thanks for the comments. I'm going to rebase and add a stress test. http://gerrit.cloudera.org:8080/#/c/16545/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16545/1//COMMIT_MSG@30 PS1, Line 30: > It would be good to have some kind of parallel stress testing - the main qu I'll add stress testing after rebase. Iceberg let's you to atomically add a set of files to the data set via the AppendFiles API. http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/output-partition.h File be/src/exec/output-partition.h: http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/output-partition.h@25 PS1, Line 25: Needed > Needed? Done http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/parquet/hdfs-parquet-table-writer.h File be/src/exec/parquet/hdfs-parquet-table-writer.h: http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/parquet/hdfs-parquet-table-writer.h@225 PS1, Line 225: an Iceberg data file. > nit: an vs files Done http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/exec/parquet/hdfs-parquet-table-writer.h@226 PS1, Line 226: fill > nit: fills Done http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/coordinator.cc@802 PS1, Line 802: rn return_status; > remove log message Done http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/coordinator.cc@803 PS1, Line 803: } : > I am wondering which is the better place to call this: here, or in CatalogO Yeah, I was hesitating where to add the append files call. I think it's a good point that the table is locked in updateCatalog(), so it's safer doing it there. http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/dml-exec-state.h File be/src/runtime/dml-exec-state.h: http://gerrit.cloudera.org:8080/#/c/16545/1/be/src/runtime/dml-exec-state.h@33 PS1, Line 33: struc > struct Done http://gerrit.cloudera.org:8080/#/c/16545/1/common/thrift/Descriptors.thrift File common/thrift/Descriptors.thrift: http://gerrit.cloudera.org:8080/#/c/16545/1/common/thrift/Descriptors.thrift@54 PS1, Line 54: struct TColumnDescriptor { : 1: required string name : > Maybe some comment here? If this struct just contains fieldId, why not just I've created a struct for extendability. But probably we won't need other fields, so I'm switching to i32. http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4616 PS1, Line 4616: // new parts are only created in case the table is partitioned. > line too long (94 > 90) Done http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4660 PS1, Line 4660: // List of all insert events that we call HMS fireInsertEvent() on. > nit: +4 indentation Done http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/16545/1/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java@213 PS1, Line 213: IcebergUtil.getBaseTable(feIcebergTable); : AppendFiles append = nativeIcebergTable.newAppend(); > These two variables not been used in this method. Done http://gerrit.cloudera.org:8080/#/c/16545/2/impala-parent/pom.xml File impala-parent/pom.xml: http://gerrit.cloudera.org:8080/#/c/16545/2/impala-parent/pom.xml@66 PS2, Line 66: >0.8.0-incubating (I changed it temporarily for my IDE) Revert it to use the environment variable. http://gerrit.cloudera.org:8080/#/c/16545/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test: http://gerrit.cloudera.org:8080/#/c/16545/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test@12 PS1, Line 12: date_col DATE, > Should't this be DATE type instead to check coverage for that as well? 'str Good idea, done. http://gerrit.cloudera.org:8080/#/c/16545/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test@23 PS1, Line 23: > You can convert this to date like CAST(date_string_col as date FORMAT 'MM/D Done http://gerrit.cloudera.org:8080/#/c/16545/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-insert.test@172 PS1, Line 172: row_regex:'$NAMENODE/test-warehouse/$DATABASE/hadoop_catalog_test/test/custom_db/int_table/data/.*.0.parq','.*','' > As I see there is no such test where you insert multiple times into a table I'm using VERIFY_IS_SUBSET here, which checks one-by-one if the rows appear in the result set. I'm also using row_regex because I don't know the actual file names. Therefore this test would also pass if I copied this row multiple times. I could also get rid of VERIFY_IS_SUBSET, but then I would have to list all the files, even the ones that we might want to hide in the future. I opened IMPALA-10254 to load Iceberg files differently, I think we'll need such tests in the context of that Jira. http://gerrit.cloudera.org:8080/#/c/16545/2/tests/stress/test_acid_stress.py File tests/stress/test_acid_stress.py: http://gerrit.cloudera.org:8080/#/c/16545/2/tests/stress/test_acid_stress.py@1 PS2, Line 1: # Licensed to the Apache Software Foundation (ASF) under one Remove this file after rebase -- To view, visit http://gerrit.cloudera.org:8080/16545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5690fb6c2cc51f0033fa26caf8597c80a11bcd8e Gerrit-Change-Number: 16545 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: wangsheng <sky...@163.com> Gerrit-Comment-Date: Mon, 19 Oct 2020 09:36:05 +0000 Gerrit-HasComments: Yes