Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/20405 )
Change subject: IMPALA-12406: OPTIMIZE statement as an alias for INSERT OVERWRITE ...................................................................... Patch Set 2: (13 comments) http://gerrit.cloudera.org:8080/#/c/20405/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20405/2//COMMIT_MSG@22 PS2, Line 22: tables with : parttition evolution I'd emphasise this restriction a bit more in the commit message, e.g. with a "RESTRICTIONS" section. Also if there are more restrictions, would be nice to see them listed in the message. http://gerrit.cloudera.org:8080/#/c/20405/2/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: http://gerrit.cloudera.org:8080/#/c/20405/2/fe/src/main/cup/sql-parser.cup@997 PS2, Line 997: // Optimize statement works only for Iceberg tables. I don't think this comment belongs here. http://gerrit.cloudera.org:8080/#/c/20405/2/fe/src/main/cup/sql-parser.cup@999 PS2, Line 999: KW_OPTIMIZE opt_kw_table table_name:table Dremio uses "OPTIMIZE TABLE". I just want to be sure that the choice of the command in Impala is something agreed by other query engines too (like Hive). Would this also work both with OPTIMIZE and OPTIMIZE TABLE? Could you add test coverage for both if haven't done so already http://gerrit.cloudera.org:8080/#/c/20405/2/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java File fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java: http://gerrit.cloudera.org:8080/#/c/20405/2/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java@31 PS2, Line 31: * Representation of an OPTIMIZE statement used to execute table maintenance tasks in I think it's beneficial for the reader to elaborate more on what maintenance tasks are possible here. It's just INSERT OVERWRITE now, but worth mentioning, and later on we can expand the comment with further functionalities. http://gerrit.cloudera.org:8080/#/c/20405/2/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java@36 PS2, Line 36: // INSERT OVERWRITE statement I don't see any value of this comment. It would be more informative that "INSERT OVERWRITE statement that this OPTIMIZE statement is translated into" or something similar. http://gerrit.cloudera.org:8080/#/c/20405/2/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java@39 PS2, Line 39: private TableName targetTableName_; This member made me wondering that the parent class, DmlStatementBase has an FeTable member called 'table_'. Shouldn't we initialise that somehow? Then the name of the table will be in the parent class and wouldn't be needed here. I'd check other classes derived from that base class how they handle that member. http://gerrit.cloudera.org:8080/#/c/20405/2/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java@48 PS2, Line 48: selectListItems.add(SelectListItem.createStarItem(null)); > We'll have to be careful with complex types. Normally they are not included Shouldn't we return an AnalysisException here when the table contains nested type columns? If the EXPAND... option is true, then at least the error message is relevant but if it's off, then "column number mismatch" error is not that usefule here for the users. I'd rather see a "we can't write complex types" error. http://gerrit.cloudera.org:8080/#/c/20405/2/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java@62 PS2, Line 62: insertStmt_.analyze(analyzer); I guess the insert statement itself would also check privileges on the target table, but to have coverage I'd add a simple Ranger test to see if the user is rejected if does not have enough privileges. http://gerrit.cloudera.org:8080/#/c/20405/2/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test: http://gerrit.cloudera.org:8080/#/c/20405/2/testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test@738 PS2, Line 738: Consider using TRUNCATE and INSERT INTO to overwrite your table. If you suggest the user to use "TRUNCATE+INSERT INTO" instead of OPTIMIZE then what happens if they just run these commands on the table? Wouldn't they wipe out all of their data from the table with TRUNCATE? http://gerrit.cloudera.org:8080/#/c/20405/2/testdata/workloads/functional-query/queries/QueryTest/iceberg-optimize.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-optimize.test: http://gerrit.cloudera.org:8080/#/c/20405/2/testdata/workloads/functional-query/queries/QueryTest/iceberg-optimize.test@4 PS2, Line 4: CREATE TABLE ice_optimize (i int, s string) All these tables in this test are non-partitioned tables. Could you please add some coverage for partitioned tables as well? I guess the expectation there is that we would have one file per partition after running OPTIMIZE. http://gerrit.cloudera.org:8080/#/c/20405/2/testdata/workloads/functional-query/queries/QueryTest/iceberg-optimize.test@6 PS2, Line 6: TBLPROPERTIES ('format-version'='2'); Does the table have to be V2 so that we can run OPTIMIZE? If yes, could you mention this in the commit msg? If not, then could you add some test coverage? I wouldn't expect much difference though, just to be on the safe side. http://gerrit.cloudera.org:8080/#/c/20405/2/testdata/workloads/functional-query/queries/QueryTest/iceberg-optimize.test@72 PS2, Line 72: ---- QUERY With this test case the purpose was to see if the delete file is gone after running OPTIMIZE, right? Could you also VERIFY the list of files then? http://gerrit.cloudera.org:8080/#/c/20405/2/testdata/workloads/functional-query/queries/QueryTest/iceberg-optimize.test@84 PS2, Line 84: ==== What I'd also test here is, assuming that OPTIMIZE makes a new snapshot for the table, to check the table history for the list of snapshots. Also I think time travelling to the pre-optimized table should still be possible, so some test coverage would be nice to see that an older snapshot is still readable. -- To view, visit http://gerrit.cloudera.org:8080/20405 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ief42537499ffe64fafdefe25c8d175539234c4e7 Gerrit-Change-Number: 20405 Gerrit-PatchSet: 2 Gerrit-Owner: Noemi Pap-Takacs <npaptak...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 29 Aug 2023 14:46:01 +0000 Gerrit-HasComments: Yes