Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/14711 )
Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................

Patch Set 14:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java@68
PS14, Line 68: THdfsFileFormat.HUDI_PARQUET
> Thanks for reviewing! I put HUDI_PARQUET here because the comment above say

This is just about the CREATE TABLE AS SELECT statement, which Impala currently cannot support for Hudi tables. CREATE TABLE AS SELECT creates a new table based on the results of the SELECT statement, so supporting it would require Impala to write data in Hudi format. You will still be able to define a Hudi table, with partitions as well, in a plain CREATE TABLE statement.

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java@74
PS14, Line 74: HUDI_PARQUET
> Make sense. statement like "CREATE TABLE LIKE PARQUET xxx STORED AS HUDIPAR

Right.

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex
File fe/src/main/jflex/sql-scanner.flex:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex@149
PS14, Line 149: hudiparquet
> I am following the naming convention of SEQUENCEFILE.

I see. I don't have a strong opinion about it; my personal preference is HUDI_PARQUET, but I don't mind if you choose the other for consistency. Maybe other reviewers will also have an opinion about it.
http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py@33
PS14, Line 33: hudiparquet
> I am not quite sure if I understand the .test execution sequence correctly.

OK, so Impala's end-to-end tests are organized around workloads. A workload basically consists of:
* a set of tests
* test dimensions (the parameter space of the tests, e.g. file formats, compression, etc.), see testdata/workloads/functional-query/functional-query_dimensions.csv
* a dataset, e.g. the 'functional' dataset belongs to the 'functional-query' workload. This dataset needs to be loaded in multiple file formats using different compressions. Not all combinations need to be loaded; the set is restricted by 'schema_constraints.csv'.

Before you run any tests you need to load data into your cluster. There are several ways of doing this; one way is to invoke buildall.sh -testdata. Among other things it will load the 'functional' dataset required for the 'functional-query' workload. You can look around in the Impala shell and see which databases were created using 'show databases'. Of course there are some exceptions, e.g. some tests load their own data or create their own tables, especially when we want to test CREATE or INSERT statements. These typically run in a temporary 'unique database'.

Now you can run the test cases. You run a test with an 'exploration strategy', which means it will run the test multiple times with different combinations of test parameters. If you run a test with the 'exhaustive' exploration strategy, it will run all the allowed combinations of the test parameters (as mentioned above, restricted by 'schema_constraints.csv'). E.g. the file 'testdata/workloads/functional-query/functional-query_exhaustive.csv' contains these auto-generated parameter lists. A manually selected subset of it is in 'functional-query_core.csv'.
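As a side note, the mechanics of exhaustive exploration plus 'cls.ImpalaTestMatrix.add_constraint()' can be sketched in plain Python: build the cross product of the test dimensions, then prune vectors with constraint predicates. This is only an illustrative stand-in, not Impala's actual ImpalaTestMatrix implementation, and the dimension values below are made up for the example:

```python
# Simplified stand-in for a test matrix with constraints -- illustrative
# only, not Impala's real ImpalaTestMatrix implementation.
from itertools import product

class TestMatrix:
    def __init__(self, **dimensions):
        # dimensions: name -> list of allowed values for that dimension
        self.dimensions = dimensions
        self.constraints = []

    def add_constraint(self, predicate):
        # predicate takes a test vector (a dict) and returns True to keep it
        self.constraints.append(predicate)

    def generate_vectors(self):
        # 'exhaustive' exploration: the full cross product of all dimensions,
        # pruned by the constraints (analogous to schema_constraints.csv)
        names = list(self.dimensions)
        for values in product(*(self.dimensions[n] for n in names)):
            vector = dict(zip(names, values))
            if all(c(vector) for c in self.constraints):
                yield vector

matrix = TestMatrix(
    file_format=['text', 'parquet', 'hudiparquet'],
    compression=['none', 'snappy'],
)
# Restrict the test to Parquet only, the way a test class would via
# cls.ImpalaTestMatrix.add_constraint() in the real framework.
matrix.add_constraint(lambda v: v['file_format'] == 'parquet')

for v in matrix.generate_vectors():
    print(v)
```

With the constraint in place, only the two parquet vectors (one per compression) survive out of the six exhaustive combinations.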
The combinations in 'functional-query_core.csv' are the ones we think are the most important; we refer to them as the 'core' exploration strategy, or just 'core tests'.

If you added HUDI_PARQUET to the test matrix of functional-query, it would mean that all tests belonging to it would be executed with file_format=HUDI_PARQUET as well. But most of them would fail, simply because we cannot load the 'functional' dataset in Hudi format. So you'd need to manually exclude those tests from the test matrix with the help of 'cls.ImpalaTestMatrix.add_constraint()'. E.g. tests in test_scanners.py::TestParquet run only with file_format=='parquet'.

Another option is to define your own separate Hudi workload. This would probably be the cleanest, but it would also require the most boilerplate code.

You could also cheat and add your Python test functions to the 'functional-query' workload (without touching the test dimensions) and hardcode the file format in the query statements. At this point it might be OK to choose this option.

I'll let you choose the option that works best for you, then we can iterate on it later.

--
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 14
Gerrit-Owner: Yanjia Gary Li <yanjia.gary...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <yanjia.gary...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Jan 2020 19:23:55 +0000
Gerrit-HasComments: Yes