Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14711 )

Change subject: IMPALA-8778: Support Apache Hudi Read Optimized Table
......................................................................


Patch Set 14:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableAsSelectStmt.java@68
PS14, Line 68: THdfsFileFormat.HUDI_PARQUET
> Thanks for reviewing! I put HUDI_PARQUET here because the comment above say
This is just about the CREATE TABLE AS SELECT statement, which Impala currently 
cannot support for Hudi tables. This statement creates a new table based on the 
results of the SELECT statement, therefore it would require Impala to write 
data in Hudi format.

You will still be able to define a Hudi table in a plain CREATE TABLE statement 
with partitions as well.


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java@74
PS14, Line 74: HUDI_PARQUET
> Make sense. statement like "CREATE TABLE LIKE PARQUET xxx STORED AS HUDIPAR
Right.


http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex
File fe/src/main/jflex/sql-scanner.flex:

http://gerrit.cloudera.org:8080/#/c/14711/14/fe/src/main/jflex/sql-scanner.flex@149
PS14, Line 149: hudiparquet
> I am following the naming convention of SEQUENCEFILE.
I see. I don't have a strong opinion about it; my personal preference is 
HUDI_PARQUET, but I don't mind if you keep the other for consistency. Maybe 
other reviewers will also have an opinion about it.


http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/14711/14/tests/common/test_dimensions.py@33
PS14, Line 33: hudiparquet
> I am not quite sure if I understand the .test execution sequence correctly.
OK, so Impala's end-to-end tests are organized around workloads. A workload 
basically consists of the following:

* a set of tests
* test dimensions (the parameter space of the tests, e.g. file formats, 
compression, etc.); see 
testdata/workloads/functional-query/functional-query_dimensions.csv
* a dataset, e.g. the 'functional' dataset belongs to the 'functional-query' 
workload. This dataset needs to be loaded in multiple file formats using 
different compressions. Not all combinations need to be loaded; the set is 
restricted by 'schema_constraints.csv'

Before you run any tests you need to load data into your cluster. There are 
several ways of doing this; one is to invoke buildall.sh -testdata. Among 
other things it will load the 'functional' dataset required for the 
'functional-query' workload. You can look around in the Impala shell with 
'show databases' to see which databases were created.

Of course there are some exceptions, e.g. some tests load their own data and 
create their own tables, especially when we want to test CREATE or INSERT 
statements. These typically run in a temporary 'unique database'.

Now you can run the test cases. You run a test with an 'exploration strategy', 
which means the test will run multiple times with different combinations of 
the test parameters.

If you run a test with the 'exhaustive' exploration strategy, it will run with 
all the allowed combinations of the test parameters (as mentioned above, 
restricted by 'schema_constraints.csv'). E.g. the file 
'testdata/workloads/functional-query/functional-query_exhaustive.csv' contains 
these auto-generated parameter lists. A manually selected subset of it is in 
'functional-query_core.csv'. These are the parameter combinations that we think 
are the most important; we refer to this as the 'core' exploration strategy, or 
just 'core tests'.

If you added HUDI_PARQUET to the test matrix of functional-query, all tests 
belonging to it would be executed with file_format=HUDI_PARQUET as well. But 
most of them would fail, simply because we cannot load the 'functional' 
dataset in Hudi format. So you'd need to manually exclude those tests from the 
test matrix with the help of 'cls.ImpalaTestMatrix.add_constraint()'. E.g. 
tests in test_scanners.py::TestParquet run only with file_format=='parquet'.
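To illustrate, the constraint filtering works roughly like this. This is a 
minimal self-contained sketch; TestVector and TestMatrix here are simplified 
stand-ins for the real classes under tests/common/, not the exact API:

```python
# Simplified model of how a test matrix constraint prunes the
# parameter space. Class and method names mirror the real framework
# loosely; treat them as illustrative, not as the actual Impala API.

class TestVector(object):
    """One combination of test parameters."""
    def __init__(self, values):
        self.values = values

    def get_value(self, name):
        return self.values[name]


class TestMatrix(object):
    """Holds all generated parameter combinations plus constraints."""
    def __init__(self, vectors):
        self.vectors = vectors
        self.constraints = []

    def add_constraint(self, constraint):
        # constraint is a predicate over a TestVector, just like the
        # lambda passed to cls.ImpalaTestMatrix.add_constraint().
        self.constraints.append(constraint)

    def generate_vectors(self):
        # Only vectors satisfying every constraint survive.
        return [v for v in self.vectors
                if all(c(v) for c in self.constraints)]


# All file formats the workload would otherwise generate:
matrix = TestMatrix([TestVector({'file_format': f})
                     for f in ['parquet', 'text', 'hudiparquet']])

# Restrict a Parquet-only test, the way TestParquet does:
matrix.add_constraint(lambda v: v.get_value('file_format') == 'parquet')

print([v.get_value('file_format') for v in matrix.generate_vectors()])
# -> ['parquet']
```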

Another option is to define your own separate Hudi workload. This would 
probably be the cleanest, but it would also require the most boilerplate code.

You could also cheat and add your Python test functions to the 
'functional-query' workload (without touching the test dimensions) and hardcode 
the file format in the query statements. At this point it might be OK to choose 
this option. I'll let you choose the option that works best for you, and then 
we can iterate on it later.



--
To view, visit http://gerrit.cloudera.org:8080/14711
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Gerrit-Change-Number: 14711
Gerrit-PatchSet: 14
Gerrit-Owner: Yanjia Gary Li <yanjia.gary...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Yanjia Gary Li <yanjia.gary...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Jan 2020 19:23:55 +0000
Gerrit-HasComments: Yes
