[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads
Sailesh Mukil has uploaded a new patch set (#5). Change subject: IMPALA-3823: Add timer to measure Parquet footer reads .. IMPALA-3823: Add timer to measure Parquet footer reads It's been observed that Parquet footer reads perform poorly especially when reading from S3. This patch adds a timer "FooterProcessingTimer" which keeps a track of the average time each split of each scan node spends in reading and processing the parquet footer. Added a new utility counter called SummaryStatsCounter which keeps track of the min, max and average values seen so far from a set of values. This counter is used to calculate the min, max and average time taken to scan and process Parquet footers per query per node. The RuntimeProfile has also been updated to keep a track of, display and serialize this new counter to thrift. BE tests have been added to verify that this counter works fine. Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/util/runtime-profile-counters.h M be/src/util/runtime-profile-test.cc M be/src/util/runtime-profile.cc M be/src/util/runtime-profile.h M common/thrift/RuntimeProfile.thrift M tests/query_test/test_scanners.py 8 files changed, 288 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/5 -- To view, visit http://gerrit.cloudera.org:8080/4371 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil
[Impala-ASF-CR] IMPALA-1654: General partition exprs in DDL operations.
Alex Behm has posted comments on this change. Change subject: IMPALA-1654: General partition exprs in DDL operations. .. Patch Set 11: (1 comment) I agree we should follow up offline to fix your env issues and update the wiki if necessary. http://gerrit.cloudera.org:8080/#/c/3942/11/fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java: Line 80: if (tableRef instanceof CollectionTableRef) { > Is it possible for an AlterTableStmt getting a CollectionTableRef? I think you could have something like this: use functional; alter table allcomplextypes.int_array_col set fileformat parquet; You can look at some existing analyzer tests to see how to create a test that has a "functional" default db. -- To view, visit http://gerrit.cloudera.org:8080/3942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Amos BirdGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Amos Bird Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-1654: General partition exprs in DDL operations.
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3942 to look at the new patch set (#12). Change subject: IMPALA-1654: General partition exprs in DDL operations. .. IMPALA-1654: General partition exprs in DDL operations. This commit handles partition related DDL in a more general way. We can now use compound predicates to specify a list of partitions in statements like ALTER TABLE DROP PARTITION and COMPUTE INCREMENTAL STATS, etc. It will also make sure some statements only accept one partition at a time, such as PARTITION SET LOCATION and LOAD DATA. ALTER TABLE ADD PARTITION remains using the old PartitionKeyValue's logic. The changed partition related DDLs are as follows, Table: p (i int) partitioned by (j int, k string) Partitions: +---+---+---++--+--+---+ | j | k | #Rows | #Files | Size | Bytes Cached | Cache Replication | +---+---+---++--+--+---+ | 1 | a | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 1 | b | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 1 | c | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 2 | d | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 2 | e | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 2 | f | -1| 0 | 0B | NOT CACHED | NOT CACHED| | Total | | -1| 0 | 0B | 0B | | +---+---+---++--+--+---+ 1. show files in p partition (j<2, k='a'); 2. alter table p partition (j<2, k in ("b","c") set cached in 'testPool'; // j can appear more than once, 3.1. alter table p partition (j<2, j>0, k<>"d") set uncached; // it is the same as 3.2. alter table p partition (j<2 and j>0, not k="e") set uncached; // we can also do 'or' 3.3. alter table p partition (j<2 or j>0, k like "%") set uncached; // missing 'k' matches all values of k 4. alter table p partition (j<2) set fileformat textfile; 5. alter table p partition (k rlike ".*") set serdeproperties ("k"="v"); 6. alter table p partition (j is not null) set tblproperties ("k"="v"); 7. alter table p drop partition (j<2); 8. compute incremental stats p partition(j<2); The remaining old partition related DDLs are as follows, 1. load data inpath '/path/from' into table p partition (j=2, k="d"); 2. alter table p add partition (j=2, k="g"); 3. alter table p partition (j=2, k="g") set location '/path/to'; 4. insert into p partition (j=2, k="g") values (1), (2), (3); General partition expressions or partially specified partition specs allows partition predicates to return empty partition set no matter 'IF EXISTS' is specified. Examples: [localhost.localdomain:21000] > alter table p drop partition (j=2, k="f"); Query: alter table p drop partition (j=2, k="f") +-+ | summary | +-+ | Dropped 1 partition(s). | +-+ Fetched 1 row(s) in 0.78s [localhost.localdomain:21000] > alter table p drop partition (j=2, k<"f"); Query: alter table p drop partition (j=2, k<"f") +-+ | summary | +-+ | Dropped 2 partition(s). | +-+ Fetched 1 row(s) in 0.41s [localhost.localdomain:21000] > alter table p drop partition (k="a"); Query: alter table p drop partition (k="a") +-+ | summary | +-+ | Dropped 1 partition(s). | +-+ Fetched 1 row(s) in 0.25s [localhost.localdomain:21000] > show partitions p; Query: show partitions p +---+---+---++--+--+---+ | j | k | #Rows | #Files | Size | Bytes Cached | Cache Replication | +---+---+---++--+--+---+ | 1 | b | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 1 | c | -1| 0 | 0B | NOT CACHED | NOT CACHED| | Total | | -1| 0 | 0B | 0B | | +---+---+---++--+--+---+ Fetched 3 row(s) in 0.01s Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f --- M be/src/service/query-exec-state.cc M be/src/service/query-exec-state.h M common/thrift/CatalogService.thrift M common/thrift/Frontend.thrift M common/thrift/JniCatalog.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/com/cloudera/impala/analysis/AlterTableDropPartitionStmt.java M fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetCachedStmt.java M fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetFileFormatStmt.java M fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetLocationStmt.java M
[Impala-ASF-CR] IMPALA-1654: General partition exprs in DDL operations.
Amos Bird has posted comments on this change. Change subject: IMPALA-1654: General partition exprs in DDL operations. .. Patch Set 11: (1 comment) Thanks for helping me fix the auditing bugs. I wish this wiki may help me get my local env working. https://cwiki.apache.org/confluence/display/IMPALA/How+to+load+and+run+Impala+tests http://gerrit.cloudera.org:8080/#/c/3942/11/fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java: Line 80: if (tableRef instanceof CollectionTableRef) { > Could you add a test for this? Is it possible for an AlterTableStmt getting a CollectionTableRef? -- To view, visit http://gerrit.cloudera.org:8080/3942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Amos BirdGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Amos Bird Gerrit-HasComments: Yes
[Impala-ASF-CR] Bump Kudu version to 1.0-RC1
Dimitris Tsirogiannis has posted comments on this change. Change subject: Bump Kudu version to 1.0-RC1 .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/4420 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibbe554d6782212f91db07757f429c5571a7a44da Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew JacobsGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3739: Enable stress tests on Kudu
Dimitris Tsirogiannis has uploaded a new patch set (#4). Change subject: IMPALA-3739: Enable stress tests on Kudu .. IMPALA-3739: Enable stress tests on Kudu This commit modifies the stress test framework to run TPC-H and TPC-DS workloads against Kudu. The follwing changes are included in this commit: 1. Created template files with DDL and DML statements for loading TPC-H and TPC-DS data in Kudu 2. Created a script (load-tpc-kudu.py) to load data in Kudu. The script is invoked by the stress test runner to load test data in an existing Impala/Kudu cluster (both local and CM-managed clusters are supported). 3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL files with TPC-H queries for Kudu were added in a previous patch. 4. Modified the stress test runner to take additional parameters specific to Kudu (e.g. kudu master addr) The stress test runner for Kudu was tested on EC2 clusters for both TPC-H and TPC-DS workloads. Missing functionality: * No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu. * Not all supported TPC-DS queries are included. Currently, only the TPC-DS queries from the testdata/workloads/tpcds/queries directory were modified to run against Kudu. Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 --- A testdata/bin/load-tpc-kudu.py A testdata/datasets/tpcds/tpcds_kudu_template.sql A testdata/datasets/tpch/tpch_kudu_template.sql A testdata/workloads/tpcds/queries/tpcds-kudu-q19.test A testdata/workloads/tpcds/queries/tpcds-kudu-q27.test A testdata/workloads/tpcds/queries/tpcds-kudu-q3.test A testdata/workloads/tpcds/queries/tpcds-kudu-q34.test A testdata/workloads/tpcds/queries/tpcds-kudu-q42.test A testdata/workloads/tpcds/queries/tpcds-kudu-q43.test A testdata/workloads/tpcds/queries/tpcds-kudu-q46.test A testdata/workloads/tpcds/queries/tpcds-kudu-q47.test A testdata/workloads/tpcds/queries/tpcds-kudu-q52.test A testdata/workloads/tpcds/queries/tpcds-kudu-q53.test A testdata/workloads/tpcds/queries/tpcds-kudu-q55.test A testdata/workloads/tpcds/queries/tpcds-kudu-q59.test A testdata/workloads/tpcds/queries/tpcds-kudu-q6.test A testdata/workloads/tpcds/queries/tpcds-kudu-q61.test A testdata/workloads/tpcds/queries/tpcds-kudu-q63.test A testdata/workloads/tpcds/queries/tpcds-kudu-q65.test A testdata/workloads/tpcds/queries/tpcds-kudu-q68.test A testdata/workloads/tpcds/queries/tpcds-kudu-q7.test A testdata/workloads/tpcds/queries/tpcds-kudu-q73.test A testdata/workloads/tpcds/queries/tpcds-kudu-q79.test A testdata/workloads/tpcds/queries/tpcds-kudu-q8.test A testdata/workloads/tpcds/queries/tpcds-kudu-q88.test A testdata/workloads/tpcds/queries/tpcds-kudu-q89.test A testdata/workloads/tpcds/queries/tpcds-kudu-q96.test A testdata/workloads/tpcds/queries/tpcds-kudu-q98.test M tests/comparison/db_connection.py M tests/stress/concurrent_select.py 30 files changed, 2,473 insertions(+), 6 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/4327/4 -- To view, visit http://gerrit.cloudera.org:8080/4327 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Dimitris TsirogiannisGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Michael Brown
[Impala-ASF-CR] IMPALA-3739: Enable stress tests on Kudu
Dimitris Tsirogiannis has posted comments on this change. Change subject: IMPALA-3739: Enable stress tests on Kudu .. Patch Set 3: (8 comments) http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/bin/load-tpc-kudu.py File testdata/bin/load-tpc-kudu.py: PS3, Line 51: tbls_to_clean = tpch_tables if workload.lower() == 'tpch' else tpcds_tables > Maybe use the cursor to get the list of tables? That way you don't have to The change to enable drop db cascade for Kudu is in review. If you're strongly against simply doing a drop db when this is in, I'll implement your proposal. PS3, Line 81: sql_file_path = "%s/testdata/datasets/%s/%s_kudu_template.sql" > Use os.path.join() here. Done http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/datasets/tpcds/tpcds_kudu_template.sql File testdata/datasets/tpcds/tpcds_kudu_template.sql: PS3, Line 39: 'kudu.key_columns' = 'ss_sold_date_sk,ss_ticket_number, ss_item_sk' > For my education, I looked at http://www.tpc.org/tpc_documents_current_vers No intention here, I just got this wrong :) I actually found a few other inconsistencies. http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/workloads/tpcds/queries/tpcds-kudu-q19.test File testdata/workloads/tpcds/queries/tpcds-kudu-q19.test: Line 39: > I noticed none of the TPC-DS Kudu queries have RESULTS. Why? (I searched fo The TPC-DS workload is not currently enabled for Kudu in our regular (non-stress) testing framework. It requires some extra work which I left for another patch. Once this is done, these queries will also get proper RESULTS and TYPES sections. For now it seemed kind of pointless to add these sections. Besides the stress test has its own result verification mechanism. http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/workloads/tpcds/queries/tpcds-kudu-q47.test File testdata/workloads/tpcds/queries/tpcds-kudu-q47.test: PS3, Line 33: ,round(v1_lead.sum_sales, 2) nsum > Nit: tab character. Done http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/workloads/tpcds/queries/tpcds-kudu-q65.test File testdata/workloads/tpcds/queries/tpcds-kudu-q65.test: PS3, Line 55: order by : s_store_name, : i_item_desc, : sc.revenue, : i_current_price, : i_wholesale_cost, : i_brand > The ORDER BY has more columns than the TPC-DS-for-HDFS counterpart. Any rea I had trouble getting consistent results across multiple invocations of this query unless I included these additional columns here. Added a comment. http://gerrit.cloudera.org:8080/#/c/4327/3/tests/stress/concurrent_select.py File tests/stress/concurrent_select.py: PS3, Line 1463: tpch_kudu_queries = load_tpc_queries("tpch", "kudu") > Change "kudu" to load_in_kudu=True Good catch, thanks. Done PS3, Line 1468: tpcds_kudu_queries = load_tpc_queries("tpcds", "kudu") > Change "kudu" to load_in_kudu=True Done -- To view, visit http://gerrit.cloudera.org:8080/4327 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Dimitris TsirogiannisGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Michael Brown Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4116: Remove 'cdh' from version string
Jim Apple has posted comments on this change. Change subject: IMPALA-4116: Remove 'cdh' from version string .. Patch Set 2: Code-Review+2 I've rethought it and I'm convinced this has a very low probability of causing issues. -- To view, visit http://gerrit.cloudera.org:8080/4421 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7754538a23e73dcdebc6e3df509f357cbe03198c Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars VolkerGerrit-Reviewer: Jim Apple Gerrit-Reviewer: Lars Volker Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
stak...@cloudera.com has uploaded a new patch set (#2). Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. IMPALA-4101: qgen: Hive join predicates should only contains equality functions Background: Hive only supports equi-joins in its JOIN clause, while Postgres and Impala support more complex functions such as <, <=, >, >=, etc. This change modifies the QueryGenerator._create_relational_join_condition and QueryGenerator._create_boolean_func_tree methods to only construct equality join conditions under certain conditions. The _create_boolean_func_tree method is invoked via QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause -> _create_relational_join_condition -> _create_boolean_func_tree. This method is invoked when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of functions that would typically be found in any of these clauses. Changes: The parameter "signatures" is added to the method _create_boolean_func_tree, and it lists out all the allowed signatures the function is allowed to use. Previously, this list of signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if "signatures" is not specified, then the code defaults back to the results of that method. A new method in the DefaultProfile called get_allowed_join_signatures is introduced and returns a list of function signatures that are allowed within a JOIN clause. The DefaultProfile allows all given signatures, while the HiveProfile only allows for the Equals and And functions, as well as any function that operates over only one column. The reason for these restrictions is that Hive only allows equality joins, does not allow OR operators in the join clause, and does not allow any operator that operates over multiple columns. Note that the _create_boolean_func_tree still allows for OR operators due to some logic around it's "and_or_fill_ratio" variable. The plan is to fix this in a future patch that specifically focuses on removing OR operators from Hive JOIN clauses. Minor change to discrepancy_searcher so that the logs print out "Hive" instead of "Impala" when running against Hive. Testing: * Added a new unit test that ensures the HiveProfile only returns equality joins * Tested against Hive locally * Tested against Impala via Leopard * Tested against Impala via the Discrepancy Checker inside an Impala Docker container Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 --- M tests/comparison/discrepancy_searcher.py M tests/comparison/query_generator.py M tests/comparison/query_profile.py A tests/comparison/tests/hive/test_create_join_clause.py 4 files changed, 110 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/2 -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com
[Impala-ASF-CR] IMPALA-3980: qgen: re-enable Hive as a target database
stak...@cloudera.com has posted comments on this change. Change subject: IMPALA-3980: qgen: re-enable Hive as a target database .. Patch Set 5: > Uploaded patch set 5: Patch Set 4 was rebased. @Michael, not sure what happened there, but I just pushed the rebased version. -- To view, visit http://gerrit.cloudera.org:8080/4011 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifb1199b50a5b65c21de7876fb70cc03bda1a9b46 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com Gerrit-Reviewer: David KnuppGerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: stak...@cloudera.com Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4116: Remove 'cdh' from version string
Lars Volker has uploaded a new patch set (#2). Change subject: IMPALA-4116: Remove 'cdh' from version string .. IMPALA-4116: Remove 'cdh' from version string Change-Id: I7754538a23e73dcdebc6e3df509f357cbe03198c --- M bin/save-version.sh 1 file changed, 1 insertion(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/21/4421/2 -- To view, visit http://gerrit.cloudera.org:8080/4421 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I7754538a23e73dcdebc6e3df509f357cbe03198c Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars VolkerGerrit-Reviewer: Jim Apple
[Impala-ASF-CR] Make gen build version.py resilient to a failing git rev-parse
Sailesh Mukil has posted comments on this change. Change subject: Make gen_build_version.py resilient to a failing git rev-parse .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/4411/3/bin/gen_build_version.py File bin/gen_build_version.py: Line 35: # Redirecting stdout and stderr to os.devnull > nit: You could remove both parameters altogether, since None is the default Didn't realize this earlier, but None actually doesn't redirect stdout or stderr. So it does end up getting printed on stdout/stderr. I've added a change to redirect to os.devnull. -- To view, visit http://gerrit.cloudera.org:8080/4411 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id7af33502bbb70185dc15ffca6219436a616f25b Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: Yes
[Impala-ASF-CR] Make gen build version.py resilient to a failing git rev-parse
Lars Volker has posted comments on this change. Change subject: Make gen_build_version.py resilient to a failing git rev-parse .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/4411/3/bin/gen_build_version.py File bin/gen_build_version.py: Line 35: can_obtain_git_hash = call(['git', 'rev-parse', 'HEAD'], stdout=None, stderr=None) == 0 nit: You could remove both parameters altogether, since None is the default for both. -- To view, visit http://gerrit.cloudera.org:8080/4411 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id7af33502bbb70185dc15ffca6219436a616f25b Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: Yes
[Impala-ASF-CR] Make gen build version.py resilient to a failing git rev-parse
Lars Volker has posted comments on this change. Change subject: Make gen_build_version.py resilient to a failing git rev-parse .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/4411/2/bin/gen_build_version.py File bin/gen_build_version.py: Line 35: can_obtain_git_hash = call(['git', 'rev-parse', 'HEAD'], stdout=PIPE, stderr=PIPE) == 0 Python docs discourage using PIPE here as it can block, depending on the output volume of the subprocess (https://docs.python.org/2/library/subprocess.html). However, in your case you're not reading the output, so you could just use None for both instead, which is also the default. Line 43: # SAVE_VERSION_SCRIPT will generate a dummy version.info file if we cannot obtain the Maybe replace L54 with this? http://gerrit.cloudera.org:8080/#/c/4411/2/bin/save-version.sh File bin/save-version.sh: Line 25: GIT_HASH=$(git rev-parse HEAD) > /dev/null I think this should be 2> /dev/null, no? -- To view, visit http://gerrit.cloudera.org:8080/4411 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id7af33502bbb70185dc15ffca6219436a616f25b Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-1654: General partition exprs in DDL operations.
Alex Behm has posted comments on this change. Change subject: IMPALA-1654: General partition exprs in DDL operations. .. Patch Set 11: (6 comments) Amos, sorry to meddle with your patch but it seemed easier/faster this way since I have no idea what's wrong with your env. I fixed the AuthorizationTest and AuditingTest for you. Please look at the changes carefully and try to apply the same principles in other places in your patch. I added comments to explain what the bug was. I'm still looking at those other failing tests. Hopefully we can fix your env so you can repro yourself. http://gerrit.cloudera.org:8080/#/c/3942/11/fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/AlterTableStmt.java: Line 70: public void analyze(Analyzer analyzer) throws AnalysisException { The changes here lead to a bunch of changes in error messages which I think are fine. Can you fix the expected error messages in the FE tests, e.g., AnalyzeDDLTest? Line 72: TableRef tableRef = new TableRef(tableName_.toPath(), null, Privilege.ALTER); This version is better because: * we only ask the catalog for this table once, avoiding potential inconsistency problems * we register the appropriate audit and auth events exactly once in tableRef.analyze() * more streamlined code Line 80: if (tableRef instanceof CollectionTableRef) { Could you add a test for this? http://gerrit.cloudera.org:8080/#/c/3942/10/fe/src/main/java/com/cloudera/impala/analysis/ShowFilesStmt.java File fe/src/main/java/com/cloudera/impala/analysis/ShowFilesStmt.java: Line 61: TableRef tableRef = new TableRef(tableName_.toPath(), null, Privilege.VIEW_METADATA); I'm explaining the auth/audit bug in the following comments. Line 65: if (!(table_ instanceof HdfsTable)) { This registers audit and auth events for the VIEW_METADATA privilege. Line 74: partitionSet_.setPartitionShouldExist(); This registers audit and auth events for the SELECT privilege - which is wrong and caused tests to fail. -- To view, visit http://gerrit.cloudera.org:8080/3942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Amos BirdGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Amos Bird Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-1654: General partition exprs in DDL operations.
Alex Behm has uploaded a new patch set (#11). Change subject: IMPALA-1654: General partition exprs in DDL operations. .. IMPALA-1654: General partition exprs in DDL operations. This commit handles partition related DDL in a more general way. We can now use compound predicates to specify a list of partitions in statements like ALTER TABLE DROP PARTITION and COMPUTE INCREMENTAL STATS, etc. It will also make sure some statements only accept one partition at a time, such as PARTITION SET LOCATION and LOAD DATA. ALTER TABLE ADD PARTITION remains using the old PartitionKeyValue's logic. The changed partition related DDLs are as follows, Table: p (i int) partitioned by (j int, k string) Partitions: +---+---+---++--+--+---+ | j | k | #Rows | #Files | Size | Bytes Cached | Cache Replication | +---+---+---++--+--+---+ | 1 | a | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 1 | b | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 1 | c | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 2 | d | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 2 | e | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 2 | f | -1| 0 | 0B | NOT CACHED | NOT CACHED| | Total | | -1| 0 | 0B | 0B | | +---+---+---++--+--+---+ 1. show files in p partition (j<2, k='a'); 2. alter table p partition (j<2, k in ("b","c") set cached in 'testPool'; // j can appear more than once, 3.1. alter table p partition (j<2, j>0, k<>"d") set uncached; // it is the same as 3.2. alter table p partition (j<2 and j>0, not k="e") set uncached; // we can also do 'or' 3.3. alter table p partition (j<2 or j>0, k like "%") set uncached; // missing 'k' matches all values of k 4. alter table p partition (j<2) set fileformat textfile; 5. alter table p partition (k rlike ".*") set serdeproperties ("k"="v"); 6. alter table p partition (j is not null) set tblproperties ("k"="v"); 7. alter table p drop partition (j<2); 8. compute incremental stats p partition(j<2); The remaining old partition related DDLs are as follows, 1. load data inpath '/path/from' into table p partition (j=2, k="d"); 2. alter table p add partition (j=2, k="g"); 3. alter table p partition (j=2, k="g") set location '/path/to'; 4. insert into p partition (j=2, k="g") values (1), (2), (3); General partition expressions or partially specified partition specs allows partition predicates to return empty partition set no matter 'IF EXISTS' is specified. Examples: [localhost.localdomain:21000] > alter table p drop partition (j=2, k="f"); Query: alter table p drop partition (j=2, k="f") +-+ | summary | +-+ | Dropped 1 partition(s). | +-+ Fetched 1 row(s) in 0.78s [localhost.localdomain:21000] > alter table p drop partition (j=2, k<"f"); Query: alter table p drop partition (j=2, k<"f") +-+ | summary | +-+ | Dropped 2 partition(s). | +-+ Fetched 1 row(s) in 0.41s [localhost.localdomain:21000] > alter table p drop partition (k="a"); Query: alter table p drop partition (k="a") +-+ | summary | +-+ | Dropped 1 partition(s). | +-+ Fetched 1 row(s) in 0.25s [localhost.localdomain:21000] > show partitions p; Query: show partitions p +---+---+---++--+--+---+ | j | k | #Rows | #Files | Size | Bytes Cached | Cache Replication | +---+---+---++--+--+---+ | 1 | b | -1| 0 | 0B | NOT CACHED | NOT CACHED| | 1 | c | -1| 0 | 0B | NOT CACHED | NOT CACHED| | Total | | -1| 0 | 0B | 0B | | +---+---+---++--+--+---+ Fetched 3 row(s) in 0.01s Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f --- M be/src/service/query-exec-state.cc M be/src/service/query-exec-state.h M common/thrift/CatalogService.thrift M common/thrift/Frontend.thrift M common/thrift/JniCatalog.thrift M fe/src/main/cup/sql-parser.cup M fe/src/main/java/com/cloudera/impala/analysis/AlterTableDropPartitionStmt.java M fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetCachedStmt.java M fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetFileFormatStmt.java M fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetLocationStmt.java M fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetStmt.java M fe/src/main/java/com/cloudera/impala/analysis/AlterTableSetTblProperties.java M
[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
Tim Armstrong has posted comments on this change. Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder .. Patch Set 14: (53 comments) http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/analytic-eval-node.cc File be/src/exec/analytic-eval-node.cc: Line 207: state, Substitute(PREPARE_FAILED_ERROR_MSG, "write")); > old code and agg uses state_->block_mgr()->MemLimitTooLowError(). Any reas Historical reasons it looks like: the PrepareForRead() version below diverged from the PAGG/PHJ versions. Switched to the block_mgr version for consistency. http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/blocking-join-node.cc File be/src/exec/blocking-join-node.cc: PS14, Line 297: SCOPED_TIMER(build_timer_); > We used to have two different timers in PHJ for measuring the time to hash It's not clear to me that the original timers were well thought out. It didn't look like there was a timer for how long it took to build the hash table - that was lumped into 'build_timer_'. I added two new timers with clearer meanings that track the total hash table build timer and time spent partitioning rows. Also moved the remaining build-related timers into the Builder. Now 'built_timer_' is the initial partitioning and hash table build, and 'repartition_timer_' is the time spent repartitioning'. http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.cc File be/src/exec/partitioned-hash-join-builder.cc: Line 179: largest_fraction = max(largest_fraction, > DCHECK_GT(num_build_rows, 0); I don't think that's valid, it's casting to double to avoid crashing on the divide-by-zero. I changed it to avoid the calculation if num_build_rows = 0 to avoid that problem. PS14, Line 318: a hash table > hash tables Done. Both alternatives seem grammatically correct to me though. Line 341: // building hash tables to avoid building hash tables for partitions we'll spill anyway. > That's an interesting comment. If I understand it correctly, it means that Done with some slight tweaks. This is a lot simpler than the previous behaviour, where spilling can happen even during probing. I didn't measure how often, but it's not at all improbable, since the probe buffers are 8MB each and we could allocate up to 16 of them. Line 401: RETURN_IF_ERROR(SpillPartition()); > May help to document that failure to find any partition to spill (e.g. all I added to the comment up the top of the loop to make the termination more explicit. PS14, Line 529: PhjBuilder::HashTableStoresNulls() > This seems to belong to PartitionedHashJoinNode conceptually. I feel like it makes sense here though since the builder owns the hash tables. Plumbing-wise it's also easier since PhjBuilder needs this info and starts executing before PartitionedHashJoinNode. Line 646: do { > Please see comments in BlockingJoinNode, it would be great to retain timer Done PS14, Line 782: process_build_batch_fn_level0 > 'process_build_batch_fn_level0_' I just copied this verbatim - it seems to be referring to the local variable though. http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.h File be/src/exec/partitioned-hash-join-builder.h: Line 425: > nit: unnecessary blank line ? Done http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-node.cc File be/src/exec/partitioned-hash-join-node.cc: PS14, Line 151: : : > This may be important information to retain. Not sure why I removed it. May have had a good reason but I can't recall it. PS14, Line 87: Expr::CreateExprTree(pool_, eq_join_conjunct.right > Does this overlap with CreateExprTree() in PhjBuilder::Init() ? Same questi They are redundant currently, but the idea is that PhjBuilder and PhjNode will exist more independently in different plan fragments, so I wanted to encapsulate the state where possible. The expr contexts need to be independent to eventually support many probe sides : 1 build side for broadcast joins. I think documenting this explicitly may be confusing because it is sort-of explaining the current state of things relative to the previous way of doing things. Whereas I think with the separate build side, the default assumption is that builder data structures are private and not shared with the exec node. It may make sense to share Expr trees globally between fragment instances as part of the multithreading changes, but I don't think it's worth trying to share them here until we have the right infrastructure. Line 213: for (unique_ptr& partition : spilled_partitions_) > missing { } Ah, missed that clang-format wrapped it. PS14, Line 494: to output > outputting Done PS14, Line 585: hash_partitions_ > 'hash_partitions_' Done Line 594: continue; > Is there a reason why we cannot call OutputUnmatchedBuild() directly at thi I
[Impala-ASF-CR] IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables .. Patch Set 1: (28 comments) Ok, here's a bit to get started. Just made my way through most of analysis. Still have a lot to get through but figured I'll give you the feedback I have while I do a few other things. http://gerrit.cloudera.org:8080/#/c/4414/1//COMMIT_MSG Commit Message: PS1, Line 21: When attempting to create an external : table "foo" in database "bar", Impala will search for a Kudu table : name "foo.bar" and "bar" (Kudu doesn't have database name spaces : yet.) I think this sentence is duplicated by #10. Maybe remove this part. This item is about reading the columns, right? The issue of table name resolution is separate IMO, and covered by #10, though consider moving them next together :) PS1, Line 25: The Kudu table is now required to exist at the time of creation in : Impala. for external tables? or if this is more general can you explain a bit more? PS1, Line 31: default value can it still be overridden? PS1, Line 39: Eventually Hive should have the needed : functionality and the Kudu delegate (renamed in this patch to KuduCatalogOpExecutor) : can be removed. maybe, it's not clear if impala would want to keep the ability to do this on its own... Remove? PS1, Line 54: If the database is "default" then : the Kudu table name will not include the database. hm... I wonder if it'd be easier just to keep it standardized and always have the name, i.e. using default in this case. What do you think? I'll see if I change my mind after reading more code. PS1, Line 56: 11) Several improvements in the grammar related to the family : of CREATE TABLE statements. I don't think this bullet adds much value to the commit message. Everything else is more user visible, this is just some necessary code cleanup. PS1, Line 58: 12) Added new tests and modified existing Kudu test to use the new : CREATE TABLE syntax. I think this can be removed too, it should be implied there are tests and it takes away from the billion other "features" in this patch :) PS1, Line 60: source of truth Does this mean we always hit Kudu for metadata? In the Catalog? In the impalad's Catalog? PS1, Line 60: insteads instead PS1, Line 64: Not included in this commit: : - Additional column properties such as nullability and compression : encodings. I don't think you need to include this, we have a separate JIRA for this. http://gerrit.cloudera.org:8080/#/c/4414/1/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: FYI I'm not looking at this very carefully because it was reviewed previously https://gerrit.cloudera.org/#/c/2865/ Let me know if this has changed much since then. PS1, Line 405: view_column_def_list, view_column_defs; separate nonterminals if they dont fit? PS1, Line 458: key_ident; does this not need Boolean like those below? http://gerrit.cloudera.org:8080/#/c/4414/1/fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java File fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java: I'm not in love with the TableDef and TableDefOptions stuff because it feels like that's parser implementation concerns bleeding into analysis code. Do you think we need both? Can we fold the latter into TableDef? If we have to keep both, it'd be helpful to at least group the functions in this file based on their source (e.g. TableDef, TDOptions, or on this itself) with comments around the groups? Maybe within those groups they could also be sorted by visibility? There's a lot of stuff going on so I think it'd be helpful to keep it tidy. PS1, Line 240: getColumnDefs().add(KuduCatalogOpExecutor.REQUIRED_THRIFT_COLUMN_DEF); is this needed? I thought there was going to be code that fetches the columns from Kudu and stores them. PS1, Line 242: if (!getTblProperties().containsKey(KuduTable.KEY_TABLE_NAME)) { : getTblProperties().put(KuduTable.KEY_TABLE_NAME, : KuduUtil.getDefaultCreateKuduTableName(getDb(), getTbl())); : } hm... can we simplify this and only allow internal tables to be created with the name db_name.table_name? Obviously there could be conflicts and it should fail then. Then it'd be 1 less case to consider. Also, I wonder if we should prefix the underlying kudu table names with something, e.g. IMPALA_db.table? PS1, Line 275: // Kudu's data distribution rules are enforced here. A good reference is : // http://getkudu.io/docs/schema_design.html#data-distribution. As a summary: : // * Only primary key column can be used in distribution definitions :
[Impala-ASF-CR] IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh
Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh .. IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh We used to include a step in run-hbase.sh for calling a python script that queried Zookeeper to see if the HBase master was up. The original script was problematic, so we stopped using it during our mini-cluster HBase start up procedure. HBase start up issues continue to plague us, however. This patch reintroduces a Zookeeper check, with the following updates: - replace the original script with check-hbase-nodes.py - query the correct node /hbase/master, not just /hbase/rs - use the python Zookeeper library kazoo, rather than calling out to the shell and parsing the return string - since we are moving toward testing on a remote cluster, also add the capability to pass in the address for the host that provides the Zookeeper and HBase services - add an additional check that the HDFS service is running, because of an edge case where the HBase master can briefly start without a cluster running. In addition to the expected tests, this script was also tested under the conditions of IMPALA-4088, whereby the HBase RegionServer is running, but the master fails because another listening process has already taken its TCP port (60010) during startup. Change-Id: I9b81f3cfb6ea0ba7b18ce5fcd5d268f515c8b0c3 Reviewed-on: http://gerrit.cloudera.org:8080/4348 Reviewed-by: Alex BehmTested-by: Internal Jenkins --- M infra/python/deps/requirements.txt A testdata/bin/check-hbase-nodes.py M testdata/bin/run-hbase.sh D testdata/bin/wait-for-hbase-master.py 4 files changed, 176 insertions(+), 59 deletions(-) Approvals: Internal Jenkins: Verified Alex Behm: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/4348 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I9b81f3cfb6ea0ba7b18ce5fcd5d268f515c8b0c3 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: David Knupp Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: David Knupp Gerrit-Reviewer: Harrison Sheinblatt Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Ishaan Joshi Gerrit-Reviewer: Michael Brown
[Impala-ASF-CR] IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh
Internal Jenkins has posted comments on this change. Change subject: IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh .. Patch Set 9: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/4348 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b81f3cfb6ea0ba7b18ce5fcd5d268f515c8b0c3 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: David KnuppGerrit-Reviewer: Alex Behm Gerrit-Reviewer: David Knupp Gerrit-Reviewer: Harrison Sheinblatt Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Ishaan Joshi Gerrit-Reviewer: Michael Brown Gerrit-HasComments: No
[Impala-ASF-CR] Make gen build version.py resilient to a failing git rev-parse
Sailesh Mukil has posted comments on this change. Change subject: Make gen_build_version.py resilient to a failing git rev-parse .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/4411/1/bin/gen_build_version.py File bin/gen_build_version.py: PS1, Line 34: can_obtain_git_hash = False if os.system('git rev-parse HEAD') else True > This took me a second glance to understand, maybe replace with Done PS1, Line 39: if not version_file_exists or can_obtain_git_hash: > what happens if we cannot obtain the git hash and don't have a version file Done http://gerrit.cloudera.org:8080/#/c/4411/1/bin/save-version.sh File bin/save-version.sh: Line 25: GIT_HASH=$(git rev-parse HEAD) > Discard stderr? Done -- To view, visit http://gerrit.cloudera.org:8080/4411 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id7af33502bbb70185dc15ffca6219436a616f25b Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Lars Volker Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: Yes
[Impala-ASF-CR] Make gen build version.py resilient to a failing git rev-parse
Sailesh Mukil has uploaded a new patch set (#2). Change subject: Make gen_build_version.py resilient to a failing git rev-parse .. Make gen_build_version.py resilient to a failing git rev-parse It was noticed that some build processes did not checkout Impala and instead built it from a tarball. This would cause our gen_build_version script to write a blank version info everytime to the version.info file. This patch takes care of the case where if there is an already existing version.info file and we cannot get the git rev-parse output, we use the old file instead. Blank version info is written only when we don't have an old version.info file and we cannot do a git rev-parse. Change-Id: Id7af33502bbb70185dc15ffca6219436a616f25b --- M bin/gen_build_version.py M bin/save-version.sh 2 files changed, 27 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/4411/2 -- To view, visit http://gerrit.cloudera.org:8080/4411 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id7af33502bbb70185dc15ffca6219436a616f25b Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh MukilGerrit-Reviewer: Lars Volker
[Impala-ASF-CR] IMPALA-3739: Enable stress tests on Kudu
Michael Brown has posted comments on this change. Change subject: IMPALA-3739: Enable stress tests on Kudu .. Patch Set 3: (8 comments) http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/bin/load-tpc-kudu.py File testdata/bin/load-tpc-kudu.py: PS3, Line 51: tbls_to_clean = tpch_tables if workload.lower() == 'tpch' else tpcds_tables Maybe use the cursor to get the list of tables? That way you don't have to hardcode the table names L39-46. PS3, Line 81: sql_file_path = "%s/testdata/datasets/%s/%s_kudu_template.sql" Use os.path.join() here. http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/datasets/tpcds/tpcds_kudu_template.sql File testdata/datasets/tpcds/tpcds_kudu_template.sql: PS3, Line 39: 'kudu.key_columns' = 'ss_sold_date_sk,ss_ticket_number, ss_item_sk' For my education, I looked at http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.3.0.pdf and saw that for this table, the PK is ss_item_sk,ss_ticket_number . Can you explain why ss_sold_date_sk is added as a key column? http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/workloads/tpcds/queries/tpcds-kudu-q19.test File testdata/workloads/tpcds/queries/tpcds-kudu-q19.test: Line 39: I noticed none of the TPC-DS Kudu queries have RESULTS. Why? (I searched for a TODO and didn't see a reason that might explain it; maybe I missed it.) http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/workloads/tpcds/queries/tpcds-kudu-q47.test File testdata/workloads/tpcds/queries/tpcds-kudu-q47.test: PS3, Line 33: ,round(v1_lead.sum_sales, 2) nsum Nit: tab character. http://gerrit.cloudera.org:8080/#/c/4327/3/testdata/workloads/tpcds/queries/tpcds-kudu-q65.test File testdata/workloads/tpcds/queries/tpcds-kudu-q65.test: PS3, Line 55: order by : s_store_name, : i_item_desc, : sc.revenue, : i_current_price, : i_wholesale_cost, : i_brand The ORDER BY has more columns than the TPC-DS-for-HDFS counterpart. Any reason? http://gerrit.cloudera.org:8080/#/c/4327/3/tests/stress/concurrent_select.py File tests/stress/concurrent_select.py: PS3, Line 1463: tpch_kudu_queries = load_tpc_queries("tpch", "kudu") Change "kudu" to load_in_kudu=True PS3, Line 1468: tpcds_kudu_queries = load_tpc_queries("tpcds", "kudu") Change "kudu" to load_in_kudu=True -- To view, visit http://gerrit.cloudera.org:8080/4327 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Dimitris TsirogiannisGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Michael Brown Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3718: Support subset of functional-query for Kudu
Internal Jenkins has posted comments on this change. Change subject: IMPALA-3718: Support subset of functional-query for Kudu .. Patch Set 5: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/4175 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew JacobsGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Michael Brown Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3718: Support subset of functional-query for Kudu
Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-3718: Support subset of functional-query for Kudu .. IMPALA-3718: Support subset of functional-query for Kudu Adds initial support for the functional-query test workload for Kudu tables. There are a few issues that make loading the functional schema difficult on Kudu: 1) Kudu tables must have one or more columns that together constitute a unique primary key. a) Primary key columns must currently be the first columns in the table definition (KUDU-1271). b) Primary key columns cannot be nullable (KUDU-1570). 2) Kudu tables must be specified with distribution parameters. (1) limits the tables that can be loaded without ugly workarounds. This patch only includes important tables that are used for relevant tests, most notably the alltypes* family. In particular, alltypesagg is important but it does not have a set of columns that are non-nullable and form a unique primary key. As a result, that table is created in Kudu with a different name and an additional BIGINT column for a PK that is a unique index and is generated at data loading time using the ROW_NUMBER analytic function. A view is then wrapped around the underlying table that matches the alltypesagg schema exactly. When KUDU-1570 is resolved, this can be simplified. (2) requires some additional considerations and custom syntax. As a result, the DDL to create the tables is explicitly specified in CREATE_KUDU sections in the functional_schema_constraints.csv, and an additional DEPENDENT_LOAD_KUDU section was added to specify custom data loading DML that differs from the existing DEPENDENT_LOAD. TODO: IMPALA-4005: generate_schema_statements.py needs refactoring Tests that are not relevant or not yet supported have been marked with xfail and a skip where appropriate. TODO: Support remaining functional tables/tests when possible. Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc Reviewed-on: http://gerrit.cloudera.org:8080/4175 Reviewed-by: Matthew JacobsTested-by: Internal Jenkins --- M bin/impala-config.sh M testdata/bin/compute-table-stats.sh M testdata/bin/generate-schema-statements.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/kudu-delete.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/functional-query/queries/QueryTest/aggregation.test M tests/common/skip.py M tests/common/test_result_verifier.py M tests/query_test/test_aggregation.py M tests/query_test/test_exprs.py M tests/query_test/test_queries.py M tests/query_test/test_runtime_filters.py M tests/query_test/test_scanners.py M tests/query_test/test_tpcds_queries.py 19 files changed, 343 insertions(+), 80 deletions(-) Approvals: Matthew Jacobs: Looks good to me, approved Internal Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/4175 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Michael Brown
[Impala-ASF-CR] IMPALA-4110: Clean up issues found by Apache RAT.
Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-4110: Clean up issues found by Apache RAT. .. IMPALA-4110: Clean up issues found by Apache RAT. Change-Id: I5bfe77f9a871018e7a67553ed270e2df53006962 Reviewed-on: http://gerrit.cloudera.org:8080/4361 Reviewed-by: Alex BehmTested-by: Internal Jenkins --- M LICENSE.txt M be/.astylerc M be/.impala.doxy M be/src/benchmarks/expr-benchmark.cc M be/src/bufferpool/CMakeLists.txt M be/src/experiments/data-provider.cc M be/src/experiments/hashing/cache-hash-test.cc M be/src/experiments/hashing/growing-test.cc M be/src/experiments/hashing/interface/cache-hash-test.cc M be/src/experiments/hashing/interface/growing-test.cc M be/src/experiments/hashing/multilevel/cache-hash-test.cc M be/src/experiments/hashing/multilevel/growing-test.cc M be/src/experiments/hashing/prefetch/cache-hash-test.cc M be/src/experiments/hashing/prefetch/growing-test.cc M be/src/experiments/hashing/split-benchmarks/cache-hash-test.cc M be/src/experiments/hashing/split-benchmarks/growing-test.cc M be/src/experiments/hashing/split-benchmarks/partitioning-throughput-test.cc M be/src/experiments/hashing/streaming/cache-hash-test.cc M be/src/experiments/hashing/streaming/growing-test.cc M be/src/testutil/certificates-info.txt M be/src/util/asan.h M bin/README-RUNNING-BENCHMARKS M bin/check-rat-report.py D bin/cpplint.py M bin/impala-ipython M bin/impala-py.test M bin/impala-python M bin/rat_exclude_files.txt M cmake_modules/FindAvro.cmake M cmake_modules/FindBreakpad.cmake M cmake_modules/FindBzip2.cmake M cmake_modules/FindGFlags.cmake M cmake_modules/FindGLog.cmake M cmake_modules/FindHDFS.cmake M cmake_modules/FindJNI.cmake M cmake_modules/FindLdap.cmake M cmake_modules/FindLlvm.cmake M cmake_modules/FindLlvmBinaries.cmake M cmake_modules/FindLz4.cmake M cmake_modules/FindOpenSSL.cmake M cmake_modules/FindPProf.cmake M cmake_modules/FindRapidJson.cmake M cmake_modules/FindRe2.cmake M cmake_modules/FindSnappy.cmake M cmake_modules/FindThrift.cmake M cmake_modules/FindZlib.cmake M fe/src/main/java/com/cloudera/impala/analysis/AggregateInfoBase.java M fe/src/main/java/com/cloudera/impala/analysis/ExprSubstitutionMap.java M fe/src/main/java/com/cloudera/impala/analysis/PartitionSpec.java M fe/src/main/java/com/cloudera/impala/catalog/ArrayType.java M fe/src/main/java/com/cloudera/impala/catalog/MapType.java M fe/src/main/java/com/cloudera/impala/catalog/StructType.java M fe/src/main/java/com/cloudera/impala/planner/Planner.java M fe/src/test/resources/authz-policy.ini.template M fe/src/test/resources/hive-log4j.properties.template M fe/src/test/resources/log4j.properties.template M infra/python/deps/download_requirements M infra/python/deps/requirements.txt M testdata/avro_schema_resolution/create_table.sql M testdata/bin/cache_tables.py M testdata/bin/create-mini.sql M testdata/bin/load-dependent-tables.sql M testdata/bin/load-hive-builtins.sh M testdata/bin/run-hive.sh M testdata/cluster/node_templates/cdh5/etc/init.d/kms M testdata/cluster/node_templates/cdh5/etc/init.d/kudu-common M testdata/cluster/node_templates/cdh5/etc/init.d/kudu-master M testdata/cluster/node_templates/cdh5/etc/init.d/kudu-tserver M testdata/cluster/node_templates/cdh5/etc/init.d/llama-application M testdata/cluster/node_templates/common/etc/init.d/common.tmpl M testdata/cluster/node_templates/common/etc/init.d/hdfs-common M testdata/cluster/node_templates/common/etc/init.d/hdfs-datanode M testdata/cluster/node_templates/common/etc/init.d/hdfs-namenode M testdata/cluster/node_templates/common/etc/init.d/yarn-common M testdata/cluster/node_templates/common/etc/init.d/yarn-nodemanager M testdata/cluster/node_templates/common/etc/init.d/yarn-resourcemanager D testdata/data/mstr/eatwh1/EATWH1-DDL.sql D testdata/data/mstr/eatwh1/LoadTablesEATWH1.sql D testdata/data/mstr/eatwh1/csv/COST_MARKET_CLASS.TXT D testdata/data/mstr/eatwh1/csv/COST_MARKET_DEP.TXT D testdata/data/mstr/eatwh1/csv/COST_MARKET_DIV.TXT D testdata/data/mstr/eatwh1/csv/COST_REGION_CLASS.TXT D testdata/data/mstr/eatwh1/csv/COST_REGION_ITEM.TXT D testdata/data/mstr/eatwh1/csv/COST_STORE_DEP.TXT D testdata/data/mstr/eatwh1/csv/COST_STORE_ITEM.TXT D testdata/data/mstr/eatwh1/csv/FT1.TXT D testdata/data/mstr/eatwh1/csv/FT10.TXT D testdata/data/mstr/eatwh1/csv/FT11.TXT D testdata/data/mstr/eatwh1/csv/FT12.TXT D testdata/data/mstr/eatwh1/csv/FT13.TXT D testdata/data/mstr/eatwh1/csv/FT14.TXT D testdata/data/mstr/eatwh1/csv/FT15.TXT D testdata/data/mstr/eatwh1/csv/FT17.TXT D testdata/data/mstr/eatwh1/csv/FT2.TXT D testdata/data/mstr/eatwh1/csv/FT3.TXT D testdata/data/mstr/eatwh1/csv/FT4.TXT D testdata/data/mstr/eatwh1/csv/FT5.TXT D testdata/data/mstr/eatwh1/csv/FT6.TXT D testdata/data/mstr/eatwh1/csv/FT7.TXT D testdata/data/mstr/eatwh1/csv/FT8.TXT D testdata/data/mstr/eatwh1/csv/FT9.TXT D testdata/data/mstr/eatwh1/csv/INVENTORY_CURR.TXT D
[Impala-ASF-CR] IMPALA-4110: Clean up issues found by Apache RAT.
Internal Jenkins has posted comments on this change. Change subject: IMPALA-4110: Clean up issues found by Apache RAT. .. Patch Set 8: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/4361 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5bfe77f9a871018e7a67553ed270e2df53006962 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh
Harrison Sheinblatt has posted comments on this change. Change subject: IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/4348/9/testdata/bin/check-hbase-nodes.py File testdata/bin/check-hbase-nodes.py: Line 135: zk_client.stop() A problem is that zk client wants you to stop() and then close(). In this case, if there is an exception or something that gets through, the close() will be called without stop(). This may work without a 'graceful shutdown' of the client, and be OK. I've done a 2-phase context manager in the past to ensure stop() is called before close(). -- To view, visit http://gerrit.cloudera.org:8080/4348 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b81f3cfb6ea0ba7b18ce5fcd5d268f515c8b0c3 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: David KnuppGerrit-Reviewer: Alex Behm Gerrit-Reviewer: David Knupp Gerrit-Reviewer: Harrison Sheinblatt Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Ishaan Joshi Gerrit-Reviewer: Michael Brown Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3739: Enable stress tests on Kudu
Dimitris Tsirogiannis has uploaded a new patch set (#3). Change subject: IMPALA-3739: Enable stress tests on Kudu .. IMPALA-3739: Enable stress tests on Kudu This commit modifies the stress test framework to run TPC-H and TPC-DS workloads against Kudu. The follwing changes are included in this commit: 1. Created template files with DDL and DML statements for loading TPC-H and TPC-DS data in Kudu 2. Created a script (load-tpc-kudu.py) to load data in Kudu. The script is invoked by the stress test runner to load test data in an existing Impala/Kudu cluster (both local and CM-managed clusters are supported). 3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL files with TPC-H queries for Kudu were added in a previous patch. 4. Modified the stress test runner to take additional parameters specific to Kudu (e.g. kudu master addr) The stress test runner for Kudu was tested on EC2 clusters for both TPC-H and TPC-DS workloads. Missing functionality: * No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu. * Not all supported TPC-DS queries are included. Currently, only the TPC-DS queries from the testdata/workloads/tpcds/queries directory were modified to run against Kudu. Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 --- A testdata/bin/load-tpc-kudu.py A testdata/datasets/tpcds/tpcds_kudu_template.sql A testdata/datasets/tpch/tpch_kudu_template.sql A testdata/workloads/tpcds/queries/tpcds-kudu-q19.test A testdata/workloads/tpcds/queries/tpcds-kudu-q27.test A testdata/workloads/tpcds/queries/tpcds-kudu-q3.test A testdata/workloads/tpcds/queries/tpcds-kudu-q34.test A testdata/workloads/tpcds/queries/tpcds-kudu-q42.test A testdata/workloads/tpcds/queries/tpcds-kudu-q43.test A testdata/workloads/tpcds/queries/tpcds-kudu-q46.test A testdata/workloads/tpcds/queries/tpcds-kudu-q47.test A testdata/workloads/tpcds/queries/tpcds-kudu-q52.test A testdata/workloads/tpcds/queries/tpcds-kudu-q53.test A testdata/workloads/tpcds/queries/tpcds-kudu-q55.test A testdata/workloads/tpcds/queries/tpcds-kudu-q59.test A testdata/workloads/tpcds/queries/tpcds-kudu-q6.test A testdata/workloads/tpcds/queries/tpcds-kudu-q61.test A testdata/workloads/tpcds/queries/tpcds-kudu-q63.test A testdata/workloads/tpcds/queries/tpcds-kudu-q65.test A testdata/workloads/tpcds/queries/tpcds-kudu-q68.test A testdata/workloads/tpcds/queries/tpcds-kudu-q7.test A testdata/workloads/tpcds/queries/tpcds-kudu-q73.test A testdata/workloads/tpcds/queries/tpcds-kudu-q79.test A testdata/workloads/tpcds/queries/tpcds-kudu-q8.test A testdata/workloads/tpcds/queries/tpcds-kudu-q88.test A testdata/workloads/tpcds/queries/tpcds-kudu-q89.test A testdata/workloads/tpcds/queries/tpcds-kudu-q96.test A testdata/workloads/tpcds/queries/tpcds-kudu-q98.test M tests/comparison/db_connection.py M tests/stress/concurrent_select.py 30 files changed, 2,478 insertions(+), 6 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/4327/3 -- To view, visit http://gerrit.cloudera.org:8080/4327 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Dimitris TsirogiannisGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Michael Brown
[Impala-ASF-CR] IMPALA-4110: Apache RAT script on Impala tarballs.
Jim Apple has posted comments on this change. Change subject: IMPALA-4110: Apache RAT script on Impala tarballs. .. Patch Set 1: Verified+1 No GVO implications, so self-verifying -- To view, visit http://gerrit.cloudera.org:8080/4405 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic95bd38fbb90f9a901602dd91cee541b16bf4714 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4110: Apache RAT script on Impala tarballs.
Jim Apple has submitted this change and it was merged. Change subject: IMPALA-4110: Apache RAT script on Impala tarballs. .. IMPALA-4110: Apache RAT script on Impala tarballs. Apache RAT is a tool for license auditing. It will be used as part of the Apache release process. This patch includes a script for parsing its output and a file containing a list of filename globs that should be ignored. The script takes two command line parameters as input - the filename of the ignored file globs, and the filename of the RAT xml output. Change-Id: Ic95bd38fbb90f9a901602dd91cee541b16bf4714 Reviewed-on: http://gerrit.cloudera.org:8080/4405 Reviewed-by: Alex BehmTested-by: Jim Apple --- A bin/check-rat-report.py A bin/rat_exclude_files.txt 2 files changed, 191 insertions(+), 0 deletions(-) Approvals: Jim Apple: Verified Alex Behm: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/4405 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ic95bd38fbb90f9a901602dd91cee541b16bf4714 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim Apple Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Todd Lipcon