Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8894 )
Change subject: PREVIEW: IMPALA-6372: Go parallel for Hive dataload ...................................................................... Patch Set 6: (10 comments) http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py File bin/load-data.py: http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@a114 PS6, Line 114: : : : : : : : : : This comment is still needed with Impyla HS2, add to exec_hive_query_from_file. http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@42 PS6, Line 42: num_processes = multiprocessing.cpu_count() It might make sense for this to be a command line argument. http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@151 PS6, Line 151: with connect(host=hs2_host, port=hs2_port, auth_mechanism="PLAIN", \ : user=getpass.getuser(), database="default") as conn: Currently only works for non-secure cluster. This a regression compared to the Beeline execution, but even the current Beeline command doesn't work if the cluster has SSL. We never use this script on a secure cluster now, but we will want to in future. Revert to Beeline or add long comment about making this work with secure. http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@163 PS6, Line 163: out_file.write("ERROR: {0}\n{1}\n".format(query, str(e))) Might be nice to rename the file to be distinctive in the case of an error. This would make debugging an error easier. http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@275 PS6, Line 275: print "Execution of {0} SQL file {1} failed.".format(execution_type, query_file) This should include information about the log file to look at. http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@280 PS6, Line 280: def get_files_by_pattern(directory_contents, file_pattern): : return [f for f in directory_contents if file_pattern in f] Unused. Remove. http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@315 PS6, Line 315: #os.chdir(sql_dir) Remove and clarify the need for absolute paths rather than relative paths. http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@336 PS6, Line 336: # Use maximum parallelism for this part, as these are almost entirely metadata : # operations. Comment out of date. Remove http://gerrit.cloudera.org:8080/#/c/8894/6/testdata/bin/generate-schema-statements.py File testdata/bin/generate-schema-statements.py: http://gerrit.cloudera.org:8080/#/c/8894/6/testdata/bin/generate-schema-statements.py@627 PS6, Line 627: invalidate_table_stmt = "INVALIDATE METADATA {0}.{1};\n".format(db, table_name) : impala_invalidate.create.append(invalidate_table_stmt) Some tables should be able to use refresh instead. I think only Hive-created tables need invalidate. http://gerrit.cloudera.org:8080/#/c/8894/6/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/8894/6/testdata/datasets/functional/functional_schema_template.sql@330 PS6, Line 330: USE {db_name}{db_suffix}; Not needed, remove. -- To view, visit http://gerrit.cloudera.org:8080/8894 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd Gerrit-Change-Number: 8894 Gerrit-PatchSet: 6 Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Comment-Date: Sat, 06 Jan 2018 02:15:57 +0000 Gerrit-HasComments: Yes