Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8235 )
Change subject: IMPALA-5429: Multi threaded block metadata loading
......................................................................

IMPALA-5429: Multi threaded block metadata loading

Implements multi-threaded block metadata loading on the Catalog server,
where we fetch block metadata for multiple partitions of a single table
in parallel. The number of threads used to load the metadata is
controlled by the following two parameters (set at Catalog server
startup and applied to each table load):

-max_hdfs_partitions_parallel_load(default=5)
-max_nonhdfs_partitions_parallel_load(default=20)

We use different thread pool sizes for HDFS and non-HDFS tables since
non-HDFS storage supports a much higher throughput of RPC calls for
listStatus/listFiles. Based on our experiments, S3 showed a linear
speedup (up to ~113x) with an increasing number of loading threads,
whereas the HDFS speedup was limited to ~5x in un-secure clusters and up
to ~3.7x in secure clusters. We narrowed it down to scalability
bottlenecks in the HDFS RPC implementation (HADOOP-14558) on both the
server and the client side.

One thing to note here is that the thread-pool-based metadata fetching
is implemented only for loading HDFS block metadata and not for loading
HMS partition information. Our experiments showed that while loading
large partitioned tables, ~90% of the time is spent connecting to the NN
and loading the HDFS block information, and optimizing the remaining
~10% makes the code unnecessarily complex without much gain.

Additional notes:
- The multithreading approach is implemented for the following code
  paths:
  * INVALIDATE (loading from scratch),
  * REFRESH (reusing existing md),
  * ALTER TABLE ADD/RECOVER PARTITIONS.
- This patch makes the implementation of ListMap thread-safe since we
  use that data structure as shared state between multiple partition
  metadata loading threads.

Testing and Results:
- This patch doesn't add any new tests since there is enough test
  coverage already. Passed core/exhaustive runs with HDFS/S3.
- We noticed up to ~113x speedup on S3 tables (thread_pool_size=160),
  up to ~5x speedup in un-secure HDFS clusters, and ~3.7x in secure
  HDFS clusters.
- Synthesized the following two large tables on HDFS and S3 and noticed
  a significant reduction in the run time of my test DDL queries.
  (1) 100K partitions + 1 million files
  (2) 80 partitions + 250K files

100K-PARTITIONS-1M-FILES-CUSTOM-11-REFRESH-PARTITION       | -16.4%
100K-PARTITIONS-1M-FILES-CUSTOM-08-ADD-PARTITION           | -17.25%
80-PARTITIONS-250K-FILES-11-REFRESH-PARTITION              | -23.57%
80-PARTITIONS-250K-FILES-S3-08-ADD-PARTITION               | -23.87%
80-PARTITIONS-250K-FILES-09-INVALIDATE                     | -24.88%
80-PARTITIONS-250K-FILES-03-RECOVER                        | -35.90%
80-PARTITIONS-250K-FILES-07-REFRESH                        | -43.03%
100K-PARTITIONS-1M-FILES-CUSTOM-12-QUERY-PARTITIONS        | -43.93%
100K-PARTITIONS-1M-FILES-CUSTOM-05-QUERY-AFTER-INV         | -46.59%
80-PARTITIONS-250K-FILES-10-REFRESH-AFTER-ADD-PARTITION    | -48.71%
100K-PARTITIONS-1M-FILES-CUSTOM-07-REFRESH                 | -49.02%
80-PARTITIONS-250K-FILES-05-QUERY-AFTER-INV                | -49.05%
100K-PARTITIONS-1M-FILES-CUSTOM-10-REFRESH-AFTER-ADD-PARTI | -51.87%
80-PARTITIONS-250K-FILES-S3-03-RECOVER                     | -67.17%
80-PARTITIONS-250K-FILES-S3-05-QUERY-AFTER-INV             | -76.45%
80-PARTITIONS-250K-FILES-S3-07-REFRESH                     | -87.04%
80-PARTITIONS-250K-FILES-S3-10-REFRESH-AFTER-ADD-PART      | -88.57%

Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Reviewed-on: http://gerrit.cloudera.org:8080/8235
Reviewed-by: Bharath Vissapragada <bhara...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/catalog/catalog.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartitionLocationCompressor.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/JniCatalog.java
M fe/src/main/java/org/apache/impala/util/ListMap.java
9 files changed, 462 insertions(+), 242 deletions(-)

Approvals:
  Bharath Vissapragada: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8235
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Gerrit-Change-Number: 8235
Gerrit-PatchSet: 12
Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple <jbapple-imp...@apache.org>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>
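
Illustrative sketch (not the code from this change): the commit message
describes loading block metadata for many partitions of one table on a
bounded thread pool, with a thread-safe ListMap as state shared between
the loading threads. The Java below is a minimal sketch of that idea
under stated assumptions. Every name in it (ParallelPartitionLoadSketch,
SharedListMap, loadAll, fakeHostFor) is hypothetical, the pool limits 5
and 20 only mirror the flag defaults quoted above, and the file system
call is replaced by a stand-in so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelPartitionLoadSketch {

  /** Hypothetical stand-in for a thread-safe ListMap: item -> stable index. */
  public static final class SharedListMap<T> {
    private final List<T> items = new ArrayList<>();
    private final Map<T, Integer> indexes = new HashMap<>();

    /** Returns the index of item, appending it first if it is new. */
    public synchronized int getIndex(T item) {
      Integer existing = indexes.get(item);
      if (existing != null) return existing;
      items.add(item);
      indexes.put(item, items.size() - 1);
      return items.size() - 1;
    }

    public synchronized T getEntry(int index) { return items.get(index); }

    public synchronized int size() { return items.size(); }
  }

  /** Stand-in for the per-partition listing call; invents a host name. */
  public static String fakeHostFor(String partitionPath) {
    return "host-" + Math.floorMod(partitionPath.hashCode(), 3);
  }

  /**
   * Loads every partition on a fixed-size pool and returns, per partition
   * path, the index of its (fake) host in the shared map.
   */
  public static Map<String, Integer> loadAll(
      List<String> partitionPaths, boolean isHdfs, SharedListMap<String> hosts) {
    // Assumed limits, mirroring the defaults quoted in the commit message.
    int limit = isHdfs ? 5 : 20;
    int poolSize = Math.max(1, Math.min(limit, partitionPaths.size()));
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
      Map<String, Future<Integer>> pending = new LinkedHashMap<>();
      for (String path : partitionPaths) {
        // Each task touches the shared map, hence the synchronized methods.
        Callable<Integer> task = () -> hosts.getIndex(fakeHostFor(path));
        pending.put(path, pool.submit(task));
      }
      Map<String, Integer> result = new LinkedHashMap<>();
      for (Map.Entry<String, Future<Integer>> e : pending.entrySet()) {
        result.put(e.getKey(), e.getValue().get());  // blocks until done
      }
      return result;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    } catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < 100; i++) paths.add("/warehouse/t/p=" + i);
    SharedListMap<String> hosts = new SharedListMap<>();
    Map<String, Integer> loaded = loadAll(paths, false, hosts);
    System.out.println(loaded.size() + " partitions loaded, "
        + hosts.size() + " distinct hosts");
  }
}
```

The sketch guards the shared map with coarse synchronized methods because
that is the simplest way to keep the list and the index map consistent
with each other. Whether the actual ListMap change uses this or a
different mechanism is not stated in the message above.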