[
https://issues.apache.org/jira/browse/IMPALA-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Csaba Ringhofer updated IMPALA-6112:
------------------------------------
Labels: metadata performance (was: metadata perfomance)
> Improve the thread pool size detection logic while loading partitioned table
> block metadata
> -------------------------------------------------------------------------------------------
>
> Key: IMPALA-6112
> URL: https://issues.apache.org/jira/browse/IMPALA-6112
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Affects Versions: Impala 2.11.0
> Reporter: Bharath Vissapragada
> Priority: Major
> Labels: metadata, performance
>
> IMPALA-5429 added a thread pool based block metadata loading logic to
> increase the block metadata loading throughput.
> {noformat}
> /**
> * Returns the thread pool size to load the file metadata of this table.
> * 'numPaths' is the number of paths for which the file metadata should be
> loaded.
> *
> * We use different thread pool sizes for HDFS and non-HDFS tables since
> the latter
> * supports much higher throughput of RPC calls for listStatus/listFiles.
> For
> * simplicity, the filesystem type is determined based on the table's root
> path and
> * not for each partition individually. Based on our experiments, S3 showed
> a linear
> * speed up (up to ~100x) with increasing number of loading threads where
> as the HDFS
> * throughput was limited to ~5x in un-secure clusters and up to ~3.7x in
> secure
> * clusters. We narrowed it down to scalability bottlenecks in HDFS RPC
> implementation
> * (HADOOP-14558) on both the server and the client side.
> */
> private int getLoadingThreadPoolSize(int numPaths) throws CatalogException {
> Preconditions.checkState(numPaths > 0);
> FileSystem tableFs;
> try {
> tableFs = (new Path(getLocation())).getFileSystem(CONF);
> } catch (IOException e) {
> throw new CatalogException("Invalid table path for table: " +
> getFullName(), e);
> }
> int threadPoolSize = FileSystemUtil.supportsStorageIds(tableFs) ?
> MAX_HDFS_PARTITIONS_PARALLEL_LOAD :
> MAX_NON_HDFS_PARTITIONS_PARALLEL_LOAD;
> // Thread pool size need not exceed the number of paths to be loaded.
> return Math.min(numPaths, threadPoolSize);
> }
> {noformat}
> As the method comment says, we choose the thread pool size based on table
> base directory FS type. Given we support multiple filesystem types in the
> same partitioned table, this may not always be optimal. For example if the
> base table file system is on HDFS, but 90% of the partitions are on S3, the
> above method returns a smaller thread pool size which is sub-optimal. This
> jira is to fix that behavior.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]