[jira] [Updated] (IMPALA-6112) Improve the thread pool size detection logic while loading partitioned table block metadata

Csaba Ringhofer (Jira) Tue, 07 Feb 2023 03:10:04 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Csaba Ringhofer updated IMPALA-6112:
------------------------------------
    Labels: metadata performance  (was: metadata perfomance)

> Improve the thread pool size detection logic while loading partitioned table 
> block metadata
> -------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-6112
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6112
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 2.11.0
>            Reporter: Bharath Vissapragada
>            Priority: Major
>              Labels: metadata, performance
>
> IMPALA-5429 added a thread pool based block metadata loading logic to 
> increase the block metadata loading throughput. 
> {noformat}
>   /**
>    * Returns the thread pool size to load the file metadata of this table.
>    * 'numPaths' is the number of paths for which the file metadata should be 
> loaded.
>    *
>    * We use different thread pool sizes for HDFS and non-HDFS tables since 
> the latter
>    * supports much higher throughput of RPC calls for listStatus/listFiles. 
> For
>    * simplicity, the filesystem type is determined based on the table's root 
> path and
>    * not for each partition individually. Based on our experiments, S3 showed 
> a linear
>    * speed up (up to ~100x) with increasing number of loading threads where 
> as the HDFS
>    * throughput was limited to ~5x in un-secure clusters and up to ~3.7x in 
> secure
>    * clusters. We narrowed it down to scalability bottlenecks in HDFS RPC 
> implementation
>    * (HADOOP-14558) on both the server and the client side.
>    */
>   private int getLoadingThreadPoolSize(int numPaths) throws CatalogException {
>     Preconditions.checkState(numPaths > 0);
>     FileSystem tableFs;
>     try {
>       tableFs  = (new Path(getLocation())).getFileSystem(CONF);
>     } catch (IOException e) {
>       throw new CatalogException("Invalid table path for table: " + 
> getFullName(), e);
>     }
>     int threadPoolSize = FileSystemUtil.supportsStorageIds(tableFs) ?
>         MAX_HDFS_PARTITIONS_PARALLEL_LOAD : 
> MAX_NON_HDFS_PARTITIONS_PARALLEL_LOAD;
>     // Thread pool size need not exceed the number of paths to be loaded.
>     return Math.min(numPaths, threadPoolSize);
>   }
> {noformat}
> As the method comment says, we choose the thread pool size based on table 
> base directory FS type. Given we support multiple filesystem types in the 
> same partitioned table, this may not always be optimal. For example if the 
> base table file system is on HDFS, but 90% of the partitions are on S3, the 
> above method returns a smaller thread pool size which is sub-optimal. This 
> jira is to fix that behavior. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (IMPALA-6112) Improve the thread pool size detection logic while loading partitioned table block metadata

Reply via email to