[ https://issues.apache.org/jira/browse/HIVE-14423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408281#comment-15408281 ]

Chris Nauroth commented on HIVE-14423:
--------------------------------------

Hello [~rajesh.balamohan].  Thank you for the patch.  This looks really 
valuable for S3A, WASB and other file systems backed by blob stores, but I have 
a question about whether it will change load patterns and performance 
characteristics when running on HDFS.

For HDFS, {{getContentSummary}} is a single RPC to the NameNode.  It's possibly 
the most expensive NameNode RPC, at least among the read APIs, because the 
NameNode needs to hold a lock while traversing the entire inode sub-tree.  
However, it has the benefit of completing the entire calculation for a single 
path/partition in one network call, so overall this Hive algorithm issues O(N) 
RPCs, where N = # of partitions.
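
For illustration, the existing pattern boils down to something like this (a 
minimal sketch only; the class and method names are illustrative, not the 
actual Hive code in {{StatsUtils}}):

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: one getContentSummary RPC per partition, so N partitions cost
// N NameNode round-trips regardless of how deep each partition's tree is.
public class PartitionSizeViaSummary {
  static long partitionSize(Configuration conf, Path partition) throws IOException {
    FileSystem fs = partition.getFileSystem(conf);
    return fs.getContentSummary(partition).getLength();
  }
}
{code}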

With this patch, Hive instead uses {{FileSystem#listFiles}} with the recursive 
option, which turns into multiple {{getListing}} NameNode RPCs, one for each 
sub-directory.  An individual {{getListing}} RPC is cheaper for the NameNode to 
execute than {{getContentSummary}}, but overall this algorithm requires many 
more network round-trips: O(N * M) RPCs, where N = # of partitions and M = 
average # of directories per partition.
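
Roughly what that looks like in code (again a sketch, not the actual patch 
code):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch of the recursive-listing approach: on HDFS the iterator is fed by
// one getListing RPC per sub-directory, i.e. ~M RPCs per partition.
public class PartitionSizeViaListing {
  static long partitionSize(FileSystem fs, Path partition) throws IOException {
    long size = 0;
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(partition, true);
    while (it.hasNext()) {
      size += it.next().getLen();
    }
    return size;
  }
}
{code}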

At this point in the Hive code, is it possible that the partitions refer to 
directories in the file system that are multiple levels deep with nested 
sub-directories?  I suspect the answer is yes, because the existing code used 
{{getContentSummary}}, and your patch uses the recursive option for 
{{listFiles}}.

Do you think an alternative approach would be to override {{getContentSummary}} 
in {{S3AFileSystem}} and optimize it?  That might look similar to other 
optimizations that are making use of S3 bulk listings, such as HADOOP-13208 and 
HADOOP-13371.
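
To make that concrete, such an override might take roughly this shape.  This is 
purely hypothetical code, not taken from HADOOP-13208 or HADOOP-13371; it 
assumes a flat recursive listing is cheap on S3 because the keyspace has no 
real directories, so the cost is one paged LIST call per ~1000 keys rather than 
one RPC per directory:

{code:java}
// Hypothetical method inside org.apache.hadoop.fs.s3a.S3AFileSystem
// (uses ContentSummary, LocatedFileStatus, RemoteIterator from org.apache.hadoop.fs).
@Override
public ContentSummary getContentSummary(Path f) throws IOException {
  long length = 0;
  long fileCount = 0;
  // Build the summary from one flat recursive listing instead of walking
  // the tree directory-by-directory.
  RemoteIterator<LocatedFileStatus> it = listFiles(f, true);
  while (it.hasNext()) {
    length += it.next().getLen();
    fileCount++;
  }
  // Directory counts are not very meaningful on a blob store; report the
  // root itself, and space consumed equals length (no replication on S3).
  return new ContentSummary.Builder()
      .length(length)
      .fileCount(fileCount)
      .directoryCount(1)
      .spaceConsumed(length)
      .build();
}
{code}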

Parallelizing the calls for all partitions looks valuable regardless of which 
approach we take.
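
For example, something along these lines (a sketch only; the thread count, 
error handling, and where this plugs into Hive are all open questions):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: issue the per-partition size lookups concurrently so total wall
// time is bounded by the slowest call rather than the sum of all calls.
public class ParallelPartitionSizes {
  static List<Long> partitionSizes(final FileSystem fs, List<Path> partitions,
      int threads) throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Long>> futures = new ArrayList<Future<Long>>();
      for (final Path p : partitions) {
        futures.add(pool.submit(new Callable<Long>() {
          @Override
          public Long call() throws Exception {
            return fs.getContentSummary(p).getLength();
          }
        }));
      }
      List<Long> sizes = new ArrayList<Long>();
      for (Future<Long> f : futures) {
        sizes.add(f.get());
      }
      return sizes;
    } finally {
      pool.shutdown();
    }
  }
}
{code}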

Cc [~ste...@apache.org] FYI for when he returns.

> S3: Fetching partition sizes from FS can be expensive when stats are not 
> available in metastore 
> ------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-14423
>                 URL: https://issues.apache.org/jira/browse/HIVE-14423
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HIVE-14423.1.patch
>
>
> When partition stats are not available in the metastore, Hive tries to get 
> the file sizes from the FS.
> e.g.
> {noformat}
>         at org.apache.hadoop.fs.FileSystem.getContentSummary(FileSystem.java:1487)
>         at org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForPartitions(StatsUtils.java:598)
>         at org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:235)
>         at org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:144)
>         at org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:132)
>         at org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$TableScanStatsRule.process(StatsRulesProcFactory.java:126)
>         at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>         at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
> {noformat}
> This can be quite expensive on some filesystems like S3, especially when the 
> table is partitioned (e.g. TPC-DS store_sales, which has 1000s of 
> partitions); a query can spend 1000s of seconds just waiting for this 
> information to be pulled in.
> Also, it would be good to remove the FS.getContentSummary usage for finding 
> file sizes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
