[ 
https://issues.apache.org/jira/browse/OOZIE-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426962#comment-17426962
 ] 

Andras Salamon commented on OOZIE-3640:
---------------------------------------

[~puru] Solr used an Option 3-style directory size calculation, and it was 
very slow on HDFS. Please check the performance, as suggested by [~dionusos]. 
I measured the performance in Solr with 100,000 to 1,000,000 files, and the 
difference was very large.

> Add support for recursive directories fs:dirSize
> ------------------------------------------------
>
>                 Key: OOZIE-3640
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3640
>             Project: Oozie
>          Issue Type: Improvement
>            Reporter: Purshotam Shah
>            Priority: Major
>
> There are three ways to do this:
>  # Use getContentSummary: This can be dangerous for NameNodes. 
> getContentSummary on a large HDFS directory can take minutes. During that 
> time, an Oozie thread is blocked, waiting for the RPC response. A 5-minute 
> workflow running a content summary on a directory that takes more than 5 
> minutes may lead to Oozie thread exhaustion. Or, if Oozie times out and 
> retries, the NN has no support for aborting a call that is being processed, 
> so there will be multiple concurrent content summaries, which may also 
> exhaust the NN's handler threads.
>  # Use getQuotaUsage: If quota is not enabled, it falls back to 
> getContentSummary, so this is as bad as getContentSummary.
>  # Use recursive listing to compute the size. Enforce a system-level limit 
> on directory size (or file count) and on recursion level; if the max-level 
> or max-size is reached, throw an exception.
> Considering all three options, Option 3 is the best. A system admin can 
> configure max-level based on system load and user use cases.
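
The bounded recursive listing described in Option 3 can be sketched roughly as 
follows. This is only an illustration, not Oozie code: the class 
BoundedDirSize, the method dirSize, the exception LimitExceededException, and 
the parameters maxDepth and maxFiles are all hypothetical names. To keep the 
sketch self-contained it walks a local directory with java.nio.file as a 
stand-in for HDFS listing calls; on HDFS each directory listing would be at 
least one NameNode RPC, which is the likely source of the slowness reported 
in the comment above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not part of Oozie): directory size via bounded recursive listing. */
final class BoundedDirSize {

    /** Thrown as soon as a configured limit is exceeded, so the caller fails fast. */
    static final class LimitExceededException extends RuntimeException {
        LimitExceededException(String message) {
            super(message);
        }
    }

    /** A directory waiting to be listed, with its depth below the root (root = 0). */
    private static final class Pending {
        final Path dir;
        final int depth;

        Pending(Path dir, int depth) {
            this.dir = dir;
            this.depth = depth;
        }
    }

    /**
     * Sums the sizes of all files under root.
     * maxDepth and maxFiles are assumed admin-configured limits (hypothetical names).
     */
    static long dirSize(Path root, int maxDepth, long maxFiles) throws IOException {
        long totalBytes = 0;
        long fileCount = 0;
        // Explicit stack instead of recursion, so deep trees cannot overflow the call stack.
        Deque<Pending> stack = new ArrayDeque<>();
        stack.push(new Pending(root, 0));

        while (!stack.isEmpty()) {
            Pending current = stack.pop();
            // One listing per directory; local stand-in for a remote listing call.
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(current.dir)) {
                for (Path entry : entries) {
                    // Do not follow symlinks, to avoid cycles.
                    BasicFileAttributes attrs = Files.readAttributes(
                            entry, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
                    if (attrs.isDirectory()) {
                        if (current.depth + 1 > maxDepth) {
                            throw new LimitExceededException(
                                    "max depth " + maxDepth + " exceeded at " + entry);
                        }
                        stack.push(new Pending(entry, current.depth + 1));
                    } else {
                        if (++fileCount > maxFiles) {
                            throw new LimitExceededException(
                                    "max file count " + maxFiles + " exceeded under " + root);
                        }
                        totalBytes += attrs.size();
                    }
                }
            }
        }
        return totalBytes;
    }
}
```

A caller would invoke something like BoundedDirSize.dirSize(dir, 3, 100000) 
and treat LimitExceededException as "directory too large to measure". The 
limits bound the number of listing calls per request, but they do not make 
each call cheaper, so the cost on large directories should still be measured.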



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
