[
https://issues.apache.org/jira/browse/OOZIE-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426962#comment-17426962
]
Andras Salamon commented on OOZIE-3640:
---------------------------------------
[~puru] Solr used an Option 3-style directory size calculation, and it was
very slow on HDFS. Please check the performance as suggested by [~dionusos]. I
measured the performance in Solr with 100,000 to 1,000,000 files, and the
difference was very large.
> Add support for recursive directories fs:dirSize
> ------------------------------------------------
>
> Key: OOZIE-3640
> URL: https://issues.apache.org/jira/browse/OOZIE-3640
> Project: Oozie
> Issue Type: Improvement
> Reporter: Purshotam Shah
> Priority: Major
>
> There are three ways to do that.
> # Use getContentSummary: This can be dangerous for NameNodes.
> getContentSummary on large HDFS directories can take minutes. During that
> time, an Oozie thread is blocked, waiting for the RPC response. A 5-minute
> workflow running a content summary on a directory that takes more than 5
> minutes may lead to Oozie thread exhaustion. Or, if Oozie times out and
> retries, the NN has no support for aborting a call that is already being
> processed, so there will be multiple concurrent content summaries, which may
> also exhaust the NN's handler threads.
> # Use getQuotaUsage: If quota is not enabled, it falls back on
> getContentSummary, so this is as bad as getContentSummary.
> # Use recursive listing to compute the size. Enforce a system-level limit on
> directory size (or file count) and on recursion level; if max-level or
> max-size is reached, throw an exception.
> Considering all three options, Option 3 is the best. A system admin can
> configure max-level based on system load and user use cases.
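
Option 3 in the description above could be sketched roughly as follows. This is only an illustration under stated assumptions: it walks the local filesystem with java.nio.file as a stand-in for HDFS rather than using the Hadoop FileSystem API, and the names BoundedDirSize, LimitExceededException, maxLevel, and maxFiles are hypothetical, not existing Oozie code or configuration. As the comment above notes, this style of calculation was slow on HDFS in Solr, so it should be measured before being adopted.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

// Hypothetical sketch: recursive directory size with admin-configured limits.
class BoundedDirSize {

    /** Thrown when max-level or the file-count limit is exceeded (hypothetical name). */
    public static class LimitExceededException extends RuntimeException {
        public LimitExceededException(String message) {
            super(message);
        }
    }

    private final int maxLevel;   // deepest subdirectory level allowed; root is level 0
    private final long maxFiles;  // maximum number of files counted per call
    private long filesSeen;

    public BoundedDirSize(int maxLevel, long maxFiles) {
        this.maxLevel = maxLevel;
        this.maxFiles = maxFiles;
    }

    /** Returns the total size in bytes of all regular files under dir. */
    public long size(Path dir) throws IOException {
        filesSeen = 0;
        return walk(dir, 0);
    }

    private long walk(Path dir, int level) throws IOException {
        if (level > maxLevel) {
            throw new LimitExceededException("max-level " + maxLevel + " exceeded at " + dir);
        }
        long total = 0;
        // List one directory at a time instead of asking for a whole-tree summary.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                    total += walk(entry, level + 1);
                } else if (Files.isRegularFile(entry, LinkOption.NOFOLLOW_LINKS)) {
                    if (++filesSeen > maxFiles) {
                        throw new LimitExceededException("file count limit " + maxFiles + " exceeded");
                    }
                    total += Files.size(entry);
                }
            }
        }
        return total;
    }
}
```

With these assumed semantics, new BoundedDirSize(0, 1000) would only accept a directory that has no subdirectories, and a larger maxLevel permits correspondingly deeper trees. Failing fast with an exception keeps the calling thread from being tied up on an unexpectedly large tree.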
--
This message was sent by Atlassian Jira
(v8.3.4#803005)