[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

Sumit Kumar (JIRA) Mon, 02 Jun 2014 09:29:23 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015515#comment-14015515 ]


Sumit Kumar commented on MAPREDUCE-5907:
----------------------------------------

Added new changes that:
1. Use the iterator-based listLocatedStatus APIs instead of the listStatus 
APIs. Removed the unneeded recursive flavors of the listStatus APIs that could 
cause the memory concerns raised in HADOOP-10634.
2. Change the s3n implementation to use the listLocatedStatus API abstraction; 
this validated the implementation of the new recursive APIs as well.
3. Add a test case that demonstrates how such a recursive listing benefits 
s3n. The test case simulates hourly rotated log aggregation and the processing 
of a year's worth of data. The total number of calls drops to just 10 instead 
of 360 (12 months * 30 days).
4. Fix a few bugs in InMemoryNativeFileSystemStore that were found while 
validating the test case.
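
To illustrate why an iterator-based listing helps with the memory concern, 
here is a minimal sketch. It is not the patch code: FakeStore, PagedIterator 
and the marker/maxKeys parameters are hypothetical stand-ins for an object 
store that returns one bounded batch of keys per web service call. Only one 
batch is held in memory at a time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

public class BatchedListing {
    /** Hypothetical in-memory stand-in for a store with a flat, sorted key space. */
    static class FakeStore {
        final TreeSet<String> keys = new TreeSet<>();
        int listCalls = 0; // each call models one web service round trip

        /** Up to maxKeys keys under prefix, strictly after marker (null = start). */
        List<String> list(String prefix, String marker, int maxKeys) {
            listCalls++;
            Iterable<String> tail = (marker == null)
                    ? keys.tailSet(prefix, true)
                    : keys.tailSet(marker, false);
            List<String> page = new ArrayList<>();
            for (String k : tail) {
                // keys sharing a prefix are contiguous in sorted order
                if (!k.startsWith(prefix) || page.size() == maxKeys) break;
                page.add(k);
            }
            return page;
        }
    }

    /** Lazily walks every key under a prefix, fetching one batch at a time. */
    static class PagedIterator implements Iterator<String> {
        private final FakeStore store;
        private final String prefix;
        private final int batchSize;
        private List<String> page = new ArrayList<>();
        private int pos = 0;
        private String marker = null;
        private boolean exhausted = false;

        PagedIterator(FakeStore store, String prefix, int batchSize) {
            this.store = store;
            this.prefix = prefix;
            this.batchSize = batchSize;
        }

        @Override
        public boolean hasNext() {
            if (pos < page.size()) return true;
            if (exhausted) return false;
            page = store.list(prefix, marker, batchSize);
            pos = 0;
            if (page.size() < batchSize) exhausted = true; // short batch = last one
            if (!page.isEmpty()) marker = page.get(page.size() - 1);
            return !page.isEmpty();
        }

        @Override
        public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            return page.get(pos++);
        }
    }

    /** Drains the iterator and returns how many keys it produced. */
    static int countAll(FakeStore store, String prefix, int batchSize) {
        int n = 0;
        Iterator<String> it = new PagedIterator(store, prefix, batchSize);
        while (it.hasNext()) { it.next(); n++; }
        return n;
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        for (int i = 0; i < 2500; i++) store.keys.add(String.format("logs/%04d", i));
        int n = countAll(store, "logs/", 1000);
        System.out.println(n + " entries in " + store.listCalls + " calls");
    }
}
```

With 2500 keys and a batch size of 1000, the sketch makes 3 list calls and 
never holds more than 1000 entries at once.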

I did spend some time on the Swift object store implementation, but it doesn't 
have that iterator-based abstraction (neither at the store level nor at the 
filesystem level). Looking at the recursive implementation for the Swift fs, 
it seems it would try to fetch all the files/directories from the backend in a 
single web service call. I suspect it would suffer from memory issues when 
such recursive calls are made. I may be wrong, so please correct me if I am.

[~ste...@apache.org] How should we deal with this? Are you aware of an 
iterative web service API with which we could list a Swift fs directory 
recursively, but in batches of, say, 1000 or 10000 entries (as appropriate)?

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5907
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 2.4.0
>            Reporter: Sumit Kumar
>            Assignee: Sumit Kumar
>         Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907.patch
>
>
> FileInputFormat (in both the mapreduce and mapred implementations) uses 
> recursive listing while calculating splits. However, it does this level by 
> level: to discover the files in /foo/bar, it first lists /foo/bar to get the 
> immediate children, then makes the same call on each immediate child of 
> /foo/bar to discover their immediate children, and so on. This doesn't scale 
> well for object-store-based fs implementations like s3 and swift, because 
> every listStatus call ends up being a web service call to the backend. When 
> a large number of files is considered for input, this makes the getSplits() 
> call slow. 
> This patch adds a new set of recursive list APIs that give fs 
> implementations an opportunity to optimize. The behavior remains the same 
> for other implementations (a default implementation is provided, so they 
> don't have to implement anything new). Object-store-based fs implementations, 
> however, only need a simple change that sets the recursive flag to true (as 
> shown in the patch) to improve listing performance.
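
The difference between the two listing strategies described above can be 
sketched as follows. This is a toy model, not the patch: FakeStore, 
listChildren and listRecursive are hypothetical names, each method call counts 
as one web service call, and the recursive listing is left unpaged for brevity 
(a real store would return it in batches).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeSet;

public class ListingCallCount {
    /** Hypothetical object store: flat sorted keys, "/" is only a naming convention. */
    static class FakeStore {
        final TreeSet<String> keys = new TreeSet<>();
        int calls = 0;

        /** One simulated call: immediate children of dir; subdirectories end with "/". */
        List<String> listChildren(String dir) {
            calls++;
            TreeSet<String> children = new TreeSet<>();
            for (String k : keys.tailSet(dir, true)) {
                if (!k.startsWith(dir)) break;
                int slash = k.indexOf('/', dir.length());
                children.add(slash < 0 ? k : k.substring(0, slash + 1));
            }
            return new ArrayList<>(children);
        }

        /** One simulated call: every key under the prefix, at any depth. */
        List<String> listRecursive(String prefix) {
            calls++;
            List<String> out = new ArrayList<>();
            for (String k : keys.tailSet(prefix, true)) {
                if (!k.startsWith(prefix)) break;
                out.add(k);
            }
            return out;
        }
    }

    /** Level-by-level discovery: one call per directory encountered. */
    static List<String> levelByLevel(FakeStore store, String root) {
        List<String> files = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            for (String child : store.listChildren(pending.poll())) {
                if (child.endsWith("/")) pending.add(child);
                else files.add(child);
            }
        }
        return files;
    }

    /** Sample layout (an assumption): 3 months x 4 days x 2 files per day. */
    static FakeStore sampleStore() {
        FakeStore store = new FakeStore();
        for (int m = 1; m <= 3; m++)
            for (int d = 1; d <= 4; d++)
                for (int h = 0; h < 2; h++)
                    store.keys.add(String.format("logs/%02d/%02d/h%d", m, d, h));
        return store;
    }

    public static void main(String[] args) {
        FakeStore a = sampleStore();
        int filesA = levelByLevel(a, "logs/").size();
        System.out.println("level-by-level: " + filesA + " files, " + a.calls + " calls");

        FakeStore b = sampleStore();
        int filesB = b.listRecursive("logs/").size();
        System.out.println("recursive: " + filesB + " files, " + b.calls + " calls");
    }
}
```

In this sample layout the level-by-level walk makes 16 calls (1 root + 3 
months + 12 days) to find 24 files, while the recursive listing finds the same 
24 files in a single call.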



--
This message was sent by Atlassian JIRA
(v6.2#6252)
