[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900622#action_12900622
 ] 

Hudson commented on HDFS-202:
-

Integrated in Hadoop-Hdfs-trunk-Commit #370 (See 
[https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/370/])


 Add a bulk FIleSystem.getFileBlockLocations
 ---

 Key: HDFS-202
 URL: https://issues.apache.org/jira/browse/HDFS-202
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client, name-node
Reporter: Arun C Murthy
Assignee: Hairong Kuang
 Fix For: 0.22.0

 Attachments: hdfsListFiles.patch, hdfsListFiles1.patch, 
 hdfsListFiles2.patch, hdfsListFiles3.patch, hdfsListFiles4.patch, 
 hdfsListFiles5.patch


 Currently map-reduce applications (specifically file-based input-formats) use 
 FileSystem.getFileBlockLocations to compute splits. However they are forced 
 to call it once per file.
 The downsides are multiple:
# Even with a few thousand files to process, the number of RPCs quickly 
 starts getting noticeable.
# The current implementation of getFileBlockLocations is too slow, since 
 each call results in a 'search' in the namesystem. Assuming a few thousand 
 input files, it results in that many RPCs and 'searches'.
 It would be nice to have a FileSystem.getFileBlockLocations which can take in 
 a directory and return the block locations for all files in that directory. 
 We could eliminate the per-file RPC and also replace the 'search' with a 'scan'.
 When I tested this for terasort, a moderate job with 8000 input files, the 
 runtime halved from the current 8s to 4s. Clearly this is much more important 
 for latency-sensitive applications...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-20 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900645#action_12900645
 ] 

Amareshwari Sriramadasu commented on HDFS-202:
--

Shouldn't we mark this feature as an incompatible change? It changed the 
signature of getListing() and broke the MapReduce build; see MAPREDUCE-2022.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-11 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897379#action_12897379
 ] 

Suresh Srinivas commented on HDFS-202:
--

Comments:
# ListPathAspects.aj - the callGetListing() method has a description that says rename
# HDFSFileLocatedStatus.java - missing banner.





[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-11 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897381#action_12897381
 ] 

Suresh Srinivas commented on HDFS-202:
--

+1 for the patch if the above comments are taken care of.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-11 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897409#action_12897409
 ] 

Hairong Kuang commented on HDFS-202:


I was not able to run ant test-patch because the trunk does not compile, but I 
verified that this patch does not introduce new Javadoc warnings and that it 
adds new tests. There were quite a few failing unit tests, but they seem 
unrelated to this patch.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-11 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897432#action_12897432
 ] 

Konstantin Shvachko commented on HDFS-202:
--

> the trunk does not compile

See 
[here|https://issues.apache.org/jira/browse/HADOOP-6900?focusedCommentId=12897389&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12897389]




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-11 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897444#action_12897444
 ] 

Hairong Kuang commented on HDFS-202:


Konstantin, the hdfs trunk should be able to compile because I've committed 
this patch. HDFS-202 is the HDFS side of HADOOP-6900!

Thanks Suresh for reviewing this patch at full speed! :-)




Re: [jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-11 Thread Konstantin Shvachko

Yes I see it compiles now.
Thanks,
--konst



[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895142#action_12895142
 ] 

Hadoop QA commented on HDFS-202:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451084/hdfsListFiles3.patch
  against trunk revision 982091.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 15 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/228/console

This message is automatically generated.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-08-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894735#action_12894735
 ] 

Hadoop QA commented on HDFS-202:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451084/hdfsListFiles3.patch
  against trunk revision 981289.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 15 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The patch appears to cause tar ant target to fail.

-1 findbugs.  The patch appears to cause Findbugs to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/testReport/
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/console

This message is automatically generated.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-30 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894070#action_12894070
 ] 

Hairong Kuang commented on HDFS-202:


As I commented in HADOOP-6890, I would prefer throwing an exception when a 
file/directory is deleted during listing. This is because getFiles is used by 
the MapReduce job client to calculate splits, so the expectation is that the 
input directories remain unchanged during job execution. It is better to fail 
the job earlier rather than later.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-29 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893883#action_12893883
 ] 

Suresh Srinivas commented on HDFS-202:
--

Unix 'ls' returns all the results in one shot. However, when fetching the 
results iteratively, the behavior is different:
# When listing a single directory, if some ls results have already been 
returned and the directory is then deleted, we should throw 
FileNotFoundException to indicate that the directory is no longer available.
# When recursively listing under a directory, if a subdirectory is deleted, the 
more appropriate response is to ignore the FileNotFound for that subdirectory 
and return the remaining results. This is consistent with what the result 
would be if the command were repeated. Moreover, if an application is 
recursively listing a large directory whose state keeps changing, it might 
otherwise have to retry many times before the listing succeeds. A sketch of 
both cases follows below.
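To make the two cases concrete, here is a minimal client-side sketch of those 
semantics (illustrative only: it is built on the existing FileSystem#listStatus 
call, and the helper class itself is hypothetical):
{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveLister {
  /** Hypothetical helper: collect all files under dir, applying the
   *  deletion semantics described above. */
  public static void listRecursive(FileSystem fs, Path dir,
      List<FileStatus> out, boolean isRoot) throws IOException {
    FileStatus[] children;
    try {
      children = fs.listStatus(dir);
    } catch (FileNotFoundException e) {
      if (isRoot) {
        throw e;   // case 1: the directory being listed was deleted
      }
      return;      // case 2: a subdirectory vanished mid-listing; skip it
    }
    for (FileStatus child : children) {
      if (child.isDir()) {
        listRecursive(fs, child.getPath(), out, false);
      } else {
        out.add(child);
      }
    }
  }
}
{code}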




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-23 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891743#action_12891743
 ] 

Hairong Kuang commented on HDFS-202:


I am not sure what we should do if a child of the input directory is a symbolic 
link. Whether the symbolic link should be resolved or not is better decided by 
applications.

It seems cleaner if the new API changes to listLocatedFileStatus(Path path), so 
that it does not traverse the subtree recursively and it returns all the 
contents of the directory, with BlockLocations piggybacked when a child is a 
file. This design decision leaves questions such as how to deal with a child 
that is a symbolic link or a directory to applications.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-23 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891763#action_12891763
 ] 

Doug Cutting commented on HDFS-202:
---

> I am not sure what we should do if a child of the input directory is a 
> symbolic link.

Handling of symlinks should be addressed in HADOOP-6870, no?




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-23 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891809#action_12891809
 ] 

Hairong Kuang commented on HDFS-202:


Hi Doug, thanks for your review comments.

Yes, handling of symlinks should be addressed in FileContext in HADOOP-6870. 
HDFS-202 serves as the discussion board for this issue, so I posted the 
question here.

My question is whether this new API should handle recursive traversal and 
symbolic link resolution. Is it cleaner if it does neither and leaves those 
decisions to applications?




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-23 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891828#action_12891828
 ] 

Doug Cutting commented on HDFS-202:
---

> My question is whether this new API should handle recursive traversal and 
> symbolic link resolution.

My intuition is that recursive file listings for open should follow symbolic 
links, since open follows symbolic links.  Recursive traversal for remove 
should not follow symbolic links, but should just remove the symbolic link, 
like remove does on a symbolic link.





[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-21 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890914#action_12890914
 ] 

Hairong Kuang commented on HDFS-202:


Taking multiple paths as input to a FileContext API and to the HDFS 
client-NameNode RPC seems to be a bad idea. It adds quite a lot of complexity 
for grouping paths by file system and for resolving symbolic links; it does not 
seem clean, and I'd like to avoid it. So here is the revised proposal:

{code}
class LocatedFileStatus extends FileStatus {
  BlockLocation [] blocks;
}
{code}
FileSystem and FileContext will have a new API
{code}
public Iterator<LocatedFileStatus> listLocatedFileStatus(Path path, 
boolean isRecursive);
{code}
This new API is similar to FileContext#listStatus in many ways, except that the 
returned LocatedFileStatus contains its block locations and, if isRecursive is 
true, all the files in the subtree rooted at the input path are returned.

Similarly in HDFS, we will have
{code}
class HdfsLocatedFileStatus extends HdfsFileStatus {
  BlockLocation[] blocks;
}
{code}
ClientProtocol will add one more parameter, boolean withLocation, to the 
existing getListing RPC.
{code}
public DirectoryListing getListing(String src,
 byte[] startAfter,
 boolean withLocation)
  throws AccessControlException, FileNotFoundException,
  UnresolvedLinkException, IOException;
{code}
If withLocation is false, the semantics are the same as before. When 
withLocation is true, the DirectoryListing will contain LocatedFileStatus entries.
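For illustration, a job client could compute splits on top of this proposal 
roughly as follows; this is a sketch against the signatures above, and both the 
getBlockLocations() accessor on LocatedFileStatus and the addSplit() helper are 
assumptions rather than part of the proposal:
{code}
// Sketch: enumerate all files under an input directory, with block
// locations piggybacked, and turn each block into one split.
Iterator<LocatedFileStatus> it =
    fs.listLocatedFileStatus(new Path("/user/joe/input"), true /* isRecursive */);
while (it.hasNext()) {
  LocatedFileStatus stat = it.next();
  for (BlockLocation blk : stat.getBlockLocations()) {
    addSplit(stat.getPath(), blk.getOffset(), blk.getLength(), blk.getHosts());
  }
}
{code}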




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-16 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889249#action_12889249
 ] 

Hairong Kuang commented on HDFS-202:


I want to explain the differences between my proposal and the previous proposal.
1. For the FileSystem API, the user can specify whether the input paths need to 
be recursively traversed. The return result is an iterator, which allows the 
input files to be fetched from the server one batch at a time, so as to avoid 
OOM exceptions when the input is huge.
2. The design of the new RPCs allows us to return HdfsFileStatus (local file 
name) instead of FileStatus (full path name), saving CPU processing time. It 
also allows us to easily limit the response size.

If nobody is against it, I will go ahead with the implementation.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-16 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889406#action_12889406
 ] 

dhruba borthakur commented on HDFS-202:
---

+1 to this proposal.

> The return result is an iterator, which allows the input files to be fetched 
> from

However, if the number of files in a directory is small (say 500), then we can 
still fetch everything in one RPC, can't we?





[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-16 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889432#action_12889432
 ] 

Hairong Kuang commented on HDFS-202:


> if the number of files in a directory is small (say 500), then we can still 
> fetch everything in one RPC, can't we?

I will reuse DFS_LIST_LIMIT, introduced in HDFS-985, whose default value is 
1000. So by default, 500 files will be fetched in one RPC.
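The batching this implies looks roughly like the following sketch (modeled on 
the iterative listing of HDFS-985 and on the three-argument getListing proposed 
earlier in this thread; the namenode/src variables and the process() helper are 
assumed to be in scope):
{code}
// Sketch: each getListing RPC returns at most DFS_LIST_LIMIT entries;
// the last name returned becomes the cursor for the next call.
byte[] startAfter = HdfsFileStatus.EMPTY_NAME;
DirectoryListing batch;
do {
  batch = namenode.getListing(src, startAfter, true /* withLocation */);
  for (HdfsFileStatus stat : batch.getPartialListing()) {
    process(stat);                    // e.g. collect piggybacked locations
  }
  startAfter = batch.getLastName();   // resume after the last entry returned
} while (batch.hasMore());
{code}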




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-14 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888626#action_12888626
 ] 

Hairong Kuang commented on HDFS-202:


The above proposed method is an API in FileSystem.

Internally in HDFS, I plan to add two new client-to-namenode RPCs:

class HdfsFileStatusAndBlockLocations {
  HdfsFileStatus fileStatus;
  BlockLocation[] blocks;
}

/**
 * Given an array of input paths, return an array of file statuses and block 
 * locations. The input array and the output array have the same size:
 * the ith item in the output array is the file status and block locations of 
 * the ith path in the input array.
 * If an input path is a directory, its block locations are empty.
 */
HdfsFileStatusAndBlockLocations[] getFileStatusAndBlockLocations(Path[] paths);

/**
 * Given an input directory, return the file statuses and block locations of 
 * its children.
 */
HdfsFileStatusAndBlockLocations[] listFileStatusAndBlockLocations(Path path);

Suppose the subtrees that represent a job's input paths contain N directories; 
these two RPCs allow a DFS client to issue N+1 RPCs to the NameNode to 
implement the file system API proposed above, as sketched below.
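A sketch of that N+1 accounting (the RPC names follow the proposal above; the 
client handle and the pathOf()/collect() helpers are illustrative assumptions):
{code}
// Sketch: one RPC for the input paths themselves, then one RPC per
// directory discovered, i.e. N+1 RPCs for N directories in the subtrees.
Deque<Path> dirs = new ArrayDeque<Path>();
for (HdfsFileStatusAndBlockLocations item
    : client.getFileStatusAndBlockLocations(inputPaths)) {          // 1 RPC
  if (item.fileStatus.isDir()) dirs.add(pathOf(item));
  else collect(item);
}
while (!dirs.isEmpty()) {
  for (HdfsFileStatusAndBlockLocations child
      : client.listFileStatusAndBlockLocations(dirs.poll())) {      // 1 RPC per dir
    if (child.fileStatus.isDir()) dirs.add(pathOf(child));
    else collect(child);
  }
}
{code}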




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-14 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888638#action_12888638
 ] 

Hairong Kuang commented on HDFS-202:


I also plan to use the same idea of iterative listing (HDFS-985) to limit the 
size of the response when listFileStatusAndBlockLocations is called on a directory.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-12 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887592#action_12887592
 ] 

Hairong Kuang commented on HDFS-202:


I am quite bothered that the proposed API returns a map. Is the reason for 
returning a map that the API does a one-level listPath? Is there a use case 
that needs only one-level expansion?

If we eventually need the block locations of all files recursively under the 
input paths, is the following API a better choice?
{code}
/**
  * @return the block locations of all files recursively under the input paths
  */
Iterator<BlockLocation> getBlockLocations(Path[] paths)
{code}
When implementing this in HDFS, we might need to issue multiple RPCs and be 
very careful to limit the size of each RPC request and response.




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2010-07-12 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887600#action_12887600
 ] 

Hairong Kuang commented on HDFS-202:


I read FileInputFormat and now understand the use case much better. The client 
needs to know the FileStatus for filtering, and there is a configuration 
parameter that specifies whether the input paths need to be traversed 
recursively. In this case, how about the following revised API?
{code}
class FileStatusAndBlockLocations {
  FileStatus fileStatus;
  BlockLocation[] blocks;
}

Iterator<FileStatusAndBlockLocations> getBlockLocations(Path[] paths, 
boolean isRecursive);
{code}




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2009-09-15 Thread Sanjay Radia (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755567#action_12755567
 ] 

Sanjay Radia commented on HDFS-202:
---

> Maybe we should punt that until someone develops an append-savvy distcp?

+1

> Why is DetailedFileStatus[] better than Map<FileStatus, BlockLocation[]>? The 
> latter seems more transparent.

I was holding out against a file system interface returning a map, but that is 
old school. Fine, I am convinced.

I suspect you also want the RPC signature to return a map (that makes me more 
nervous because most RPCs do not support that - but ours does, I guess).

-

With regard to the new FileContext API, my proposal is that it provide a single 
getBlockLocations method:

Map<FileStatus, BlockLocation[]> getBlockLocations(Path[] paths)

and abandon BlockLocation[] getBlockLocations(path, start, end).

(Of course, FileSystem will continue to support the old getBlockLocations.)
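As a usage illustration, a caller would consume that map roughly like this (a 
sketch; the fc handle, the inputPaths array, and the makeSplits() helper are 
assumed):
{code}
Map<FileStatus, BlockLocation[]> locations = fc.getBlockLocations(inputPaths);
for (Map.Entry<FileStatus, BlockLocation[]> e : locations.entrySet()) {
  // each entry pairs one file's status with all of that file's blocks
  makeSplits(e.getKey(), e.getValue());
}
{code}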






[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2009-09-15 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755621#action_12755621
 ] 

dhruba borthakur commented on HDFS-202:
---

+1




[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2009-09-09 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753215#action_12753215
 ] 

Doug Cutting commented on HDFS-202:
---

> Is the optimization for sending only partial block reports really necessary?

It may help the append-savvy distcp use case, but it is not needed in the 
mapred job submission use case. Even in the append-savvy distcp use case, it's 
not clear that it's required. Maybe we should punt that until someone develops 
an append-savvy distcp?

> Why not create a class called DetailedFileStatus which contains both the file 
> status and block locations:

Why is DetailedFileStatus[] better than Map<FileStatus, BlockLocation[]>? The 
latter seems more transparent.

> DetailedFileStatus[] = getBlockLocations(Path[] paths); // 1:1 mapping 
> between the two arrays as Doug suggested.

That was intended for the append-savvy distcp use case. The original use case 
was mapred job submission, where we typically have a list of directories. 
With directories there is not a 1:1 mapping.






[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations

2009-09-08 Thread Sanjay Radia (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752748#action_12752748
 ] 

Sanjay Radia commented on HDFS-202:
---

Is the optimization for sending only partial block reports really necessary? 
Most files have very few blocks...
Also, Arun's point about the extra call needed for getFileStatus() is valid.

Why not create a class called DetailedFileStatus which contains both the file 
status and block locations:

DetailedFileStatus[] = getBlockLocations(Path[] paths);  // 1:1 mapping between 
the two arrays, as Doug suggested

We can add the range-based variant later if we really need that optimization.
