[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900622#action_12900622 ]

Hudson commented on HDFS-202:
-----------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #370 (See [https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/370/])

Add a bulk FIleSystem.getFileBlockLocations
-------------------------------------------

                Key: HDFS-202
                URL: https://issues.apache.org/jira/browse/HDFS-202
            Project: Hadoop HDFS
         Issue Type: New Feature
         Components: hdfs client, name-node
           Reporter: Arun C Murthy
           Assignee: Hairong Kuang
            Fix For: 0.22.0
        Attachments: hdfsListFiles.patch, hdfsListFiles1.patch, hdfsListFiles2.patch, hdfsListFiles3.patch, hdfsListFiles4.patch, hdfsListFiles5.patch

Currently map-reduce applications (specifically file-based input-formats) use FileSystem.getFileBlockLocations to compute splits. However they are forced to call it once per file. The downsides are multiple:

# Even with a few thousand files to process, the number of RPCs quickly starts getting noticeable.
# The current implementation of getFileBlockLocations is too slow, since each call results in a 'search' in the namesystem. Assuming a few thousand input files, this results in that many RPCs and 'searches'.

It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory and return the block locations for all files in that directory. We could eliminate both the per-file RPC and the 'search' by a 'scan'. When I tested this for terasort, a moderate job with 8000 input files, the runtime halved from the current 8s to 4s. Clearly this is much more important for latency-sensitive applications...

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900645#action_12900645 ]

Amareshwari Sriramadasu commented on HDFS-202:
----------------------------------------------

Shouldn't we mark this feature as an incompatible change? It changed the signature of getListing() and broke the MapReduce build; see MAPREDUCE-2022.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897379#action_12897379 ]

Suresh Srinivas commented on HDFS-202:
--------------------------------------

Comments:
# ListPathAspects.aj - the callGetListing() method has a description that says "rename".
# HDFSFileLocatedStatus.java - missing banner.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897381#action_12897381 ]

Suresh Srinivas commented on HDFS-202:
--------------------------------------

+1 for the patch if the above comments are taken care of.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897409#action_12897409 ]

Hairong Kuang commented on HDFS-202:
------------------------------------

Not able to run ant test-patch because the trunk does not compile. But I checked that this patch does not introduce new Javadoc warnings, and it adds new tests. Quite a few unit tests were failing, but they seem unrelated to this patch.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897432#action_12897432 ]

Konstantin Shvachko commented on HDFS-202:
------------------------------------------

bq. the trunk does not compile

See [here|https://issues.apache.org/jira/browse/HADOOP-6900?focusedCommentId=12897389page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12897389]
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897444#action_12897444 ]

Hairong Kuang commented on HDFS-202:
------------------------------------

Konstantin, the hdfs trunk should be able to compile because I've committed this patch. HDFS-202 is the HDFS side of HADOOP-6900! Thanks Suresh for reviewing this patch at full speed! :-)
Re: [jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
Yes, I see it compiles now.

Thanks,
--konst

On 8/11/2010 1:47 PM, Hairong Kuang (JIRA) wrote:
> Konstantin, the hdfs trunk should be able to compile because I've committed this patch. HDFS-202 is the HDFS side of HADOOP-6900! Thanks Suresh for reviewing this patch at full speed! :-)
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895142#action_12895142 ]

Hadoop QA commented on HDFS-202:
--------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12451084/hdfsListFiles3.patch
against trunk revision 982091.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 15 new or modified tests.
-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/228/console

This message is automatically generated.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894735#action_12894735 ]

Hadoop QA commented on HDFS-202:
--------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12451084/hdfsListFiles3.patch
against trunk revision 981289.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 15 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The patch appears to cause the tar ant target to fail.
-1 findbugs. The patch appears to cause Findbugs to fail.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/console

This message is automatically generated.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894070#action_12894070 ]

Hairong Kuang commented on HDFS-202:
------------------------------------

As I commented in HADOOP-6890, I would prefer throwing an exception when a file/directory is deleted during listing. This is because getFiles is used by the MapReduce job client to calculate splits, so the expectation is that the input directories remain unchanged during job execution. It is better to fail the job earlier rather than later.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893883#action_12893883 ]

Suresh Srinivas commented on HDFS-202:
--------------------------------------

Unix 'ls' returns all the results in one shot. However, when getting the response iteratively, the behavior should be different:
# When listing a single directory, if some ls results have already been returned and the directory is deleted, we should throw FileNotFoundException, to indicate the directory is no longer available.
# When recursively listing under a directory, if a subdirectory is deleted, the more appropriate response is to ignore FileNotFound for that subdirectory and return the remaining results. This is consistent with what the result would be if the command were repeated. Further, if an application is recursively listing a large directory whose state keeps changing, it might otherwise have to retry many times.
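The two deletion behaviors described above can be sketched with a toy in-memory model. This is plain Java, not HDFS code; every class and method name here is illustrative, not part of any Hadoop API:

```java
import java.io.FileNotFoundException;
import java.util.*;

/** Simplified in-memory model of the two deletion behaviors for iterative
 *  listing: fail fast for the listed directory itself, but tolerate
 *  subdirectories that vanish during a recursive scan. */
class ListingModel {
    // path -> child paths; a path absent from the map counts as deleted
    final Map<String, List<String>> tree = new HashMap<>();

    /** Single-directory listing: throw if the directory has been deleted. */
    List<String> listOne(String dir) throws FileNotFoundException {
        List<String> kids = tree.get(dir);
        if (kids == null) throw new FileNotFoundException(dir);
        return kids;
    }

    /** Recursive listing: skip subdirectories deleted mid-scan, but
     *  still fail if the root itself is gone. */
    List<String> listRecursive(String root) throws FileNotFoundException {
        List<String> out = new ArrayList<>(listOne(root));  // root must exist
        Deque<String> pending = new ArrayDeque<>(out);
        while (!pending.isEmpty()) {
            String p = pending.poll();
            List<String> kids = tree.get(p);
            if (kids == null) continue;   // subdir deleted: ignore, keep going
            out.addAll(kids);
            pending.addAll(kids);
        }
        return out;
    }
}
```

Repeating listRecursive after a subdirectory disappears yields the same answer as the first call that skipped it, which is the consistency argument made in the comment.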
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891743#action_12891743 ]

Hairong Kuang commented on HDFS-202:
------------------------------------

I am not sure what we should do if a child of the input directory is a symbolic link. Whether the symbolic link should be resolved is better decided by applications. It seems cleaner if the new API changes to listLocatedFileStatus(Path path), so that it does not traverse the subtree recursively and returns all the contents of the directory, with BlockLocations piggybacked when a child is a file. This design decision leaves questions such as how to handle a child that is a symbolic link or a directory to the applications.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891763#action_12891763 ]

Doug Cutting commented on HDFS-202:
-----------------------------------

bq. I am not sure what should we do if a child of the input directory is a symbolic link.

Handling of symlinks should be addressed in HADOOP-6870, no?
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891809#action_12891809 ]

Hairong Kuang commented on HDFS-202:
------------------------------------

Hi Doug, thanks for your review comments. Yes, handling of symlinks should be addressed in FileContext in HADOOP-6870. HDFS-202 serves as the discussion board for this issue, so I posted the question here. My question is whether this new API should handle recursive traversal and symbolic-link resolution. Is it cleaner if it does neither and leaves those decisions to applications?
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891828#action_12891828 ]

Doug Cutting commented on HDFS-202:
-----------------------------------

bq. My question is whether this new API should handle recursive traversal and symbolic resolution.

My intuition is that recursive file listings for open should follow symbolic links, since open follows symbolic links. Recursive traversal for remove should not follow symbolic links, but should just remove the symbolic link, as remove does on a symbolic link.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890914#action_12890914 ] Hairong Kuang commented on HDFS-202:
Taking multiple paths as input to a FileContext API and to the HDFS client-NN RPC seems to be a bad idea. It adds quite a lot of complexity for grouping paths by file system and for resolving symbolic links. It does not sound clean, and I'd like to avoid it. So here is the revised proposal:
{code}
class LocatedFileStatus extends FileStatus {
  BlockLocation[] blocks;
}
{code}
FileSystem and FileContext will have a new API:
{code}
public Iterator<LocatedFileStatus> listLocatedFileStatus(Path path, boolean isRecursive);
{code}
This new API is similar to FileContext#listStatus in many ways, except that each returned LocatedFileStatus contains its block locations, and if isRecursive is true, all the files in the subtree rooted at the input path are returned. Similarly, in HDFS we will have:
{code}
class HdfsLocatedFileStatus extends HdfsFileStatus {
  BlockLocation[] blocks;
}
{code}
ClientProtocol will add one more parameter, boolean withLocation, to the existing getListing RPC:
{code}
public DirectoryListing getListing(String src, byte[] startAfter, boolean withLocation) throws AccessControlException, FileNotFoundException, UnresolvedLinkException, IOException;
{code}
If withLocation is false, the semantics are the same as before. When withLocation is true, the DirectoryListing will contain LocatedFileStatus objects.
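To make the shape of this proposal concrete, here is a minimal sketch in plain Java. The stub classes and the in-memory namespace map below are hypothetical stand-ins, not the real Hadoop types; the point is only that a single listing call returns file statuses already paired with their block locations (shown here for the isRecursive=true case), instead of one getFileBlockLocations RPC per file.

```java
import java.util.*;

// Hypothetical stand-in for FileStatus: just a path.
class FileStatusStub {
    final String path;
    FileStatusStub(String path) { this.path = path; }
}

// LocatedFileStatus as proposed: a status carrying its block locations.
class LocatedFileStatusStub extends FileStatusStub {
    final String[] blockHosts; // stand-in for BlockLocation[]
    LocatedFileStatusStub(String path, String[] blockHosts) {
        super(path);
        this.blockHosts = blockHosts;
    }
}

public class ListLocatedSketch {
    // Sketch of the proposed listLocatedFileStatus: one pass over the
    // namespace yields statuses together with locations, rather than a
    // listStatus call followed by a per-file location lookup.
    static Iterator<LocatedFileStatusStub> listLocatedFileStatus(
            Map<String, String[]> namespace, String dir) {
        List<LocatedFileStatusStub> out = new ArrayList<>();
        for (Map.Entry<String, String[]> e : namespace.entrySet()) {
            if (e.getKey().startsWith(dir + "/")) {
                out.add(new LocatedFileStatusStub(e.getKey(), e.getValue()));
            }
        }
        return out.iterator();
    }

    public static void main(String[] args) {
        Map<String, String[]> ns = new TreeMap<>();
        ns.put("/input/part-0", new String[] {"dn1", "dn2"});
        ns.put("/input/part-1", new String[] {"dn2", "dn3"});
        for (Iterator<LocatedFileStatusStub> it =
                 listLocatedFileStatus(ns, "/input"); it.hasNext(); ) {
            LocatedFileStatusStub f = it.next();
            System.out.println(f.path + " -> " + Arrays.toString(f.blockHosts));
        }
    }
}
```

In the real proposal the iterator additionally pulls results from the NameNode one batch at a time; the in-memory list above elides that.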
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889249#action_12889249 ] Hairong Kuang commented on HDFS-202:
I want to explain the difference between my proposal and the previous proposal.
1. For the FileSystem API, the user can specify whether the input paths need to be recursively traversed or not. The result is an iterator, which allows the input files to be fetched from the server one batch at a time, so as to avoid OOM exceptions when the input paths are huge.
2. The design of the new RPCs allows us to return HdfsFileStatus (local file name) instead of FileStatus (full path name), saving CPU processing time. It also allows us to easily limit the response size.
If nobody is against it, I will go ahead with the implementation.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889406#action_12889406 ] dhruba borthakur commented on HDFS-202:
---
+1 to this proposal.
"The return result is an iterator, which allows the input files to be fetched from the server one batch at a time."
However, if the number of files in a directory is small (say 500), then we can still fetch everything in one RPC, can't we?
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889432#action_12889432 ] Hairong Kuang commented on HDFS-202:
"if the number of files in a directory is small (say 500), then we can still fetch everything in one RPC, can't we?"
I will reuse DFS_LIST_LIMIT, introduced in HDFS-985. Its default value is 1000, so by default 500 files will be fetched in one RPC.
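The batching arithmetic can be sketched as follows. The helper below is illustrative only, assuming nothing beyond what the thread states: the server caps each listing response at a fixed limit (1000 by default per HDFS-985), so enumerating a directory takes a number of round trips equal to the ceiling of files/limit.

```java
public class ListingRpcCount {
    // Number of listing RPCs needed to enumerate `files` entries when the
    // server returns at most `limit` entries per response (iterative
    // listing, as in HDFS-985). An empty directory still costs one RPC,
    // since the client must ask to learn that it is empty.
    static int rpcs(int files, int limit) {
        if (files <= 0) return 1;
        return (files + limit - 1) / limit; // ceil(files / limit)
    }

    public static void main(String[] args) {
        System.out.println(rpcs(500, 1000));  // fits in one response: 1
        System.out.println(rpcs(8000, 1000)); // needs several round trips: 8
    }
}
```

With the default limit, dhruba's 500-file directory costs one RPC, and even the 8000-file terasort input from the issue description costs only eight listing RPCs instead of 8000 per-file calls.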
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888626#action_12888626 ] Hairong Kuang commented on HDFS-202:
The above proposed method is an API in FileSystem. Internally in HDFS, I plan to add two new client-to-namenode RPCs:

class HdfsFileStatusAndBlockLocations {
  HdfsFileStatus fileStatus;
  BlockLocation[] blocks;
}

/**
 * Given an array of input paths, return an array of file statuses and block locations.
 * The input array and output array have the same size: the ith item in the output array
 * is the file status and block locations of the ith path in the input array.
 * If an input path is a directory, its block locations are empty.
 */
HdfsFileStatusAndBlockLocations[] getFileStatusAndBlockLocations(Path[] paths);

/**
 * Given an input directory, return the file statuses and block locations of its children.
 */
HdfsFileStatusAndBlockLocations[] listFileStatusAndBlockLocations(Path path);

Suppose the subtrees that represent a job's input paths contain N directories; these two APIs allow a DFS client to implement the proposed file system API with N+1 RPCs to the NameNode.
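The N+1 RPC budget can be illustrated with a toy traversal. Everything below is hypothetical scaffolding, not HDFS code: an in-memory tree stands in for the namespace, and two counting stubs stand in for the proposed RPCs. One batch call covers all the input roots, and each directory discovered costs exactly one listing call.

```java
import java.util.*;

public class RpcBudgetSketch {
    // Toy namespace: a directory maps to its children; files are absent keys.
    static Map<String, List<String>> tree = new HashMap<>();

    static int rpcCount = 0;

    // Stand-in for the proposed batch RPC over the input paths.
    static void getFileStatusAndBlockLocations(List<String> paths) { rpcCount++; }

    // Stand-in for the proposed per-directory listing RPC.
    static List<String> listFileStatusAndBlockLocations(String dir) {
        rpcCount++;
        return tree.getOrDefault(dir, Collections.emptyList());
    }

    // Walk the subtrees under the input directories, counting RPCs.
    static int rpcsForJob(List<String> inputDirs) {
        rpcCount = 0;
        getFileStatusAndBlockLocations(inputDirs); // the "+1": one RPC for roots
        Deque<String> dirs = new ArrayDeque<>(inputDirs);
        while (!dirs.isEmpty()) {
            for (String child : listFileStatusAndBlockLocations(dirs.pop())) {
                if (tree.containsKey(child)) dirs.push(child); // subdirectory
            }
        }
        return rpcCount;
    }

    public static void main(String[] args) {
        tree.put("/job", Arrays.asList("/job/day1", "/job/day2"));
        tree.put("/job/day1", Arrays.asList("/job/day1/part-0"));
        tree.put("/job/day2", Arrays.asList("/job/day2/part-0"));
        // N = 3 directories under the input path, so N + 1 = 4 RPCs.
        System.out.println(rpcsForJob(Arrays.asList("/job")));
    }
}
```

Compare this with the status quo of one getFileBlockLocations RPC per file, which grows with the number of files rather than the (usually much smaller) number of directories.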
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888638#action_12888638 ] Hairong Kuang commented on HDFS-202:
I also plan to use the same idea of iterative listing (HDFS-985) to limit the size of the response when listFileStatusAndBlockLocations is called on a directory.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887592#action_12887592 ] Hairong Kuang commented on HDFS-202:
I am quite bothered that the proposed API returns a map. Is the reason for returning a map that the API does a one-level listPath? Is there a use case that needs only one-level expansion? If we eventually need to get the block locations of all files recursively under the input paths, is the following API a better choice?
{code}
/**
 * @return the block locations of all files recursively under the input paths
 */
Iterator<BlockLocation> getBlockLocations(Path[] paths)
{code}
When implementing this in HDFS, we might need to issue multiple RPCs and be very careful to limit the size of each RPC request and response.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887600#action_12887600 ] Hairong Kuang commented on HDFS-202:
I read FileInputFormat and understand the use case much better. So the client needs to know the FileStatus for filtering, and there is a configuration parameter that specifies whether the input paths need to be traversed recursively. In this case, how about the following revised API?
{code}
class FileStatusAndBlockLocations {
  FileStatus fileStatus;
  BlockLocation[] blocks;
}

Iterator<FileStatusAndBlockLocations> getBlockLocations(Path[] paths, boolean isRecursive);
{code}
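The filtering use case mentioned here can be sketched as follows. StatusAndBlocks is a hypothetical stand-in for the proposed FileStatusAndBlockLocations, and isHidden mirrors the common FileInputFormat convention of skipping names that start with '_' or '.'; the point is that carrying the status half of each result lets the client filter without a second status-fetching RPC.

```java
import java.util.*;

// Hypothetical stand-in for the proposed FileStatusAndBlockLocations.
class StatusAndBlocks {
    final String path;
    final String[] blocks; // stand-in for BlockLocation[]
    StatusAndBlocks(String path, String[] blocks) {
        this.path = path; this.blocks = blocks;
    }
}

public class SplitFilterSketch {
    // FileInputFormat-style hidden-file check: names starting with
    // '_' (e.g. _SUCCESS) or '.' are excluded from split computation.
    static boolean isHidden(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.startsWith("_") || name.startsWith(".");
    }

    // Drain the iterator, keeping only results that pass the filter;
    // block locations travel along with each accepted status.
    static List<StatusAndBlocks> acceptedSplits(Iterator<StatusAndBlocks> results) {
        List<StatusAndBlocks> accepted = new ArrayList<>();
        while (results.hasNext()) {
            StatusAndBlocks r = results.next();
            if (!isHidden(r.path)) accepted.add(r);
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<StatusAndBlocks> fromNamenode = Arrays.asList(
            new StatusAndBlocks("/in/part-0", new String[]{"dn1"}),
            new StatusAndBlocks("/in/_SUCCESS", new String[]{}),
            new StatusAndBlocks("/in/.staging", new String[]{}));
        System.out.println(acceptedSplits(fromNamenode.iterator()).size()); // 1
    }
}
```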
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755567#action_12755567 ] Sanjay Radia commented on HDFS-202:
---
"Maybe we should punt that until someone develops an append-savvy distcp?"
+1
"Why is DetailedFileStatus[] better than Map<FileStatus, BlockLocation[]>? The latter seems more transparent."
I was holding out against a file system interface returning a map, but that is old school. Fine, I am convinced. I suspect you also want the RPC signature to return a map (that makes me more nervous, because most RPCs do not support that - but ours does, I guess).
Wrt the new FileContext API, my proposal is that it provide a single getBlockLocations method, Map<FileStatus, BlockLocation[]> getBlockLocations(Path[] paths), and abandon BlockLocation[] getBlockLocations(path, start, end). (Of course, FileSystem will continue to support the old getBlockLocations.)
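The map-returning shape under discussion can be sketched as follows. Plain strings stand in for FileStatus and BlockLocation here, and the in-memory namespace is hypothetical; the sketch shows why a map fits directory inputs better than a 1:1 array: one input directory can expand to many result entries.

```java
import java.util.*;

public class MapReturnSketch {
    // Sketch of Map<FileStatus, BlockLocation[]> getBlockLocations(Path[]):
    // results are keyed by the expanded files, so directory inputs need no
    // positional correspondence with the output.
    static Map<String, String[]> getBlockLocations(
            Map<String, String[]> namespace, List<String> paths) {
        Map<String, String[]> result = new LinkedHashMap<>();
        for (String p : paths) {
            for (Map.Entry<String, String[]> e : namespace.entrySet()) {
                // Accept the path itself (a file) or anything beneath it.
                if (e.getKey().equals(p) || e.getKey().startsWith(p + "/")) {
                    result.put(e.getKey(), e.getValue());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String[]> ns = new TreeMap<>();
        ns.put("/in/a", new String[]{"dn1"});
        ns.put("/in/b", new String[]{"dn2"});
        // One input directory expands to two map entries: no 1:1 mapping.
        System.out.println(getBlockLocations(ns, Arrays.asList("/in")).size());
    }
}
```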
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755621#action_12755621 ] dhruba borthakur commented on HDFS-202:
---
+1
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753215#action_12753215 ] Doug Cutting commented on HDFS-202:
---
"Is the optimization for sending only partial block reports really necessary?"
It may help the append-savvy distcp use case, but it is not needed in the mapred job-submission use case. Even in the append-savvy distcp use case, it's not clear that it's required. Maybe we should punt that until someone develops an append-savvy distcp?
"Why not create a class called DetailedFileStatus which contains both the file status and block locations?"
Why is DetailedFileStatus[] better than Map<FileStatus, BlockLocation[]>? The latter seems more transparent.
"DetailedFileStatus[] getBlockLocations(Path[] paths); // 1:1 mapping between the two arrays as Doug suggested."
That was intended for the append-savvy distcp use case. The original use case was mapred job submission, where we typically have a list of directories. With directories there is no 1:1 mapping.
[jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
[ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752748#action_12752748 ] Sanjay Radia commented on HDFS-202:
---
"Is the optimization for sending only partial block reports really necessary?"
Most files have very few blocks... Also, Arun's point about the extra call needed for getFileStatus() is valid. Why not create a class called DetailedFileStatus which contains both the file status and block locations: DetailedFileStatus[] getBlockLocations(Path[] paths); // 1:1 mapping between the two arrays, as Doug suggested. We can add the range one later if we really need that optimization.