[ https://issues.apache.org/jira/browse/MAPREDUCE-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1981:
-------------------------------------

    Status: Open  (was: Patch Available)

I'm pleased to see this feature propagate to MR. The approach looks correct, 
just a few comments:

* It looks like this change:
{noformat}
-    return result.toArray(new FileStatus[result.size()]);
+    return result.toArray(new LocatedFileStatus[result.size()]);
{noformat}
causes {{TestMapRed}} to fail. {{SequenceFileInputFormat}} (and, presumably, 
other subtypes of {{FileInputFormat}}) may rely on the array returned from 
{{FileInputFormat.listStatus}} being a {{FileStatus[]}} into which they can 
store plain {{FileStatus}} entries; see the first sketch after this list.
* I think the HDFS fault injection build is breaking the publishing of the HDFS 
artifact, so the mapred tests currently build against a jar that does not 
include the change to the HDFS {{ClientProtocol}}, and {{TestSubmitJob}} fails 
to compile. However, the patch is current with HDFS trunk, and disabling the 
fault injection before running mvn-install, etc. works. Is this fault being 
tracked in HDFS?
* The patch causes {{TestNoDefaultsJobConf}} to fail (see the second sketch 
after this list):
{noformat}
Testcase: testNoDefaults took 4.489 sec
  Caused an ERROR
No AbstractFileSystem for scheme: hdfs
org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: hdfs
  at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:143)
  at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:198)
  at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:394)
  at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:409)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:188)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:234)
  at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:461)
  at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:453)
  at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:354)
  at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1037)
  at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1034)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1030)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:1034)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:536)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:781)
  at org.apache.hadoop.conf.TestNoDefaultsJobConf.testNoDefaults(TestNoDefaultsJobConf.java:83)
{noformat}
* Unfortunately, {{FileInputFormat::addInputPathRecursively}} could be 
overridden by a user. This should either be marked as an incompatible change, 
or the method should be deprecated with its functionality preserved (see the 
third sketch after this list). It may also be worth confirming that no test 
relies on it.
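
To make the first point concrete, here is a minimal sketch (illustrative class 
and method names, not code from the patch) of why an array whose runtime type 
is {{LocatedFileStatus[]}} breaks callers that treat it as a plain 
{{FileStatus[]}}:
{noformat}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.LocatedFileStatus;

// Java arrays are covariant: a LocatedFileStatus[] may be returned where a
// FileStatus[] is declared, but storing a plain FileStatus back into that
// array fails at runtime with ArrayStoreException.
public class ArrayCovarianceSketch {
  static FileStatus[] listStatus() {
    // mirrors the patched line: the array's runtime type is LocatedFileStatus[]
    return new LocatedFileStatus[1];
  }

  public static void main(String[] args) {
    FileStatus[] files = listStatus();
    // a subclass that rewrites entries in place (e.g. replacing a directory's
    // entry with the status of a file inside it) would do something like:
    files[0] = new FileStatus();   // throws ArrayStoreException
  }
}
{noformat}
That is presumably the pattern {{SequenceFileInputFormat}} trips over in 
{{TestMapRed}}.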
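
For the {{TestNoDefaultsJobConf}} failure, the stack trace shows the patched 
{{listStatus}} now going through {{FileContext}}, which resolves an 
{{AbstractFileSystem}} per scheme from the configuration. A minimal sketch of 
the situation, assuming the {{fs.AbstractFileSystem.hdfs.impl}} key and 
{{org.apache.hadoop.fs.Hdfs}} binding used on trunk; whether registering the 
binding in the test is the right fix is an open question:
{noformat}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;

public class NoDefaultsFileContextSketch {
  public static void main(String[] args) throws Exception {
    // Like TestNoDefaultsJobConf, start from a conf that loads no default resources.
    Configuration conf = new Configuration(false);

    // Without this binding, FileContext cannot map the "hdfs" scheme to an
    // AbstractFileSystem and throws UnsupportedFileSystemException, as in the
    // stack trace above. (Assumed workaround, not part of the patch.)
    conf.set("fs.AbstractFileSystem.hdfs.impl", "org.apache.hadoop.fs.Hdfs");

    // "namenode:8020" is a placeholder authority for illustration only.
    FileContext fc =
        FileContext.getFileContext(URI.create("hdfs://namenode:8020/"), conf);
    System.out.println(fc.getWorkingDirectory());
  }
}
{noformat}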
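
On the last point, if deprecation with preserved behavior is the route taken, a 
sketch along these lines would keep existing overrides and callers working; the 
signature and traversal shown are assumptions, not the patch's code:
{noformat}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public abstract class FileInputFormatCompatSketch {

  /**
   * Kept so user subclasses that override or call it keep working; new code
   * should use the listFiles-based traversal instead.
   * @deprecated retained for compatibility only
   */
  @Deprecated
  protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs,
      Path path, PathFilter inputFilter) throws IOException {
    // plain recursive walk over listStatus, preserving the old semantics
    for (FileStatus stat : fs.listStatus(path, inputFilter)) {
      if (stat.isDir()) {
        addInputPathRecursively(result, fs, stat.getPath(), inputFilter);
      } else {
        result.add(stat);
      }
    }
  }
}
{noformat}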

> Improve getSplits performance by using listFiles, the new FileSystem API
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1981
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1981
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.22.0
>
>         Attachments: mapredListFiles.patch, mapredListFiles1.patch, 
> mapredListFiles2.patch
>
>
> This jira will make FileInputFormat and CombineFileInputFormat use the new 
> API, thus reducing the number of RPCs to the HDFS NameNode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
