[ 
http://issues.apache.org/jira/browse/HADOOP-619?page=comments#action_12457340 ] 
            
Sanjay Dahiya commented on HADOOP-619:
--------------------------------------

Thanks for the review, Doug.

To use InputFormatBase for validating and expanding globs, I think it would 
be a good idea to change the signature of 
InputFormatBase.areValidInputDirectories() to return a list of valid Paths as 
part of this patch itself. If any Path is found to be invalid, which would cause 
the job to fail, it should throw InvalidArgumentException. 
This will prevent the glob expansion from happening twice: once on the job 
client end (InputFormatBase.areValidInputDirectories()) and again on the 
jobtracker end while assigning splits (InputFormatBase.listPaths()).

areValidInputDirectories() is a public method in InputFormat, but within Hadoop 
it is used only in JobClient. Without this change, we will need to expand the 
globs twice.
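
A rough sketch of the idea, not the actual patch: the validation step expands each glob once and hands back the matched paths, so a later step can reuse them instead of expanding again. Everything here is an assumption for illustration: the class InputPathValidator, the method validInputPaths(), and the in-memory list of known paths standing in for the file system are all hypothetical, java.nio glob matching stands in for Hadoop's own glob handling, and the standard IllegalArgumentException stands in for the exception named above.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed validate-and-expand step.
public class InputPathValidator {

    /**
     * Expands each glob against the known paths exactly once and returns
     * every match. Throws if some glob matches nothing, so the caller can
     * fail the job before it is submitted.
     */
    public static List<Path> validInputPaths(List<String> globs, List<String> knownPaths) {
        List<Path> valid = new ArrayList<>();
        for (String glob : globs) {
            PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
            boolean matched = false;
            for (String candidate : knownPaths) {
                Path p = Paths.get(candidate);
                if (matcher.matches(p)) {
                    matched = true;
                    if (!valid.contains(p)) {
                        valid.add(p); // keep each path once, in first-seen order
                    }
                }
            }
            if (!matched) {
                // Stand-in for the InvalidArgumentException mentioned in the comment.
                throw new IllegalArgumentException("Input path does not exist: " + glob);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList(
                "in/2006-11/part-0", "in/2006-12/part-1", "in/2007-01/part-0");
        // The caller keeps the returned list and passes it on, rather than
        // expanding the same glob a second time when computing splits.
        System.out.println(validInputPaths(Arrays.asList("in/2006-*/part-*"), known));
    }
}
```

With a signature like this, the client would validate and expand in one call and the expanded list could then be reused when splits are assigned.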

Does this sound reasonable?

> Unify Map-Reduce and Streaming to take the same globbed input specification
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-619
>                 URL: http://issues.apache.org/jira/browse/HADOOP-619
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: eric baldeschwieler
>         Assigned To: Sanjay Dahiya
>         Attachments: Hadoop-619.patch, Hadoop-619.patch, Hadoop-619.patch
>
>
> Right now streaming input is specified very differently from other map-reduce 
> input.  It would be good if these two apps could take much more similar input 
> specs.
> In particular -input in streaming expects a file or glob pattern while MR 
> takes a directory.  It would be cool if both could take a glob pattern of 
> files and if both took a directory by default (with some pattern excluded to 
> allow logs, metadata and other framework output to be safely stored).
> We want to be sure that MR input is backward compatible over this change.  I 
> propose that a single file should be accepted as an input or a single 
> directory.  Globs should only match directories if the pattern is '/' 
> terminated, to avoid massive inputs specified by mistake.
> Thoughts?


        
