[ 
https://issues.apache.org/jira/browse/NUTCH-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341680#comment-15341680
 ] 

Sebastian Nagel commented on NUTCH-2281:
----------------------------------------

I tried to fix all tools but haven't tested all of them yet.  There may be 
some I've overlooked :(.  I didn't fix unit tests, rarely used tools (Benchmark, 
DmozParser), or main() methods that are intended for debugging or 
explicitly take the file system as an argument (ParseData, ParseText).  I'll 
continue testing over the next few days, but help is welcome!

> Support non-default FileSystem
> ------------------------------
>
>                 Key: NUTCH-2281
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2281
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.12
>            Reporter: Sebastian Nagel
>             Fix For: 1.13
>
>
> If a path (input or output) does not belong to the configured default 
> FileSystem, various Nutch tools may raise an exception like
> {noformat}
>   Exception in ... java.lang.IllegalArgumentException: Wrong FS: s3a://..., expected: hdfs://...
> {noformat}
> This is fixed by getting a reference to the FileSystem from the Path object
> {noformat}
>   FileSystem fs = path.getFileSystem(getConf());
> {noformat}
> instead of
> {noformat}
>   FileSystem fs = FileSystem.get(getConf());
> {noformat}
> A given path (e.g., {{s3a://...}}) may not belong to the default file system 
> ({{hdfs://}} or {{file://}} in local mode), and simple checks such as 
> {{fs.exists(path)}} will then fail. Cf. 
> [FileSystem.checkPath(path)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#checkPath(org.apache.hadoop.fs.Path)],
>  and 
> [FileSystem.get(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(org.apache.hadoop.conf.Configuration)]
>  vs. 
> [FileSystem.get(URI,conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI,%20org.apache.hadoop.conf.Configuration)]
>  which is called by 
> [Path.getFileSystem(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/Path.html#getFileSystem%28org.apache.hadoop.conf.Configuration%29].
>   
> Note that the FileSystem for input and output may be different, e.g., read 
> from HDFS and write to S3.
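
The mismatch described in the issue can be illustrated with a small, self-contained Java sketch. It deliberately does not use Hadoop: MiniFs and its methods are hypothetical stand-ins for FileSystem.get(conf), Path.getFileSystem(conf) and FileSystem.checkPath(path), the check is a simplification of what Hadoop really does, and the host and bucket names are made up for illustration.

```java
import java.net.URI;
import java.util.Objects;

/** Simplified model of the "Wrong FS" check; NOT Hadoop's actual code. */
public class WrongFsSketch {

    /** Hypothetical miniature file system bound to one scheme and authority. */
    static final class MiniFs {
        final String scheme;
        final String authority; // may be null, e.g. for file:///

        MiniFs(URI uri) {
            this.scheme = uri.getScheme();
            this.authority = uri.getAuthority();
        }

        /** Models the idea of FileSystem.get(conf): always the configured default. */
        static MiniFs getDefault(URI defaultFs) {
            return new MiniFs(defaultFs);
        }

        /** Models the idea of Path.getFileSystem(conf): derived from the path itself. */
        static MiniFs forPath(URI path, URI defaultFs) {
            // A path without a scheme falls back to the default file system.
            return path.getScheme() == null ? new MiniFs(defaultFs) : new MiniFs(path);
        }

        /** Rejects paths whose scheme or authority differs from this file system. */
        void checkPath(URI path) {
            if (path.getScheme() == null) {
                return; // scheme-less paths are resolved against this file system
            }
            boolean sameScheme = path.getScheme().equalsIgnoreCase(scheme);
            boolean sameAuthority = Objects.equals(path.getAuthority(), authority);
            if (!sameScheme || !sameAuthority) {
                throw new IllegalArgumentException("Wrong FS: " + path + ", expected: "
                        + scheme + "://" + (authority == null ? "" : authority));
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical values: default file system, an HDFS input, an S3 output.
        URI defaultFs = URI.create("hdfs://namenode:8020/");
        URI input = URI.create("hdfs://namenode:8020/crawl/crawldb");
        URI output = URI.create("s3a://bucket/crawl/dump");

        // Resolving the file system per path works for both input and output.
        MiniFs.forPath(input, defaultFs).checkPath(input);
        MiniFs.forPath(output, defaultFs).checkPath(output);

        // Using the default file system for the S3 path fails the check.
        try {
            MiniFs.getDefault(defaultFs).checkPath(output);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the input and the output each get their own file system object, which corresponds to the note that a tool may read from HDFS and write to S3.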



