[Cross-posting since this might be more relevant here.]

--

Hi all,

While trying to run Nutch on Elastic MapReduce, I ran into an issue that I
think is the same as the one described here:

https://forums.aws.amazon.com/thread.jspa?threadID=54492

Exception in thread "main" java.lang.IllegalArgumentException: This
file system object (hdfs://ip-10-122-99-48.ec2.internal:9000) does not
support access to the request path
's3n://mybucketname/crawl/crawldb/current' You possibly called
FileSystem.get(conf) when you should of called FileSystem.get(uri,
conf) to obtain a file system supporting your path.
       at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:351)
       at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
       at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
       at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
       at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:688)
       at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:122)
       at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
       at org.apache.nutch.crawl.Crawl.main(Crawl.java:119)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

It appears that CrawlDb.java uses code that assumes all inputs live on
HDFS. Is this a known bug? If so, could someone point me to the issue
number, and to a patch if one exists?
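
For what it's worth, here is a minimal sketch of the pattern the error
message itself suggests, and presumably what the exists() check in
CrawlDb.createJob would need (the bucket and path are just the ones from
my trace, and FsCheck is only an illustrative name):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path current = new Path("s3n://mybucketname/crawl/crawldb/current");

        // Problematic pattern: FileSystem.get(conf) returns the *default*
        // filesystem (HDFS on EMR), whose checkPath() then rejects the
        // s3n:// path with the exception above.
        FileSystem defaultFs = FileSystem.get(conf);
        // defaultFs.exists(current); // throws IllegalArgumentException

        // The fix the message hints at: resolve the filesystem from the
        // path itself. Path.getFileSystem(conf) is shorthand for
        // FileSystem.get(path.toUri(), conf).
        FileSystem fs = current.getFileSystem(conf);
        System.out.println("exists: " + fs.exists(current));
      }
    }

If that reading of CrawlDb.java is right, a patch would mostly amount to
swapping FileSystem.get(conf) for path.getFileSystem(conf) before the
exists() checks.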

If no patch exists, I'd be happy to contribute one. I'm using Nutch 1.2,
patched for NUTCH-937 and NUTCH-993.

Cheers
Viksit
