Re: Nutch bug - assumption of HDFS in CrawlDb.java even if using other file systems like S3
Julien, I couldn't find any similar symptoms on JIRA - I'll go ahead and file a new one.

Cheers
Viksit

On Wed, May 25, 2011 at 1:07 PM, Julien Nioche wrote:
> Viksit,
>
> Please check if this has already been reported on the JIRA and if not open a
> new issue (for 2.0)
>
> Thanks
>
> Julien
>
> On 25 May 2011 19:02, Viksit Gaur wrote:
>> [Cross-posting since this might be more relevant here.]
>>
>> Hi all,
>>
>> Trying to run Nutch on Elastic MapReduce, I ran into an issue which I
>> think is the same as the following:
>>
>> https://forums.aws.amazon.com/thread.jspa?threadID=54492
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: This
>> file system object (hdfs://ip-10-122-99-48.ec2.internal:9000) does not
>> support access to the request path
>> 's3n://mybucketname/crawl/crawldb/current' You possibly called
>> FileSystem.get(conf) when you should of called FileSystem.get(uri,
>> conf) to obtain a file system supporting your path.
>>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:351)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:688)
>>   at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:122)
>>   at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
>>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:119)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>> It appears that CrawlDb.java uses code that assumes all inputs are on
>> HDFS. Is this a known bug - and if so, could someone point me to the
>> number, and whether there exists a patch for it?
>>
>> If not, I'd be happy to contribute one. I'm using Nutch 1.2 that I've
>> patched for NUTCH-937 and NUTCH-993.
>>
>> Cheers
>> Viksit
>
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
Re: Nutch bug - assumption of HDFS in CrawlDb.java even if using other file systems like S3
Viksit,

Please check if this has already been reported on the JIRA and if not open a new issue (for 2.0)

Thanks

Julien

On 25 May 2011 19:02, Viksit Gaur wrote:
> [Cross-posting since this might be more relevant here.]
>
> Hi all,
>
> Trying to run Nutch on Elastic MapReduce, I ran into an issue which I
> think is the same as the following:
>
> https://forums.aws.amazon.com/thread.jspa?threadID=54492
>
> Exception in thread "main" java.lang.IllegalArgumentException: This
> file system object (hdfs://ip-10-122-99-48.ec2.internal:9000) does not
> support access to the request path
> 's3n://mybucketname/crawl/crawldb/current' You possibly called
> FileSystem.get(conf) when you should of called FileSystem.get(uri,
> conf) to obtain a file system supporting your path.
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:351)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:688)
>   at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:122)
>   at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:119)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> It appears that CrawlDb.java uses code that assumes all inputs are on
> HDFS. Is this a known bug - and if so, could someone point me to the
> number, and whether there exists a patch for it?
>
> If not, I'd be happy to contribute one. I'm using Nutch 1.2 that I've
> patched for NUTCH-937 and NUTCH-993.
>
> Cheers
> Viksit

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Nutch bug - assumption of HDFS in CrawlDb.java even if using other file systems like S3
[Cross-posting since this might be more relevant here.]

Hi all,

Trying to run Nutch on Elastic MapReduce, I ran into an issue which I think is the same as the following:

https://forums.aws.amazon.com/thread.jspa?threadID=54492

Exception in thread "main" java.lang.IllegalArgumentException: This file system object (hdfs://ip-10-122-99-48.ec2.internal:9000) does not support access to the request path 's3n://mybucketname/crawl/crawldb/current' You possibly called FileSystem.get(conf) when you should of called FileSystem.get(uri, conf) to obtain a file system supporting your path.
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:351)
  at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:688)
  at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:122)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:119)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

It appears that CrawlDb.java uses code that assumes all inputs are on HDFS. Is this a known bug - and if so, could someone point me to the number, and whether there exists a patch for it?

If not, I'd be happy to contribute one. I'm using Nutch 1.2 that I've patched for NUTCH-937 and NUTCH-993.

Cheers
Viksit
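The failure mode in the stack trace can be sketched without Hadoop on the classpath. `FileSystem.get(conf)` hands back the default filesystem (here `hdfs://`), whose `checkPath` rejects any path with a different URI scheme before doing any I/O. The `CheckPathDemo` class below is an illustrative stand-in for that scheme comparison, not Hadoop's actual code:

```java
import java.net.URI;

// Illustrative sketch (not Hadoop's actual implementation) of why the
// default hdfs:// filesystem rejects an s3n:// path: Hadoop's
// FileSystem.checkPath performs essentially this scheme comparison.
public class CheckPathDemo {

    // A filesystem handle "supports" a path if the path is relative
    // (no scheme) or its scheme matches the filesystem's own URI scheme.
    static boolean supports(URI fsUri, URI path) {
        String scheme = path.getScheme();
        return scheme == null || scheme.equalsIgnoreCase(fsUri.getScheme());
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://ip-10-122-99-48.ec2.internal:9000");
        URI s3nPath   = URI.create("s3n://mybucketname/crawl/crawldb/current");

        // FileSystem.get(conf) returns the default (hdfs://) filesystem,
        // so the s3n:// path fails the check and Hadoop throws the
        // IllegalArgumentException seen in the stack trace above.
        System.out.println(supports(defaultFs, s3nPath)); // false

        // Resolving the filesystem from the path's own URI instead
        // makes the schemes agree by construction.
        URI s3nFs = URI.create("s3n://mybucketname");
        System.out.println(supports(s3nFs, s3nPath)); // true
    }
}
```

The corresponding fix in `CrawlDb.createJob` would follow the general Hadoop idiom (a sketch, not necessarily the exact patch): replace `FileSystem.get(job)` with `path.getFileSystem(job)` on the crawldb `Path`, so the filesystem is derived from the path's scheme (`hdfs://`, `s3n://`, `file://`, ...) rather than from the cluster's `fs.default.name`.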