Re: Nutch bug - assumption of HDFS in CrawlDb.java even if using other file systems like S3

2011-05-30 Thread Viksit Gaur
Julien,

I couldn't find anything with similar symptoms in JIRA, so I'll go ahead and file a new one.

Cheers
Viksit

On Wed, May 25, 2011 at 1:07 PM, Julien Nioche wrote:
>
> Viksit,
>
> Please check whether this has already been reported in JIRA and, if not,
> open a new issue (for 2.0).
>
> Thanks
>
> Julien
>
> On 25 May 2011 19:02, Viksit Gaur  wrote:
>>
>> [Cross-posting since this might be more relevant here.]
>>
>> --
>>
>> Hi all,
>>
>> While trying to run Nutch on Elastic MapReduce, I ran into an issue
>> which I think is the same as the following:
>>
>> https://forums.aws.amazon.com/thread.jspa?threadID=54492
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: This
>> file system object (hdfs://ip-10-122-99-48.ec2.internal:9000) does not
>> support access to the request path
>> 's3n://mybucketname/crawl/crawldb/current' You possibly called
>> FileSystem.get(conf) when you should of called FileSystem.get(uri,
>> conf) to obtain a file system supporting your path.
>>        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:351)
>>        at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
>>        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
>>        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:688)
>>        at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:122)
>>        at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
>>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:119)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>> It appears that CrawlDb.java uses code that assumes all inputs are on
>> HDFS. Is this a known bug? If so, could someone point me to the issue
>> number and to any existing patch for it?
>>
>> If not, I'd be happy to contribute one. I'm using Nutch 1.2, patched
>> for NUTCH-937 and NUTCH-993.
>>
>> Cheers
>> Viksit
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com


Re: Nutch bug - assumption of HDFS in CrawlDb.java even if using other file systems like S3

2011-05-25 Thread Julien Nioche
Viksit,

Please check whether this has already been reported in JIRA and, if not,
open a new issue (for 2.0).

Thanks

Julien

On 25 May 2011 19:02, Viksit Gaur  wrote:

> [Cross-posting since this might be more relevant here.]
>
> --
>
> Hi all,
>
> While trying to run Nutch on Elastic MapReduce, I ran into an issue
> which I think is the same as the following:
>
> https://forums.aws.amazon.com/thread.jspa?threadID=54492
>
> Exception in thread "main" java.lang.IllegalArgumentException: This
> file system object (hdfs://ip-10-122-99-48.ec2.internal:9000) does not
> support access to the request path
> 's3n://mybucketname/crawl/crawldb/current' You possibly called
> FileSystem.get(conf) when you should of called FileSystem.get(uri,
> conf) to obtain a file system supporting your path.
>        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:351)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:688)
>        at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:122)
>        at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:119)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> It appears that CrawlDb.java uses code that assumes all inputs are on
> HDFS. Is this a known bug? If so, could someone point me to the issue
> number and to any existing patch for it?
>
> If not, I'd be happy to contribute one. I'm using Nutch 1.2, patched
> for NUTCH-937 and NUTCH-993.
>
> Cheers
> Viksit
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Nutch bug - assumption of HDFS in CrawlDb.java even if using other file systems like S3

2011-05-25 Thread Viksit Gaur
[Cross-posting since this might be more relevant here.]

--

Hi all,

While trying to run Nutch on Elastic MapReduce, I ran into an issue
which I think is the same as the following:

https://forums.aws.amazon.com/thread.jspa?threadID=54492

Exception in thread "main" java.lang.IllegalArgumentException: This
file system object (hdfs://ip-10-122-99-48.ec2.internal:9000) does not
support access to the request path
's3n://mybucketname/crawl/crawldb/current' You possibly called
FileSystem.get(conf) when you should of called FileSystem.get(uri,
conf) to obtain a file system supporting your path.
       at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:351)
       at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
       at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
       at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
       at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:688)
       at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:122)
       at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
       at org.apache.nutch.crawl.Crawl.main(Crawl.java:119)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

It appears that CrawlDb.java uses code that assumes all inputs are on
HDFS. Is this a known bug? If so, could someone point me to the issue
number and to any existing patch for it?
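
For illustration, here is a minimal, self-contained sketch of the pattern
the exception message describes. This is only my guess at what the call at
CrawlDb.java:122 boils down to, not the actual Nutch source, and the class
name is mine:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsAssumptionRepro {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      // On EMR, fs.default.name points at hdfs://..., so this returns a
      // DistributedFileSystem no matter where the crawldb actually lives.
      FileSystem fs = FileSystem.get(conf);
      // Asking that HDFS object about an s3n:// path fails inside
      // FileSystem.checkPath() with the IllegalArgumentException above.
      fs.exists(new Path("s3n://mybucketname/crawl/crawldb/current"));
    }
  }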

If not, I'd be happy to contribute one. I'm using Nutch 1.2, patched
for NUTCH-937 and NUTCH-993.
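
If it helps, the fix I have in mind is essentially what the exception
message suggests: resolve the filesystem from the Path itself (equivalent
to FileSystem.get(uri, conf)) instead of from the default configuration.
An untested sketch; the class and method names are mine, not Nutch's:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CrawlDbPathCheck {
    // Existence check with the FileSystem derived from the Path, so
    // s3n://, hdfs:// and file:// crawldbs all resolve correctly.
    public static boolean currentExists(Path crawlDb, Configuration conf)
        throws IOException {
      Path current = new Path(crawlDb, "current");
      FileSystem fs = current.getFileSystem(conf); // NativeS3FileSystem for s3n://
      return fs.exists(current);
    }
  }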

Cheers
Viksit