Re: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf

2012-10-26 Thread kiran chitturi
Thank you very much. This has worked great and resolved the issue of
finding the parser.

One interesting thing: out of 10 PDF files, it crawled 2 files and
reported the other PDF files as unsuccessful. This has happened about 10
times so far.

I really need to debug and add more detailed error messages than just
'unable to successfully parse content ..'

Thanks again,
Kiran.

On Fri, Oct 26, 2012 at 4:16 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> >
> > Is there anything wrong with my eclipse configuration? I am looking to
> > debug some  things in nutch, so i am working with eclipse and nutch.
>
>
> easier to follow the steps in Remote Debugging in Eclipse from
> http://wiki.apache.org/nutch/RunNutchInEclipse
>
> it will save you all sorts of classpath issues etc... note that this works
> in local mode only
>
> HTH
>
> Julien
>
>
> On 25 October 2012 19:44, kiran chitturi 
> wrote:
>
> > Hi,
> >
> > i have built Nutch 2.x in eclipse using this tutorial (
> > http://wiki.apache.org/nutch/RunNutchInEclipse) and with some
> > modifications.
> >
> > It's able to parse HTML files successfully, but when it comes to PDF files
> > it says: 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't retrieve Tika
> > parser for mime-type application/pdf
> >
> > Is there anything wrong with my eclipse configuration? I am looking to
> > debug some  things in nutch, so i am working with eclipse and nutch.
> >
> > Do i need to point any libraries for Eclipse to recognize tika parsers for
> > application/pdf type ?
> >
> > What exactly is the reason for this type of error to appear for only pdf
> > files and not html files ? I am using recent nutch 2.x which has tika
> > upgraded to 1.2
> >
> > I would like some help here and would like to know if anyone has
> > encountered similar problem with eclipse, nutch 2.x and parsing
> > application/pdf files ?
> >
> > Many Thanks,
> > --
> > Kiran Chitturi
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
Kiran Chitturi


RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Mohammad wrk
Hi Markus,


Thanks for the tip. Is there any wiki page that talks about Nutch best 
practices, so that next time I don't waste 3 days and almost 100 GB of data? :-(

Thanks,
 
Mohammad




 From:  Markus Jelsma ; 
To:  user@nutch.apache.org ; 
Subject:  RE: How to recover data from /tmp/hadoop-myuser 
Sent:  Fri, Oct 26, 2012 1:10:49 PM 
 

Hi,

You cannot recover the mapper output as far as I know. But anyway, one should 
never have a fetcher running for three days. It's far better to generate a 
large number of smaller segments and fetch them sequentially: if an error 
occurs, only a small portion is affected. We never run fetchers for more than 
one hour; instead we run many in a row, and sometimes concurrently.
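The advice above can be sketched as a short-segment crawl loop along these lines (a sketch only -- the crawldb/segments paths, round count, and -topN value are example placeholders, not taken from this thread):

```shell
#!/bin/sh
# Sketch: many small generate/fetch/parse/update rounds instead of one huge fetch.
# CRAWLDB, SEGMENTS and -topN are example values -- adjust to your own layout.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

for round in 1 2 3 4 5; do
  bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN 10000 || break
  # The segment just generated is the newest (timestamp-named) directory.
  SEGMENT=$(ls -d "$SEGMENTS"/* | tail -1)
  bin/nutch fetch "$SEGMENT" -threads 25
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb "$CRAWLDB" "$SEGMENT"
done
```

If one round fails, only that segment's URLs are affected, and the loop can simply be restarted.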

Cheers,


-Original message-
> From:Mohammad wrk 
> Sent: Fri 26-Oct-2012 00:47
> To: user@nutch.apache.org
> Subject: How to recover data from /tmp/hadoop-myuser
> 
> Hi,
> 
> 
> 
> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) failed,
> after 3 days, with the error below. Under the segment folder 
> (./segments/20121021205343/) there is only the generated fetch list 
> (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of data. I 
> was wondering if there is a way to recover this data and parse the segment?
> 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> valid local directory for output/file.out
> 
>         at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>         at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>         at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>         at 
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>         at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
>         at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>         at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: 
> Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
> 
> 
> 
> Thanks,
> Mohammad

Re: Nutch2.1 problems

2012-10-26 Thread Lewis John Mcgibbney
Hi,

On Tue, Oct 23, 2012 at 2:42 PM, Mouradk  wrote:

> This sits in a urls/seed.txt in NUTCH_HOME (not runtime folder but the home 
> folder generated after unzipping).

Please put the urls directory (with the seed file for bootstrapping)
into /runtime/local and run the command from the script in
/runtime/local/bin; this will work perfectly as explained in the
tutorial.
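Concretely, the setup looks something like this (a sketch; the seed URL is just an example, and NUTCH_HOME stands for wherever you unpacked/built Nutch 2.x):

```shell
# Sketch: run everything from runtime/local, as the tutorial expects.
cd "$NUTCH_HOME/runtime/local"                    # NUTCH_HOME = your Nutch checkout
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt   # example seed URL
bin/nutch inject urls                             # 2.x injects seeds into the configured store
```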

hth

Lewis


Re: nutch on AWS EMR.

2012-10-26 Thread Lewis John Mcgibbney
Hi,

On Thu, Oct 25, 2012 at 3:03 PM, manubharghav  wrote:

> Will providing a core-site.xml overriding some of the permissions in
> core-default.xml in the hadoop jar help?

It's certainly something I would try.

Also have you tried using the Nutch script at all? If you can get this
working you will find it much easier to execute commands.

We could really do with a tutorial on getting Nutch up and running on
cloud-based platforms.

Lewis


RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
Hi - there's a similar entry already; however, the fetcher.done part doesn't 
seem to be correct. I can see no reason why that would ever work, as Hadoop temp 
files are simply not copied to the segment if it fails. There's also no notion 
of a fetcher.done file in trunk.

http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F

 
 
-Original message-
> From:Lewis John Mcgibbney 
> Sent: Fri 26-Oct-2012 15:15
> To: user@nutch.apache.org
> Subject: Re: How to recover data from /tmp/hadoop-myuser
> 
> I really think this should be in the FAQs?
> 
> http://wiki.apache.org/nutch/FAQ
> 
> On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma
>  wrote:
> > Hi,
> >
> > You cannot recover the mapper output as far as i know. But anyway, one 
> > should never have a fetcher running for three days. It's far better to 
> > generate a large amount of smaller segments and fetch them sequentially. If 
> > an error occurs, only a small portion is affected. We never run fetchers 
> > for more than one hour, instead we run many in a row and sometimes 
> > concurrently.
> >
> > Cheers,
> >
> >
> > -Original message-
> >> From:Mohammad wrk 
> >> Sent: Fri 26-Oct-2012 00:47
> >> To: user@nutch.apache.org
> >> Subject: How to recover data from /tmp/hadoop-myuser
> >>
> >> Hi,
> >>
> >>
> >>
> >> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) 
> >> failed, after 3 days, with the error below. Under the segment folder 
> >> (./segments/20121021205343/) there is only generated fetch list 
> >> (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of 
> >> data. I was wondering if there is a way to recover this data and parse the 
> >> segment?
> >>
> >> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> >> valid local directory for output/file.out
> >>
> >> at 
> >> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
> >> at 
> >> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
> >> at 
> >> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
> >> at 
> >> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
> >> at 
> >> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
> >> at 
> >> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
> >> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >> at 
> >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: 
> >> java.io.IOException: Job failed!
> >> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> >> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
> >> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
> >>
> >>
> >> Thanks,
> >> Mohammad
> 
> 
> 
> -- 
> Lewis
> 


Re: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Lewis John Mcgibbney
I really think this should be in the FAQs?

http://wiki.apache.org/nutch/FAQ

On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma
 wrote:
> Hi,
>
> You cannot recover the mapper output as far as i know. But anyway, one should 
> never have a fetcher running for three days. It's far better to generate a 
> large amount of smaller segments and fetch them sequentially. If an error 
> occurs, only a small portion is affected. We never run fetchers for more than 
> one hour, instead we run many in a row and sometimes concurrently.
>
> Cheers,
>
>
> -Original message-
>> From:Mohammad wrk 
>> Sent: Fri 26-Oct-2012 00:47
>> To: user@nutch.apache.org
>> Subject: How to recover data from /tmp/hadoop-myuser
>>
>> Hi,
>>
>>
>>
>> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) failed, 
>> after 3 days, with the error below. Under the segment folder 
>> (./segments/20121021205343/) there is only generated fetch list 
>> (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of 
>> data. I was wondering if there is a way to recover this data and parse the 
>> segment?
>>
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
>> valid local directory for output/file.out
>>
>> at 
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>> at 
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>> at 
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>> at 
>> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>> at 
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
>> at 
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>> at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: 
>> java.io.IOException: Job failed!
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
>> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
>>
>>
>> Thanks,
>> Mohammad



-- 
Lewis


RE: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf

2012-10-26 Thread Markus Jelsma
Hi,
 
-Original message-
> From:kiran chitturi 
> Sent: Thu 25-Oct-2012 20:49
> To: user@nutch.apache.org
> Subject: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type 
> application/pdf
> 
> Hi,
> 
> i have built Nutch 2.x in eclipse using this tutorial (
> http://wiki.apache.org/nutch/RunNutchInEclipse) and with some modifications.
> 
> It's able to parse HTML files successfully, but when it comes to PDF files it
> says: 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't retrieve Tika
> parser for mime-type application/pdf
> 
> Is there anything wrong with my eclipse configuration? I am looking to
> debug some  things in nutch, so i am working with eclipse and nutch.
> 
> Do i need to point any libraries for Eclipse to recognize tika parsers for
> application/pdf type ?
> 
> What exactly is the reason for this type of error to appear for only pdf
> files and not html files ? I am using recent nutch 2.x which has tika
> upgraded to 1.2

This is possible if the PDFBox dependency is not found anywhere or is wrongly 
mapped in the Tika plugin's plugin.xml. The above error can also happen if you 
happen to have a tika-parsers-VERSION.jar in your runtime/local/lib directory, for 
some strange reason.
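A quick way to check for that second case is to list the Tika-related jars on the runtime classpath (a sketch; assumes the standard runtime/local layout):

```shell
# Sketch: look for Tika/PDFBox jars in the runtime lib directory.
# A stray tika-parsers-*.jar here can shadow the parse-tika plugin's own
# copy of the parsers and trigger the "Can't retrieve Tika parser" error.
ls runtime/local/lib | grep -Ei 'tika|pdfbox'
```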

> 
> I would like some help here and would like to know if anyone has
> encountered similar problem with eclipse, nutch 2.x and parsing
> application/pdf files ?
> 
> Many Thanks,
> -- 
> Kiran Chitturi
> 


RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
Hi,

You cannot recover the mapper output as far as I know. But anyway, one should 
never have a fetcher running for three days. It's far better to generate a 
large number of smaller segments and fetch them sequentially: if an error 
occurs, only a small portion is affected. We never run fetchers for more than 
one hour; instead we run many in a row, and sometimes concurrently.

Cheers,

 
-Original message-
> From:Mohammad wrk 
> Sent: Fri 26-Oct-2012 00:47
> To: user@nutch.apache.org
> Subject: How to recover data from /tmp/hadoop-myuser
> 
> Hi,
> 
> 
> 
> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) failed, 
> after 3 days, with the error below. Under the segment folder 
> (./segments/20121021205343/) there is only generated fetch list 
> (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of data. 
> I was wondering if there is a way to recover this data and parse the segment?
> 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> valid local directory for output/file.out
> 
>         at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>         at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>         at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>         at 
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>         at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
>         at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>         at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: 
> Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
> 
> 
> Thanks,
> Mohammad


Re: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf

2012-10-26 Thread Julien Nioche
>
> Is there anything wrong with my eclipse configuration? I am looking to
> debug some  things in nutch, so i am working with eclipse and nutch.


It's easier to follow the steps in "Remote Debugging in Eclipse" from
http://wiki.apache.org/nutch/RunNutchInEclipse

It will save you all sorts of classpath issues etc. Note that this works
in local mode only.
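The JVM side of that setup generally looks like the following (a sketch; the port, the suspend mode, and the assumption that bin/nutch honours NUTCH_OPTS are examples to check against the wiki page):

```shell
# Sketch: start a Nutch command with a JDWP agent so Eclipse can attach
# a "Remote Java Application" debug configuration to localhost:8000.
export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
bin/nutch parsechecker http://nutch.apache.org/   # example command to debug
```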

HTH

Julien


On 25 October 2012 19:44, kiran chitturi  wrote:

> Hi,
>
> i have built Nutch 2.x in eclipse using this tutorial (
> http://wiki.apache.org/nutch/RunNutchInEclipse) and with some
> modifications.
>
> It's able to parse HTML files successfully, but when it comes to PDF files it
> says: 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't retrieve Tika
> parser for mime-type application/pdf
>
> Is there anything wrong with my eclipse configuration? I am looking to
> debug some  things in nutch, so i am working with eclipse and nutch.
>
> Do i need to point any libraries for Eclipse to recognize tika parsers for
> application/pdf type ?
>
> What exactly is the reason for this type of error to appear for only pdf
> files and not html files ? I am using recent nutch 2.x which has tika
> upgraded to 1.2
>
> I would like some help here and would like to know if anyone has
> encountered similar problem with eclipse, nutch 2.x and parsing
> application/pdf files ?
>
> Many Thanks,
> --
> Kiran Chitturi
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble