Re: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf

2012-10-26 Thread kiran chitturi
Thank you very much. This worked great and resolved the issue of finding the parser. One interesting thing is that, out of 10 PDF files, it crawled 2 and reported the other PDF files as unsuccessful. This has happened about 10 times so far. I really need to debug and put more error messages than jus

RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Mohammad wrk
Hi Markus, Thanks for the tip. Is there any wiki page that talks about Nutch best practices? So that next time I don't waste 3 days and almost 100 G of data :-( Thanks, Mohammad From: Markus Jelsma ; To: user@nutch.apache.org ; Subject: RE: How to rec

Re: Nutch2.1 problems

2012-10-26 Thread Lewis John Mcgibbney
Hi, On Tue, Oct 23, 2012 at 2:42 PM, Mouradk wrote: > This sits in a urls/seed.txt in NUTCH_HOME (not runtime folder but the home > folder generated after unzipping). Please put the urls directory (with the seed file for bootstrapping) into /runtime/local and run the command from the script in

Re: nutch on AWS EMR.

2012-10-26 Thread Lewis John Mcgibbney
Hi, On Thu, Oct 25, 2012 at 3:03 PM, manubharghav wrote: > Will providing a core-site.xml overwriting some of the permission in > core-default.xml in hadoop jar help ?? It's certainly something I would try. Also have you tried using the Nutch script at all? If you can get this working you will

RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
Hi - there's a similar entry already; however, the fetcher.done part doesn't seem to be correct. I can see no reason why that would ever work, as Hadoop temp files are simply not copied to the segment if it fails. There's also no notion of a fetcher.done file in trunk. http://wiki.apache.org/nut

Re: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Lewis John Mcgibbney
I really think this should be in the FAQs? http://wiki.apache.org/nutch/FAQ On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma wrote: > Hi, > > You cannot recover the mapper output as far as i know. But anyway, one should > never have a fetcher running for three days. It's far better to generate a

RE: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf

2012-10-26 Thread Markus Jelsma
Hi, -Original message- > From: kiran chitturi > Sent: Thu 25-Oct-2012 20:49 > To: user@nutch.apache.org > Subject: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type > application/pdf > > Hi, > > I have built Nutch 2.x in Eclipse using this tutorial ( > http://wiki.apache.org/

RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
Hi, You cannot recover the mapper output as far as I know. But anyway, one should never have a fetcher running for three days. It's far better to generate a large number of smaller segments and fetch them sequentially. If an error occurs, only a small portion is affected. We never run fetchers
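
The advice above (many small segments, fetched one after another, so that a failure costs only one segment) can be illustrated with a minimal Python sketch. This is not Nutch code and does not call any Nutch or Hadoop API: `split_into_segments`, `fetch_sequentially`, and the `fetch_segment` callback are hypothetical names introduced only for illustration, and the segment size is an arbitrary assumption.

```python
from typing import Callable, List, Tuple


def split_into_segments(urls: List[str], segment_size: int) -> List[List[str]]:
    """Split a URL list into consecutive segments of at most segment_size URLs."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [urls[i:i + segment_size] for i in range(0, len(urls), segment_size)]


def fetch_sequentially(
    segments: List[List[str]],
    fetch_segment: Callable[[List[str]], None],
) -> Tuple[List[int], List[int]]:
    """Fetch segments one at a time and return (done, failed) segment indexes.

    fetch_segment is a hypothetical callback standing in for whatever
    actually fetches one segment. A failure is recorded and the loop
    continues, so only the failing segment has to be redone.
    """
    done: List[int] = []
    failed: List[int] = []
    for index, segment in enumerate(segments):
        try:
            fetch_segment(segment)
            done.append(index)
        except Exception:
            # Only this segment is lost; earlier segments stay completed.
            failed.append(index)
    return done, failed
```

With ten URLs and a segment size of four, this yields three segments; if the callback raises for one of them, the other two are still reported as done and only the failed one needs another attempt.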

Re: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf

2012-10-26 Thread Julien Nioche
> > Is there anything wrong with my eclipse configuration? I am looking to > debug some things in nutch, so i am working with eclipse and nutch. It is easier to follow the steps in "Remote Debugging in Eclipse" from http://wiki.apache.org/nutch/RunNutchInEclipse; it will save you all sorts of classpath