Re: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf
Thank you very much. This has worked great and resolved the issue of finding the parser. One interesting thing: out of 10 PDF files, it crawled 2 files and reported the other PDF files as unsuccessful. This has happened about 10 times now. I really need to debug and add more error messages than just "unable to successfully parse content ..."

Thanks again,
Kiran.

On Fri, Oct 26, 2012 at 4:16 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> > Is there anything wrong with my eclipse configuration? I am looking to
> > debug some things in nutch, so i am working with eclipse and nutch.
>
> It is easier to follow the steps in "Remote Debugging in Eclipse" from
> http://wiki.apache.org/nutch/RunNutchInEclipse
>
> It will save you all sorts of classpath issues etc. Note that this works
> in local mode only.
>
> HTH
>
> Julien
>
> On 25 October 2012 19:44, kiran chitturi wrote:
>
> > Hi,
> >
> > I have built Nutch 2.x in Eclipse using this tutorial (
> > http://wiki.apache.org/nutch/RunNutchInEclipse) and with some
> > modifications.
> >
> > It's able to parse HTML files successfully, but when it comes to PDF
> > files it says: 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't
> > retrieve Tika parser for mime-type application/pdf
> >
> > Is there anything wrong with my Eclipse configuration? I am looking to
> > debug some things in Nutch, so I am working with Eclipse and Nutch.
> >
> > Do I need to point to any libraries for Eclipse to recognize Tika
> > parsers for the application/pdf type?
> >
> > What exactly is the reason for this type of error to appear for only PDF
> > files and not HTML files? I am using recent Nutch 2.x, which has Tika
> > upgraded to 1.2.
> >
> > I would like some help here and would like to know if anyone has
> > encountered a similar problem with Eclipse, Nutch 2.x and parsing
> > application/pdf files.
> >
> > Many Thanks,
> > --
> > Kiran Chitturi
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Kiran Chitturi
RE: How to recover data from /tmp/hadoop-myuser
Hi Markus,

Thanks for the tip. Is there any wiki page that talks about Nutch best practices, so that next time I don't waste 3 days and almost 100 GB of data? :-(

Thanks,
Mohammad

From: Markus Jelsma
To: user@nutch.apache.org
Subject: RE: How to recover data from /tmp/hadoop-myuser
Sent: Fri, Oct 26, 2012 1:10:49 PM

Hi,

You cannot recover the mapper output as far as I know. But anyway, one should never have a fetcher running for three days. It's far better to generate a large number of smaller segments and fetch them sequentially. If an error occurs, only a small portion is affected. We never run fetchers for more than one hour; instead we run many in a row and sometimes concurrently.

Cheers,

-Original message-
> From: Mohammad wrk
> Sent: Fri 26-Oct-2012 00:47
> To: user@nutch.apache.org
> Subject: How to recover data from /tmp/hadoop-myuser
>
> Hi,
>
> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25)
> failed, after 3 days, with the error below. Under the segment folder
> (./segments/20121021205343/) there is only the generated fetch list
> (crawl_generate) and no content. However, /tmp/hadoop-myuser/ has 96 GB
> of data. I was wondering if there is a way to recover this data and
> parse the segment?
>
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>     at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
>
> Thanks,
> Mohammad
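Markus's advice (many short fetch cycles rather than one multi-day fetch) can be sketched roughly as the shell loop below. This is only an illustration: the function name is my own, the paths, -topN and -threads values are assumptions rather than something from this thread, and the exact sub-commands differ between Nutch 1.x and 2.x.

```shell
# Sketch of short, sequential fetch cycles (1.x-style commands assumed;
# crawl/ paths, -topN 1000 and -threads 25 are illustrative values).
fetch_in_small_cycles() {
  cycles=$1
  i=0
  while [ "$i" -lt "$cycles" ]; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    segment=$(ls -d crawl/segments/* | tail -1)   # newest segment
    bin/nutch fetch "$segment" -threads 25
    bin/nutch parse "$segment"
    bin/nutch updatedb crawl/crawldb "$segment"
    i=$((i + 1))
  done
}
```

Usage would be, e.g., `fetch_in_small_cycles 10`: if one cycle dies, only that small segment is lost instead of days of fetched data.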
Re: Nutch2.1 problems
Hi,

On Tue, Oct 23, 2012 at 2:42 PM, Mouradk wrote:

> This sits in a urls/seed.txt in NUTCH_HOME (not the runtime folder but the
> home folder generated after unzipping).

Please put the urls directory (with the seed file for bootstrapping) into /runtime/local and run the command from the script in /runtime/local/bin. This will work perfectly, as explained in the tutorial.

hth
Lewis
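The placement Lewis describes can be sketched with a couple of shell commands. The seed URL below is just an example, and the layout assumes a standard Nutch 2.x source checkout:

```shell
# Create the seed list where runtime/local expects it (example URL).
mkdir -p runtime/local/urls
echo "http://nutch.apache.org/" > runtime/local/urls/seed.txt

# Then, from runtime/local, bootstrap with something like:
#   bin/nutch inject urls
```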
Re: nutch on AWS EMR.
Hi,

On Thu, Oct 25, 2012 at 3:03 PM, manubharghav wrote:

> Will providing a core-site.xml overriding some of the permissions in
> core-default.xml in the hadoop jar help?

It's certainly something I would try. Also, have you tried using the Nutch script at all? If you can get this working you will find it much easier to execute commands. We could really do with a tutorial on getting Nutch up and running on cloud-based platforms.

Lewis
RE: How to recover data from /tmp/hadoop-myuser
Hi - there's a similar entry already; however, the fetcher.done part doesn't seem to be correct. I can see no reason why that would ever work, as Hadoop temp files are simply not copied to the segment if it fails. There's also no notion of a fetcher.done file in trunk.

http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F

-Original message-
> From: Lewis John Mcgibbney
> Sent: Fri 26-Oct-2012 15:15
> To: user@nutch.apache.org
> Subject: Re: How to recover data from /tmp/hadoop-myuser
>
> I really think this should be in the FAQs:
>
> http://wiki.apache.org/nutch/FAQ
>
> On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma wrote:
> > Hi,
> >
> > You cannot recover the mapper output as far as I know. But anyway, one
> > should never have a fetcher running for three days. It's far better to
> > generate a large number of smaller segments and fetch them sequentially.
> > If an error occurs, only a small portion is affected. We never run
> > fetchers for more than one hour; instead we run many in a row and
> > sometimes concurrently.
> >
> > Cheers,
> >
> > -Original message-
> >> From: Mohammad wrk
> >> Sent: Fri 26-Oct-2012 00:47
> >> To: user@nutch.apache.org
> >> Subject: How to recover data from /tmp/hadoop-myuser
> >>
> >> Hi,
> >>
> >> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25)
> >> failed, after 3 days, with the error below. Under the segment folder
> >> (./segments/20121021205343/) there is only the generated fetch list
> >> (crawl_generate) and no content. However, /tmp/hadoop-myuser/ has 96 GB
> >> of data. I was wondering if there is a way to recover this data and
> >> parse the segment?
> >>
> >> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
> >>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
> >>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
> >>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
> >>     at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
> >>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
> >>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
> >>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
> >>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
> >>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> >>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
> >>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
> >>
> >> Thanks,
> >> Mohammad
>
> --
> Lewis
Re: How to recover data from /tmp/hadoop-myuser
I really think this should be in the FAQs:

http://wiki.apache.org/nutch/FAQ

On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma wrote:

> Hi,
>
> You cannot recover the mapper output as far as I know. But anyway, one
> should never have a fetcher running for three days. It's far better to
> generate a large number of smaller segments and fetch them sequentially.
> If an error occurs, only a small portion is affected. We never run
> fetchers for more than one hour; instead we run many in a row and
> sometimes concurrently.
>
> Cheers,
>
> -Original message-
>> From: Mohammad wrk
>> Sent: Fri 26-Oct-2012 00:47
>> To: user@nutch.apache.org
>> Subject: How to recover data from /tmp/hadoop-myuser
>>
>> Hi,
>>
>> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25)
>> failed, after 3 days, with the error below. Under the segment folder
>> (./segments/20121021205343/) there is only the generated fetch list
>> (crawl_generate) and no content. However, /tmp/hadoop-myuser/ has 96 GB
>> of data. I was wondering if there is a way to recover this data and
>> parse the segment?
>>
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>     at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
>>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
>>
>> Thanks,
>> Mohammad

--
Lewis
RE: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf
Hi,

-Original message-
> From: kiran chitturi
> Sent: Thu 25-Oct-2012 20:49
> To: user@nutch.apache.org
> Subject: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type
> application/pdf
>
> Hi,
>
> I have built Nutch 2.x in Eclipse using this tutorial (
> http://wiki.apache.org/nutch/RunNutchInEclipse) and with some
> modifications.
>
> It's able to parse HTML files successfully, but when it comes to PDF
> files it says: 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't
> retrieve Tika parser for mime-type application/pdf
>
> Is there anything wrong with my Eclipse configuration? I am looking to
> debug some things in Nutch, so I am working with Eclipse and Nutch.
>
> Do I need to point to any libraries for Eclipse to recognize Tika
> parsers for the application/pdf type?
>
> What exactly is the reason for this type of error to appear for only PDF
> files and not HTML files? I am using recent Nutch 2.x, which has Tika
> upgraded to 1.2.

This is possible if the PDFBox dependency is not found anywhere or is wrongly mapped in Tika's plugin.xml. The above error can also happen if you happen to have a tika-parsers-VERSION.jar in your runtime/local/lib directory, for some strange reason.

> I would like some help here and would like to know if anyone has
> encountered a similar problem with Eclipse, Nutch 2.x and parsing
> application/pdf files.
>
> Many Thanks,
> --
> Kiran Chitturi
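One quick way to check for the stray jar Markus mentions is to look in runtime/local/lib. The helper function below is my own illustration (the path assumes the default runtime layout):

```shell
# Hypothetical helper: warn if a tika-parsers jar sits in a lib directory,
# where it can shadow the Tika bundled with the parse-tika plugin.
check_conflicting_tika() {
  libdir=$1
  if ls "$libdir"/tika-parsers-*.jar >/dev/null 2>&1; then
    echo "conflict: remove the tika-parsers jar from $libdir"
  else
    echo "ok: no stray tika-parsers jar in $libdir"
  fi
}

check_conflicting_tika runtime/local/lib
```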
RE: How to recover data from /tmp/hadoop-myuser
Hi,

You cannot recover the mapper output as far as I know. But anyway, one should never have a fetcher running for three days. It's far better to generate a large number of smaller segments and fetch them sequentially. If an error occurs, only a small portion is affected. We never run fetchers for more than one hour; instead we run many in a row and sometimes concurrently.

Cheers,

-Original message-
> From: Mohammad wrk
> Sent: Fri 26-Oct-2012 00:47
> To: user@nutch.apache.org
> Subject: How to recover data from /tmp/hadoop-myuser
>
> Hi,
>
> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25)
> failed, after 3 days, with the error below. Under the segment folder
> (./segments/20121021205343/) there is only the generated fetch list
> (crawl_generate) and no content. However, /tmp/hadoop-myuser/ has 96 GB
> of data. I was wondering if there is a way to recover this data and
> parse the segment?
>
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>     at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
>
> Thanks,
> Mohammad
Re: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf
> > Is there anything wrong with my eclipse configuration? I am looking to
> > debug some things in nutch, so i am working with eclipse and nutch.

It is easier to follow the steps in "Remote Debugging in Eclipse" from
http://wiki.apache.org/nutch/RunNutchInEclipse

It will save you all sorts of classpath issues etc. Note that this works in local mode only.

HTH

Julien

On 25 October 2012 19:44, kiran chitturi wrote:

> Hi,
>
> I have built Nutch 2.x in Eclipse using this tutorial (
> http://wiki.apache.org/nutch/RunNutchInEclipse) and with some
> modifications.
>
> It's able to parse HTML files successfully, but when it comes to PDF
> files it says: 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't
> retrieve Tika parser for mime-type application/pdf
>
> Is there anything wrong with my Eclipse configuration? I am looking to
> debug some things in Nutch, so I am working with Eclipse and Nutch.
>
> Do I need to point to any libraries for Eclipse to recognize Tika
> parsers for the application/pdf type?
>
> What exactly is the reason for this type of error to appear for only PDF
> files and not HTML files? I am using recent Nutch 2.x, which has Tika
> upgraded to 1.2.
>
> I would like some help here and would like to know if anyone has
> encountered a similar problem with Eclipse, Nutch 2.x and parsing
> application/pdf files.
>
> Many Thanks,
> --
> Kiran Chitturi

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
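For reference, remote debugging from Eclipse generally means starting the JVM with JDWP options and then attaching a "Remote Java Application" debug configuration to the chosen port. The variable name and port below are assumptions, not something prescribed by the wiki page (check your bin/nutch script for the actual variable it passes to the JVM):

```shell
# Assumption: bin/nutch forwards NUTCH_OPTS to the JVM; port 8000 is arbitrary.
# suspend=y makes the JVM wait until the Eclipse debugger attaches.
export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
```

After exporting this, run the Nutch command as usual and attach Eclipse via Debug Configurations -> Remote Java Application -> localhost:8000.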