RE: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

Markus Jelsma Mon, 15 Oct 2012 14:56:47 -0700

Tika 1.2 has not yet been committed to the 2.x branch, so parsing will not work 
for this specific file in any case. You can help by confirming the ticket so it 
can be committed.


https://issues.apache.org/jira/browse/NUTCH-1433
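
For anyone who wants to rule out the mapping question raised by the
ParserFactory INFO line in the quoted log, here is a minimal sketch that lists
which plugin ids a parse-plugins.xml file maps to a given content type. The
element and attribute names (mimeType, name, plugin, id) are an assumption
about the layout of that file, the SAMPLE document is a hypothetical
trimmed-down stand-in rather than the real conf/parse-plugins.xml, and
mapped_plugins is an illustrative helper, not part of Nutch. How Nutch itself
treats a "*" entry is not modelled here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for conf/parse-plugins.xml.
# The element/attribute names are assumed, not taken from this thread.
SAMPLE = """\
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
"""


def mapped_plugins(xml_text, mime_type):
    """Return the plugin ids listed under an exact mimeType entry.

    Only exact name matches are considered; an empty list means the
    content type has no entry of its own in the given document.
    """
    root = ET.fromstring(xml_text)
    for node in root.iter("mimeType"):
        if node.get("name") == mime_type:
            return [plugin.get("id") for plugin in node.findall("plugin")]
    return []


if __name__ == "__main__":
    for mime in ("text/html", "application/pdf", "*"):
        print(mime, "->", mapped_plugins(SAMPLE, mime))
```

An empty result for application/pdf would be consistent with the INFO message
in the log. As noted in this thread, though, the parse failure itself is
attributed to the Tika/PDFBox version rather than to the mapping.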
 
 
-----Original message-----
> From:kiran chitturi <chitturikira...@gmail.com>
> Sent: Mon 15-Oct-2012 23:54
> To: user@nutch.apache.org
> Subject: Re: nutch - Status: failed(2,200): 
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
> 
> I did not change parse-plugins.xml at all. I am using the 2.x branch.
> 
> Many Thanks,
> Kiran.
> 
> On Mon, Oct 15, 2012 at 5:20 PM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> 
> > Hi,
> >
> > It complains about not finding a Tika parser for the content type. Did you
> > modify parse-plugins.xml? I can run it with a vanilla 1.4, but it fails
> > because of PDFBox. I can parse it successfully with trunk. 1.5 is not going
> > to work, not because it cannot find the TikaParser for PDFs but because
> > PDFBox cannot handle it.
> >
> > Cheers,
> >
> >
> > -----Original message-----
> > > From:kiran chitturi <chitturikira...@gmail.com>
> > > Sent: Mon 15-Oct-2012 21:58
> > > To: user@nutch.apache.org
> > > > Subject: nutch - Status: failed(2,200):
> > > > org.apache.nutch.parse.ParseException: Unable to successfully parse content
> > >
> > > Hi,
> > >
> > > > I am trying to parse PDF files using Nutch, and it fails every time with
> > > > the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
> > > > Unable to successfully parse content' in both Nutch 1.5 and the 2.x series
> > > > when I run the command 'sh bin/nutch parsechecker
> > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
> > >
> > > The hadoop.log looks like this
> > >
> > > >
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.host = null
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.port = 8080
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.timeout = 10000
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.content.limit = -1
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.agent = My Nutch
> > > > Spider/Nutch-2.2-SNAPSHOT
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept.language =
> > > > en-us,en-gb,en;q=0.7,*;q=0.3
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept =
> > > > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > > > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - parsing:
> > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
> > > > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - contentType:
> > > > application/pdf
> > > > 2012-10-15 15:43:36,858 INFO  crawl.SignatureFactory - Using Signature
> > > > impl: org.apache.nutch.crawl.MD5Signature
> > > > 2012-10-15 15:43:36,904 INFO  parse.ParserFactory - The parsing
> > plugins:
> > > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > > > plugin.includes system property, and all claim to support the content
> > type
> > > > application/pdf, but they are not mapped to it  in the
> > parse-plugins.xml
> > > > file
> > > > 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika
> > parser
> > > > for mime-type application/pdf
> > > > 2012-10-15 15:43:36,969 WARN  parse.ParseUtil - Unable to successfully
> > > > parse content
> > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of
> > > > type application/pdf
> > >
> > >
> > > The config file nutch-site.xml is as below:
> > >
> > > > > <?xml version="1.0"?>
> > > >
> > > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > > > <!-- Put site-specific property overrides in this file. -->
> > > > <configuration>
> > > > <property>
> > > >  <name>http.agent.name</name>
> > > >  <value>My Nutch Spider</value>
> > > > </property>
> > > >
> > > > <property>
> > > > > <name>plugin.folders</name>
> > > > > <value>/Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins
> > > > > </value>
> > > > </property>
> > > >
> > > > <property>
> > > > <name>plugin.includes</name>
> > > > > <value>
> > > > > protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)
> > > > > </value>
> > > > </property>
> > > > <!-- Used only if plugin parse-metatags is enabled. -->
> > > > <property>
> > > > <name>metatags.names</name>
> > > > <value>*</value>
> > > > > <description> Names of the metatags to extract, separated by;.
> > > > >   Use '*' to extract all metatags. Prefixes the names with 'metatag.'
> > > > >   in the parse-metadata. For instance to index description and keywords,
> > > > >   you need to activate the plugin index-metadata and set the value of the
> > > > >   parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
> > > > > </description>
> > > > </property>
> > > > <property>
> > > >   <name>index.parse.md</name>
> > > > >   <value>
> > > > > dc.creator,dc.bibliographiccitation,dcterms.issued,content-type,dcterms.bibliographiccitation,dc.format,dc.type,dc.language,dc.contributor,originalcharencoding,dc.publisher,dc.title,charencodingforconversion
> > > > > </value>
> > > >   <description>
> > > >   Comma-separated list of keys to be taken from the parse metadata to
> > > > generate fields.
> > > >   Can be used e.g. for 'description' or 'keywords' provided that these
> > > > values are generated
> > > >   by a parser (see parse-metatags plugin)
> > > >   </description>
> > > > </property>
> > > > <property>
> > > > <name>http.content.limit</name>
> > > > <value>-1</value>
> > > > </property>
> > > > </configuration>
> > > >
> > > > Are there any configuration settings that I need to change to work with PDF
> > > > files? I have parsed and crawled them before, but I am not sure what is
> > > > causing the error now.
> > > >
> > > > Can someone please point out the cause of the errors above?
> > >
> > > Many Thanks,
> > > --
> > > Kiran Chitturi
> > >
> >
> 
> 
> 
> -- 
> Kiran Chitturi
>
