Hi James,

I have increased the limit in nutch-site.xml (
https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have
created the webpage table based on the fields here (
http://nlp.solutions.asia/?p=180).
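
For reference, a minimal sketch of the two nutch-site.xml properties involved, assuming the 6000000 value from James's setup quoted below (the file linked above may use different values):

```xml
<!-- Sketch only: the value is taken from James's message, not from my file. -->
<property>
  <name>http.content.limit</name>
  <!-- Maximum number of bytes kept from content fetched over HTTP. -->
  <value>6000000</value>
</property>
<property>
  <name>file.content.limit</name>
  <!-- Maximum number of bytes kept from content fetched via the file protocol. -->
  <value>6000000</value>
</property>
```

Anything longer than these limits is truncated at fetch time, which is the "trimmed content" case Julien mentions.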

The database still shows the parseStatus as
'–org.apache.nutch.parse.ParseException: Unable to successfully parse
content'.  The text field is 'null' for those rows. This is the
screenshot
<https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.27.13%20PM.png> of
the MySQL database that I have.

Can you please tell me how I can overcome this problem? This is the
screenshot <https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.36.43%20PM.png>
of my webpage table.

Many thanks for your help.

Regards,
Kiran.

On Wed, Oct 17, 2012 at 6:20 AM, <j.sulli...@thomsonreuters.com> wrote:

> Hi Kiran,
>
> I agree with Julien it is probably trimmed content.
>
> I regularly parse PDFs with Nutch 2.x with MySQL as the backend without
> problem (even without the patch).
>
> The differences in my set up from the standard set up that may be
> applicable:
>
> 1) In nutch-site.xml the file.content.limit and http.content.limit are set
> to 6000000.
> 2) I have a custom create webpage table sql script that creates fields
> that can hold more.  The default table fields are not sufficiently large in
> most real world situations. http://nlp.solutions.asia/?p=180
>
> I crawled http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/ and it
> successfully parsed all except one of the PDFs, v29n3.pdf. That PDF is
> almost 20 megs, much larger than the limit in nutch-default.xml and even
> larger than that configured in my nutch-site.xml. Interestingly that PDF is
> also completely pictures (what looks like text is actually pictures of
> text) so there may be no real text to parse.
>
> James
>
> ________________________________________
> From: Julien Nioche [lists.digitalpeb...@gmail.com]
> Sent: Wednesday, October 17, 2012 4:17 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files
>
> trimmed content?
>
> On 16 October 2012 22:47, kiran chitturi <chitturikira...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I am running Nutch 2.x with the patch at
> > https://issues.apache.org/jira/browse/NUTCH-1433, connected to a MySQL
> > database.
> >
> > After the {inject, generate, fetch} commands, when I issue the command (sh
> > bin/nutch parse 1350396627-126726428) the ParserJob succeeds, but when I
> > look inside the database only one PDF file out of 10 is parsed.
> >
> > When I look into hadoop.log it shows a statement like '2012-10-16
> > 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully parse content
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> > application/pdf'.
> >
> > The logs of successfully parsed and failed ones are below. The logs below
> > show that pdf file '......./agosto.pdf' is parsed and the file
> > '..../authors.pdf' is not parsed.
> >
> > The same thing happened for all the other PDF files: the parse failed. When I
> > run 'sh bin/nutch parsechecker {url}' on the failed PDF files, it works
> > and does not show any errors.
> >
> >
> > 2012-10-16 16:04:28,150 INFO  parse.ParserJob - Parsing
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> > > 2012-10-16 16:04:28,151 INFO  parse.ParserFactory - The parsing plugins:
> > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > > plugin.includes system property, and all claim to support the content type
> > > application/pdf, but they are not mapped to it  in the parse-plugins.xml file
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > content-type      application/pdf
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:modified  2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:creation-date        2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:save-date    2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-modified     2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dc:creator        Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:created   2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > creation-date     2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > date      2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > xmp:creatortool   ScanWizard 5
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > modified  2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > creator   Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > author    Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > xmptpg:npages     4
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:author       Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > created   Wed Oct 20 17:12:47 EDT 2010
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > producer  Adobe Acrobat 9.4 Paper Capture Plug-in
> > > 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-save-date    2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag :
> > > dc:title  ALAN v29n3 - Facilitating Student Connections to Judith Ortiz
> > > Cofer's The Line of the Sun and Esmeralda Santiago's Almost a Woman
> > > 2012-10-16 16:04:30,631 INFO  parse.ParserJob - Parsing
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> > > 2012-10-16 16:04:30,680 WARN  parse.MetaTagsParser - Found meta tag :
> > > content-type      application/pdf
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:creation-date        2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:modified  2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:save-date    2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-modified     2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:created   2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > creation-date     2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > date      2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > xmp:creatortool   ScanWizard 5
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > modified  2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > xmptpg:npages     1
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > created   Wed Oct 20 17:00:15 EDT 2010
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > producer  Adobe Acrobat 9.4 Paper Capture Plug-in
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-save-date    2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > dc:title  ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
> > > 2012-10-16 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully
> > > parse content
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> > > application/pdf
> > > 2012-10-16 16:04:30,692 INFO  parse.ParserJob - Parsing
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
> > >
> >
> > Is there any way I can get more logging to tell whether the error is
> > specific to the file or comes from the internal parser?
> >
> > Thank you,
> > --
> > Kiran Chitturi
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
Kiran Chitturi
