Hi James,

I have increased the limit in nutch-site.xml (https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have created the webpage table based on the fields here (http://nlp.solutions.asia/?p=180).
The database still shows the parseStatus as 'org.apache.nutch.parse.ParseException: Unable to successfully parse content', and the text field is 'null' for those rows. This is a screenshot of my MySQL database:
<https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.27.13%20PM.png>

Can you please tell me how I can overcome this problem?

This is a screenshot of my webpage table:
<https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.36.43%20PM.png>

Many thanks for your help.

Regards,
Kiran.

On Wed, Oct 17, 2012 at 6:20 AM, <j.sulli...@thomsonreuters.com> wrote:

> Hi Kiran,
>
> I agree with Julien it is probably trimmed content.
>
> I regularly parse PDFs with Nutch 2.x with MySQL as the backend without
> problem (even without the patch).
>
> The differences in my set up from the standard set up that may be
> applicable:
>
> 1) In nutch-site.xml the file.content.limit and http.content.limit are set
> to 6000000.
> 2) I have a custom create webpage table sql script that creates fields
> that can hold more. The default table fields are not sufficiently large in
> most real world situations. http://nlp.solutions.asia/?p=180
>
> I crawled http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/ and it
> successfully parsed all except one of the PDFs, v29n3.pdf. That PDF is
> almost 20 megs much larger than the limit in nutch-default.xml and even
> larger than that configured in my nutch-site.xml. Interestingly that PDF is
> also completely pictures (what looks like text is actually pictures of
> text) so there may be no real text to parse.
>
> James
>
> ________________________________________
> From: Julien Nioche [lists.digitalpeb...@gmail.com]
> Sent: Wednesday, October 17, 2012 4:17 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files
>
> trimmed content?
>
> On 16 October 2012 22:47, kiran chitturi <chitturikira...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I am running Nutch 2.x with patch here at
> > https://issues.apache.org/jira/browse/NUTCH-1433 and connected to a mysql
> > database.
> >
> > After the {inject, generate, fetch} commands when i issue the command (sh
> > bin/nutch parse 1350396627-126726428) the parserJob was success but when i
> > look inside the database only one pdf file is parsed out of 10.
> >
> > When i look in to hadoop.log it shows the statement '2012-10-16
> > 16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse content
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> > application/pdf' like this.
> >
> > The logs of successfully parsed and failed ones are below. The logs below
> > show that pdf file '......./agosto.pdf' is parsed and the file
> > '..../authors.pdf' is not parsed.
> >
> > The same thing happened for all other pdf files, the parse failed. When i
> > do the 'sh bin/nutch parsechecker {url}' it worked with the failed pdf
> > files and it does not show any errors.
> >
> > > 2012-10-16 16:04:28,150 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> > > 2012-10-16 16:04:28,151 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : content-type application/pdf
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dcterms:modified 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:creation-date 2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:save-date 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : last-modified 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dc:creator Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dcterms:created 2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : creation-date 2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : date 2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : xmp:creatortool ScanWizard 5
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : modified 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : creator Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : author Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : xmptpg:npages 4
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:author Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : created Wed Oct 20 17:12:47 EDT 2010
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : producer Adobe Acrobat 9.4 Paper Capture Plug-in
> > > 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag : last-save-date 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag : dc:title ALAN v29n3 - Facilitating Student Connections to Judith Ortiz Cofer's The Line of the Sun and Esmeralda Santiago's Almost a Woman
> > > 2012-10-16 16:04:30,631 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> > > 2012-10-16 16:04:30,680 WARN parse.MetaTagsParser - Found meta tag : content-type application/pdf
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : meta:creation-date 2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dcterms:modified 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : meta:save-date 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : last-modified 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dcterms:created 2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : creation-date 2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : date 2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : xmp:creatortool ScanWizard 5
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : modified 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : xmptpg:npages 1
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : created Wed Oct 20 17:00:15 EDT 2010
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : producer Adobe Acrobat 9.4 Paper Capture Plug-in
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : last-save-date 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dc:title ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
> > > 2012-10-16 16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type application/pdf
> > > 2012-10-16 16:04:30,692 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
> >
> > Is there any way i can get more logs about knowing whether the error is
> > file specific or error from internal parser ?
> >
> > Thank you,
> > --
> > Kiran Chitturi
> >
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Kiran Chitturi
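
On the question of which files fail: a minimal sketch, not part of Nutch, that pulls the failing URLs out of hadoop.log so they can be checked one by one. It assumes each log entry sits on a single line in the file (the wrapping above comes from the mail client) and that the log lives at logs/hadoop.log; both the path and the helper names are hypothetical.

```python
import re

# Pattern for the ParseUtil warning quoted in this thread. One entry per line
# is an assumption about the on-disk log layout.
FAILED_PARSE = re.compile(
    r"WARN\s+parse\.ParseUtil - Unable to successfully parse content (\S+) of type (\S+)"
)


def failed_parses(lines):
    """Return (url, content_type) pairs for every parse failure in the lines."""
    failures = []
    for line in lines:
        match = FAILED_PARSE.search(line)
        if match:
            failures.append((match.group(1), match.group(2)))
    return failures


def load_failed_parses(path="logs/hadoop.log"):
    """Read a log file (path is an assumed default) and list its parse failures."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return failed_parses(handle)
```

Comparing the download size of each URL this returns against http.content.limit would be one way to test the trimmed-content suggestion made earlier in the thread.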