Re: [Nutch-general] How to parse PDF files? Deferred parsing possible?

Doğacan Güney Wed, 30 May 2007 23:10:27 -0700

On 5/31/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> I am crawling pages using the following commands in a loop iterating 10 
> times:-
>
>    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
>    seg1=`ls -d crawl/segments/* | tail -1`
>    bin/nutch fetch $seg1 -threads 50
>    bin/nutch updatedb crawl/crawldb $seg1
>
> I am getting the following errors whenever it tries to parse non-HTML content.
>
> Error parsing: http://policydep/cmm.pdf: failed(2,200):
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/pdf


Add the parse-pdf plugin to the plugin.includes property in your config
(conf/nutch-site.xml).
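
As a rough sketch, the override could look like the fragment below. The
plugin list shown is only an assumed example, not a recommended set: copy
the plugin.includes value from your own conf/nutch-default.xml and add
"pdf" to the parse-(...) group. The <property> element goes inside the
<configuration> element of conf/nutch-site.xml.

```xml
<!-- conf/nutch-site.xml: overrides the default from conf/nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed example value; start from your own default and add "pdf" -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```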

>
> How can I make it parse this type of content while crawling?
>
> And if I run the fetch in non-parsing mode, how can I make it parse
> them later and update the "crawl" folder?
>
> Please help.
>
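
For the deferred-parsing part, a minimal sketch, assuming your Nutch
version's fetcher accepts -noParsing and that bin/nutch has a parse
command (check the usage output of bin/nutch to confirm). The run and
deferred_parse helpers and the DRY_RUN variable are hypothetical names
used only for illustration. By default the sketch prints the commands
instead of running them.

```shell
#!/bin/sh
# Sketch only: fetch without parsing, then parse the segment in a later step.
# DRY_RUN=1 (the default here) prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical helper: the fetch / parse / updatedb sequence for one segment.
deferred_parse() {
    seg="$1"
    run bin/nutch fetch "$seg" -threads 50 -noParsing  # fetch only, skip parsing
    run bin/nutch parse "$seg"                         # parse the stored content later
    run bin/nutch updatedb crawl/crawldb "$seg"        # update the crawldb as before
}
```

Inside your loop you would call deferred_parse "$seg1" in place of the
fetch and updatedb lines. The parse-pdf plugin still has to be enabled
before the parse step, otherwise the same "parser not found" error
appears there.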


-- 
Doğacan Güney