On 5/31/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> I am crawling pages using the following commands in a loop iterating 10
> times:
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 1000
> seg1=`ls -d crawl/segments/* | tail -1`
> bin/nutch fetch $seg1 -threads 50
> bin/nutch updatedb crawl/crawldb $seg1
>
> I am getting the following errors whenever it tries to parse non-HTML content:
>
> Error parsing: http://policydep/cmm.pdf: failed(2,200):
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/pdf
Add the parse-pdf plugin to your config (plugin.includes property); a minimal
example is sketched below.

> How can I make it parse these types of content while crawling?
>
> And if I run the fetch in non-parsing mode, how can I make it parse them
> later and update the "crawl" folder?
>
> Please help.

--
Doğacan Güney
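For reference, a minimal sketch of that change, assuming a stock
conf/nutch-site.xml; the surrounding plugin names below are illustrative
defaults and may differ in your Nutch version, so copy the current value from
conf/nutch-default.xml and just add pdf to the parse-(...) group:

  <!-- Override plugin.includes in conf/nutch-site.xml so the parse-pdf
       plugin is loaded. The value is a regular expression matched against
       plugin ids; the parse-(...|pdf) part is the only actual change here,
       the rest mirrors typical defaults and may differ in your version. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

Because plugin.includes is a regular expression over plugin ids, writing
parse-(text|html|js|pdf) is equivalent to listing parse-pdf as a separate
alternative.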
