Hi,

Sorry, I know there have been many posts on this subject, but I still have a problem with PDF files in Nutch 0.8.1:
I crawl a small part of my intranet with the command line below:

    bin/nutch crawl urls -dir crawldir -depth 3

* PDF files are fetched:

    fetching http://my.intranet.fr/essairecherche/moinf015.pdf

* Then they are indexed:

    Indexing [http://my.intranet.fr/essairecherche/moinf015.pdf] with analyzer [EMAIL PROTECTED] (null)

In my NutchHome/conf/nutch-default.xml, I have set:

    http.content.limit to -1
    indexer.max.tokens to 9000000
    plugin.includes to:
    protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-(fr)

I have verified that my crawl-urlfilter.txt file doesn't contain "pdf" in the "skip" section.

However, when I search for specific terms that I know are in the indexed PDF files, I get no results.

What did I forget? What is PDFBox, and does it have something to do with my problem? Do I need to install it?

Thanks

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
