Hi,

Sorry, I know there have been many posts about
this subject, but I still have a problem with PDF
files in Nutch 0.8.1:

I crawl a small part of my intranet with the command
line below:

bin/nutch crawl urls -dir crawldir -depth 3

* PDF files are fetched:

fetching http://my.intranet.fr/essairecherche/moinf015.pdf

* Then, they are indexed:

Indexing [http://my.intranet.fr/essairecherche/moinf015.pdf] with analyzer [EMAIL PROTECTED] (null)

In my NutchHome/conf/nutch-default.xml, I have set:

http.content.limit to -1
indexer.max.tokens to 9000000
plugin.includes to :
protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-(fr)
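
To rule out a typo in the XML itself, the three values can be read back
from the file. This is only a minimal sketch: it assumes the usual
configuration/property/name/value layout, SAMPLE_CONF is a stand-in for
the real nutch-default.xml, and read_properties is a hypothetical helper,
not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Stand-in for conf/nutch-default.xml (assumption: only the three
# properties listed above are shown; the real file has many more).
SAMPLE_CONF = """<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>indexer.max.tokens</name>
    <value>9000000</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-(fr)</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


props = read_properties(SAMPLE_CONF)
for key in ("http.content.limit", "indexer.max.tokens", "plugin.includes"):
    print(key, "=", props.get(key, "<not set>"))
```

With the real file, the text could be loaded with
open("conf/nutch-default.xml").read() instead of SAMPLE_CONF (the path is
an assumption about the install layout).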

I have verified that my crawl-urlfilter.txt file
doesn't contain "pdf" in the "skip" section.
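
The filter check can also be reproduced outside Nutch. The sketch below
assumes the filter works as I understand it: each rule is "+" or "-"
followed by a regex, the first rule that matches anywhere in the URL
decides, and a URL that matches no rule is rejected. SAMPLE_RULES is a
made-up example, not my actual crawl-urlfilter.txt, and Python's regex
syntax is only close to Java's, so this is an approximation.

```python
import re

# Hypothetical rules standing in for conf/crawl-urlfilter.txt.
SAMPLE_RULES = r"""
# skip some suffixes
-\.(gif|GIF|jpg|JPG|ico|ICO|css|zip|gz|exe|png|PNG)$
# accept hosts in my.intranet.fr
+^http://([a-z0-9]*\.)*my.intranet.fr/
# skip everything else
-.
"""


def load_rules(text):
    """Parse filter lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = load_rules(SAMPLE_RULES)
url = "http://my.intranet.fr/essairecherche/moinf015.pdf"
print(url, "accepted" if accepts(rules, url) else "rejected")
```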


However, when I search for particular terms that I know
are in the indexed PDF files, I keep
getting no results.

What did I forget?
What is PDFBox, and does it have something to do with
my problem? Do I need to install it?

Thanks

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general