Hello everybody,
I need help with my Nutch configuration: I want to crawl and index PDF files. I tried the configuration from the guide, but without success. Here are the important parts of my configuration:

_____________________________________
crawl-urlfilter.txt

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://\S*localhost:8080/examples/jsp/

# skip everything else
-.

___________________________________________________________________
plugin.xml

<runtime>
  <library name="parse-pdf.jar">
    <export name="*"/>
  </library>
  <library name="PDFBox-0.7.4-dev.jar"/>
  <library name="FontBox-0.2.0-dev.jar"/>
  <library name="JempBox-0.2.0-dev.jar"/>
  <library name="bcprov-jdk14-132.jar"/>
  <!-- Uncomment the following two lines after you have downloaded the libraries, see README.txt for more details.-->
  <library name="jai_codec.jar"/>
  <library name="jai_core.jar"/>
</runtime>

__________________________________________________________________
regex-urlfilter.txt

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
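
To rule out the URL filters as the reason no PDFs get indexed, the crawl-urlfilter.txt rules can be replayed against a sample URL. The following is only a minimal sketch in Python, not Nutch code. It assumes the filter applies the rules top to bottom, lets the first matching rule decide, and matches each pattern anywhere in the URL. The function name `accepts` and the sample URLs are hypothetical and chosen only for illustration.

```python
import re

# Rules copied from crawl-urlfilter.txt, in file order: (sign, pattern).
# "+" accepts the URL, "-" rejects it.
CRAWL_RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://\S*localhost:8080/examples/jsp/"),
    ("-", r"."),
]


def accepts(url, rules=CRAWL_RULES):
    """Return True if the first rule that matches the URL is a '+' rule.

    Assumption: first match wins and patterns are searched, not anchored,
    unless the pattern itself contains ^ or $.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched at all: treat the URL as rejected.
    return False


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for url in (
        "http://localhost:8080/examples/jsp/docs/sample.pdf",
        "http://localhost:8080/docs/sample.pdf",
        "http://localhost:8080/examples/jsp/view.jsp?file=sample.pdf",
    ):
        print(url, "->", "accepted" if accepts(url) else "rejected")
```

Under these assumptions, a PDF below /examples/jsp/ passes the filter, a PDF elsewhere on the host is rejected by the final `-.` rule, and any link containing `?` or `=` is rejected by the third rule before a parser ever sees it.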
________________________________________________________________________
nutch-site.xml

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</description>
</property>

___________________________________________________________________________

The 2 libraries (jar files) have been copied into the src/plugin/parse-pdf directory.

Please help, and thanks in advance.