RE: Crawling PDF

ahmed.ridha Wed, 23 Mar 2011 11:02:04 -0700


Hello everybody,


I need help with my Nutch configuration: I want to crawl and index PDF files.

I tried to follow the configuration from the guide, but without success. Here are
the important parts of my configuration files:

_____________________________________
crawl-urlfilter.txt

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME

# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

+^http://\S*localhost:8080/examples/jsp/


# skip everything else
-.
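
To see how these rules treat a given URL, the filter logic can be sketched in a few lines of Python. This is only an illustration, not Nutch's own code: it assumes the rules are tried from top to bottom with a partial (find-style) regex match and that the first matching rule decides. The function name `accepts` and the sample URLs are hypothetical.

```python
import re

# Rules copied from the crawl-urlfilter.txt above, in file order.
# A leading "+" means accept, a leading "-" means reject.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
    r"-[?*!@=]",
    r"-.*(/[^/]+)/[^/]+\1/[^/]+\1/",
    r"+^http://\S*localhost:8080/examples/jsp/",
    r"-.",
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule.

    Assumption: first match wins, and a URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    for url in (
        "http://localhost:8080/examples/jsp/docs/sample.pdf",
        "http://localhost:8080/examples/jsp/search.jsp?query=pdf",
        "http://localhost:8080/docs/manual.pdf",
    ):
        print(url, accepts(url))
```

Under these assumptions a `.pdf` URL below `/examples/jsp/` passes the filter, because `pdf` is not in the skipped-suffix list, while any URL outside that path falls through to the final `-.` rule.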

___________________________________________________________________

plugin.xml



   <runtime>
      <library name="parse-pdf.jar">
         <export name="*"/>
      </library>
      <library name="PDFBox-0.7.4-dev.jar"/>
      <library name="FontBox-0.2.0-dev.jar"/>
      <library name="JempBox-0.2.0-dev.jar"/>
      <library name="bcprov-jdk14-132.jar"/>
      <!-- Uncomment the following two lines after you have downloaded the
           libraries, see README.txt for more details.-->

      <library name="jai_codec.jar"/>
      <library name="jai_core.jar"/>
   </runtime>
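
One way to double-check that every library declared in this `<runtime>` block is actually present is a small Python script. This is a rough sketch under stated assumptions: the helper name `missing_libraries` is hypothetical, and which directory to compare against (for example the src/plugin/parse-pdf directory mentioned in this post) is an assumption that depends on the installation.

```python
import xml.etree.ElementTree as ET

# The <runtime> fragment from plugin.xml, as posted above.
RUNTIME_XML = """
<runtime>
   <library name="parse-pdf.jar">
      <export name="*"/>
   </library>
   <library name="PDFBox-0.7.4-dev.jar"/>
   <library name="FontBox-0.2.0-dev.jar"/>
   <library name="JempBox-0.2.0-dev.jar"/>
   <library name="bcprov-jdk14-132.jar"/>
   <library name="jai_codec.jar"/>
   <library name="jai_core.jar"/>
</runtime>
"""


def missing_libraries(runtime_xml, present_files):
    """List declared library names that are not in present_files.

    present_files is any collection of file names, for example the
    result of os.listdir() on the plugin directory being checked.
    """
    root = ET.fromstring(runtime_xml)
    declared = [lib.get("name") for lib in root.iter("library")]
    return [name for name in declared if name not in present_files]


if __name__ == "__main__":
    # Hypothetical directory listing in which the two JAI jars are absent.
    present = {
        "parse-pdf.jar",
        "PDFBox-0.7.4-dev.jar",
        "FontBox-0.2.0-dev.jar",
        "JempBox-0.2.0-dev.jar",
        "bcprov-jdk14-132.jar",
    }
    print(missing_libraries(RUNTIME_XML, present))
```

With a real directory, the second argument could be `set(os.listdir(plugin_dir))`, where `plugin_dir` is whatever path the plugin is loaded from.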

__________________________________________________________________


regex-urlfilter.txt


# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
________________________________________________________________________


nutch-site.xml

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</description>
</property>
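
As a quick sanity check that the `plugin.includes` value really selects the PDF parser, the expression can be tested against plugin names in Python. This is a sketch only: it assumes the value is treated as a regular expression that must match a plugin's whole id, and the helper name `is_included` is hypothetical.

```python
import re

# The plugin.includes value from nutch-site.xml, as posted above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|"
    r"parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|"
    r"index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
    r"urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes expression.

    Assumption: a full match is required, not just a match on a prefix.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("parse-pdf", "parse-html", "parse-rtf", "protocol-httpclient"):
        print(plugin_id, is_included(plugin_id))
```

Under that assumption `parse-pdf` is selected by the expression, so the value itself does not appear to be what excludes the PDF parser.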
___________________________________________________________________________


The two libraries (JAR files) have been copied into the src/plugin/parse-pdf directory.

Please help, and thanks in advance.
