You set that up in your nutch-site.xml file. Open the
nutch-default.xml file (located in <NUTCH_INSTALL_DIR>/conf) and look
for this element:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>


You'll notice the "parse" plugins are selected by the regex
"parse-(text|html|pdf|msword|rss)".  You add or remove the available
parsers here. So, if you only wanted PDFs, you would use only the pdf
parser: "parse-(pdf)" or just "parse-pdf".

Don't edit the nutch-default.xml file. Create a new nutch-site.xml file
for your customizations.  So, basically copy the nutch-default.xml
file, remove everything you don't need to override, and there ya go.

I believe that is the correct way.
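
As a rough sketch of what that override could look like, here is a
minimal nutch-site.xml that keeps the plugin list shown above and
swaps the parse regex for "parse-pdf". The rest of the value is just
copied from the default; treat it as an assumption and adjust it to
whatever your own nutch-default.xml contains.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: holds only the properties you want to override -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Same list as the default shown above, with
         parse-(text|html|pdf|msword|rss) narrowed to parse-pdf.
         Assumption: the other plugins match your installed defaults. -->
    <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-pdf|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

One thing to watch, as far as I understand it: outlinks are extracted
by the parse plugins, so if your seeds are HTML pages that link to the
PDFs, dropping parse-html will probably keep the crawler from ever
discovering those links.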


On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote:
>
>
> hi!
>
> I have a question. If I have for example the seed urls and do a crawl based on
> that seeds. If I want to index then only pages that contain for example pdf
> documents, how can I do that?
>
> cheers
> martin
>
>
>


-- 
"Conscious decisions by conscious minds are what make reality real"

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
