Wow, thanks Briggs, that's pretty cool and it looks easy :) great!! I will try this out first thing tomorrow; it's a bit late here now.
Two additional questions:

1. Where do I find those "parse" plugins in the Nutch source code? And is it
   possible (and reasonably easy) to write my own parser plugin? I think I am
   going to need some additional, non-standard parser plugin(s).

2. When I do a crawl, is it possible to activate or see some statistics in
   Nutch for it? I mean, at the end of the indexing process, something that
   shows me how many URLs Nutch parsed, how many of them contained e.g. PDFs,
   and how long the crawling and indexing process took, and so on?

thx for the support
martin

Quoting Briggs <[EMAIL PROTECTED]>:

> You set that up in your nutch-site.xml file. Open the
> nutch-default.xml file (located in <NUTCH_INSTALL_DIR>/conf) and look
> for this element:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need to include at least the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library.
>   </description>
> </property>
>
> You'll notice the "parse" plugins matched by the regex
> "parse-(text|html|pdf|msword|rss)". You remove/add the available
> parsers here. So, if you only wanted PDFs, you would use only the PDF
> parser: "parse-(pdf)" or just "parse-pdf".
>
> Don't edit the nutch-default.xml file. Create a new nutch-site.xml file
> for your customizations. So, basically: copy the nutch-default.xml file,
> remove everything you don't need to override, and there ya go.
>
> I believe that is the correct way.
>
> On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote:
> >
> > hi!
> >
> > I have a question. If I have, for example, some seed URLs and do a crawl
> > based on those seeds, and I then want to index only the pages that
> > contain, for example, PDF documents, how can I do that?
> >
> > cheers
> > martin
>
> --
> "Conscious decisions by conscious minds are what make reality real"
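P.S. Just to check that I understand the override correctly: I guess a
minimal nutch-site.xml for a PDF-focused crawl would look roughly like the
sketch below. It is untested; the value is just your default with the
parse-(...) group cut down, and I left parse-html in on the assumption that
HTML pages still need to be parsed so their outlinks to the PDF files get
followed:

<?xml version="1.0"?>
<configuration>
  <!-- Only properties that differ from nutch-default.xml go in here. -->
  <property>
    <name>plugin.includes</name>
    <!-- Default value with the parser list reduced to parse-(html|pdf).
         Assumption: parse-html stays in so outlinks on HTML pages are
         still extracted and the crawler can reach the PDFs. -->
    <value>protocol-httpclient|urlfilter-regex|nutch-extensionpoints|parse-(html|pdf)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>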
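P.P.S. Regarding question 1, I had a quick look myself in the meantime: the
parse plugins seem to live under src/plugin/ in the source tree (parse-text,
parse-pdf, ...). Going by the plugin.xml of parse-text, my guess is that the
descriptor for an own parser plugin would look roughly like this; "parse-foo",
"org.example.parse.FooParser" and the content type are made-up placeholders,
not real names:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="parse-foo" name="Foo Parse Plug-in"
        version="1.0.0" provider-name="example.org">
   <runtime>
      <!-- Jar built from the plugin's sources. -->
      <library name="parse-foo.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <!-- Registers the class at Nutch's parser extension point; it gets
        picked for content fetched with the declared content type. -->
   <extension id="org.example.parse.foo" name="FooParse"
              point="org.apache.nutch.parse.Parser">
      <implementation id="FooParser" class="org.example.parse.FooParser">
         <parameter name="contentType" value="application/x-foo"/>
         <parameter name="pathSuffix" value="foo"/>
      </implementation>
   </extension>
</plugin>

Plus, I assume, a Java class implementing org.apache.nutch.parse.Parser, and
the plugin directory name has to be added to plugin.includes as well.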
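P.P.P.S. For question 2: if I read the command line usage correctly, something
like

   bin/nutch readdb crawl/crawldb -stats

(with crawl/crawldb replaced by the real crawldb path) prints at least the
total URL count and the counts per fetch status after a crawl. I don't think
it reports how long the crawl and indexing took, though; I'll see tomorrow.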
