Re: [Nutch-general] indexing only special documents

Briggs Wed, 06 Jun 2007 20:10:08 -0700

All the plugins are in the Nutch source distribution and are found in:
<NUTCH_INSTALL_DIR>/src/plugin


There is nothing that really provides near real-time statistics other
than the logging.  I am planning on writing a few analysis plugins,
perhaps just using aspects, to let a JMX client monitor the process
(while trying not to be so invasive that it affects performance).  I
haven't done it yet, but I don't see plugin creation as "too difficult"
(if you are comfortable with parsing).

There are some tools you can run that dump metadata and
other useful info for looking into your segments and url databases.
Just run:

 <NUTCH_INSTALL_DIR>/bin/nutch

It will show you the options to run for reading the data.  You can
find out how many urls were successfully fetched, how many failed, the
total number of urls, etc.  Look at the Nutch 0.8 wiki entry
http://wiki.apache.org/nutch/08CommandLineOptions .  It just shows the
shell output for the nutch options to run.  It will give you an idea
of what is available.

For finding how many documents of specific types were fetched, you
would be better off just using the search bean and, basically, using
Lucene to find out those things.  Otherwise you would have to write
your own implementation to read the data.

I am learning more about Nutch every day, so I can't claim everything I
have said is 100% correct.
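
For reference, the plugin.includes override described in the quoted
message below might look roughly like this in a nutch-site.xml.  This is
a minimal sketch, not a tested setup: it assumes the <configuration>
root element used by the 0.8-era conf files, and it copies the default
value from that message with the parse-(...) group narrowed to
parse-pdf.  One caveat, as an assumption rather than a verified fact:
with parse-html removed, HTML pages are no longer parsed, so links from
those pages to PDFs may not be followed.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: holds only the properties that override nutch-default.xml. -->
<!-- Assumption: <configuration> is the root element, as in the 0.8-era conf files. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Same value as the default quoted in this thread, except that
         parse-(text|html|pdf|msword|rss) is narrowed to parse-pdf. -->
    <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-pdf|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Only PDF parsing is enabled; every other plugin is kept
    as in the default value.</description>
  </property>
</configuration>
```

Everything not listed in nutch-site.xml keeps its value from
nutch-default.xml, so the file can stay this short.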


On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote:
> Wow thx Briggs that's pretty cool and it looks easy :) great!! I will try this
> out right tomorrow..bit late now here.
>
> Another 2 additional questions:
>
> 1. Those "parse" plugins, where do I find them in the nutch source code? Is it
> possible and easy to write my own parser plugin? I think I'm going to
> need some additional non-standard parser plugin(s).
>
> 2. When I do a crawl, is it possible to activate or see some statistics
> in nutch for that? I mean that at the end of the indexing process it shows me
> how many urls nutch had parsed and how many of them contained e.g. pdfs, and
> additionally how long the crawling and indexing process took and so on?
>
> thx for support
> martin
>
>
>
> Quoting Briggs <[EMAIL PROTECTED]>:
>
> > You set that up in your nutch-site.xml file. Open the
> > nutch-default.xml file (located in <NUTCH_INSTALL_DIR>/conf). Look
> > for this element:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description>Regular expression naming plugin directory names to
> >   include.  Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin. By
> >   default Nutch includes crawling just HTML and plain text via HTTP,
> >   and basic indexing and search plugins. In order to use HTTPS please enable
> >   protocol-httpclient, but be aware of possible intermittent problems with the
> >   underlying commons-httpclient library.
> >   </description>
> > </property>
> >
> >
> > You'll notice the "parse" plugins, which use the regex
> > "parse-(text|html|pdf|msword|rss)".  You remove/add the available
> > parsers here. So, if you only wanted pdfs, you would only use the pdf
> > parser, "parse-(pdf)" or just "parse-pdf".
> >
> > Don't edit the nutch-default.xml file. Create a new nutch-site.xml file
> > for your customizations.  So, basically copy the nutch-default.xml
> > file, remove everything you don't need to override, and there ya go.
> >
> > I believe that is the correct way.
> >
> >
> > On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote:
> > >
> > >
> > > hi!
> > >
> > > I have a question. Say I have, for example, the seed urls and do a crawl
> > > based on those seeds. If I then want to index only pages that contain,
> > > for example, pdf documents, how can I do that?
> > >
> > > cheers
> > > martin
> > >
> > >
> > >
> >
> >
> > --
> > "Conscious decisions by conscious minds are what make reality real"
> >
>
>
>
>


-- 
"Conscious decisions by conscious minds are what make reality real"
