Thank you for your suggestion. If we filter pages in the index step, it seems
to me that the trash pages will still consume storage. Anyway, for an
intranet crawl, maybe that's tolerable.

Sure, there are alternatives, but they require writing more code.

A past thread mentioned that we can use only the fetcher of Nutch to achieve some tasks. So is it possible to use the fetcher iteratively until
we find the required links and then store and index them?
Sure.
However, if you take a look at how the fetcher output is written to disk, you can use any tool to create the content. But if you are in this situation, then I suggest using your own tool and Lucene, not Nutch.


You wrote:

The way I go is that I index such pages anyway but 'tag' them. So I
use an index filter for that and tag the positive pages with another tag,
like this: category:trash or category:nugget.
Then I also use a query filter plugin, and in the UI I extend my query:
queryString + " category:nugget"
So you will have only non-trash pages in your results. I guess you
can also use the prune tool to remove such trash pages from the index if
you like.
HTH
Stefan
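
The tagging idea above could be sketched roughly as follows. This is only a plain-Java illustration and not Nutch plugin code: the class CategoryTagger, the methods categoryFor and restrictToNuggets, and the URL pattern are all hypothetical names made up for the example. A real setup would put the tagging logic into an index filter plugin and the query handling into a query filter plugin.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decides the tag for a page
// and extends a user query so that only "nugget" pages are returned.
class CategoryTagger {
    // Assumed rule for this example: only URLs under /articles/ are wanted.
    private static final Pattern NUGGET =
        Pattern.compile("^https?://[^/]+/articles/.*");

    // Value to store in the "category" field of the indexed document.
    static String categoryFor(String url) {
        return NUGGET.matcher(url).matches() ? "nugget" : "trash";
    }

    // Same string concatenation as in the mail: queryString + " category:nugget".
    static String restrictToNuggets(String queryString) {
        return queryString + " category:nugget";
    }

    public static void main(String[] args) {
        System.out.println(categoryFor("http://intranet.example/articles/42.html"));
        System.out.println(categoryFor("http://intranet.example/login"));
        System.out.println(restrictToNuggets("nutch fetcher"));
    }
}
```

Every page still gets fetched, parsed and indexed here; only the tag differs, so the trash pages keep using storage until they are pruned from the index.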


On 14.02.2006 at 08:11, Elwin wrote:


2006/2/14, Elwin <[EMAIL PROTECTED]>:

When using Nutch to crawl some sites, I want to index fetched content selectively, only when the URLs of that content match my filter; for other
URLs I just want Nutch to crawl and parse them without indexing.
How can I achieve this? Which extension point should I extend?


---------------------------------------------
George Orwel was an Optimist
blog: http://www.find23.org
company: http://www.media-style.com




_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
