Hi Matt,
I can see you and I are thinking along similar lines, as I did consider adding
a custom field during indexing. However, I rejected that idea because it does
not scale well. For example, say I added a custom field during indexing (call
it setID), and for site foo.com I had a set of sites to index (a.com, b.com,
c.com [setID=1]). That would work great, but I would also have bar.com, which
needed its own set of sites indexed (a.com, d.com, e.com [setID=2]). Now in
this case a.com would be indexed twice (once for each setID), or, if duplicate
removal were used, it might lose its original identity (setID=1) in place of
its new identity (setID=2), which would mean that setID=1 loses its index of
site a.com.
This is very close to the many-to-many relationship problem encountered
regularly in database design. I'm looking for a way to achieve this
relationship within the rules of nutch.
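For what it's worth, the underlying index (Lucene) allows a document to carry several values for the same field, which is one way out of the many-to-many trap: index a.com once, but tag it with every setID it belongs to, so neither identity is lost to duplicate removal. Here's a minimal, self-contained sketch of just the lookup side, with plain Java collections standing in for the real index (all class and method names here are made up for illustration):

```java
import java.util.*;

public class SetIdLookup {
    // hypothetical mapping: each indexed site may belong to several sets
    private final Map<String, Set<Integer>> setIdsBySite = new HashMap<>();

    public void addSite(String site, int setId) {
        setIdsBySite.computeIfAbsent(site, s -> new TreeSet<>()).add(setId);
    }

    // all setIDs a site belongs to; a multi-valued "setID" field would hold
    // exactly this collection on the site's single index entry
    public Set<Integer> setIdsFor(String site) {
        return setIdsBySite.getOrDefault(site, Collections.emptySet());
    }

    public static void main(String[] args) {
        SetIdLookup lookup = new SetIdLookup();
        // foo.com's set: a.com, b.com, c.com [setID=1]
        for (String s : new String[] {"a.com", "b.com", "c.com"}) lookup.addSite(s, 1);
        // bar.com's set: a.com, d.com, e.com [setID=2]
        for (String s : new String[] {"a.com", "d.com", "e.com"}) lookup.addSite(s, 2);
        // a.com keeps both identities instead of one overwriting the other
        System.out.println("a.com -> " + lookup.setIdsFor("a.com")); // prints a.com -> [1, 2]
    }
}
```

The query side would then match on any one setID value, hitting a.com from either set.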
Cheers,
Kristan Uccello
Matt Kangas wrote:
(i'm moving this to nutch-user, so we don't piss off the nutch-dev
folks.)
a few ideas:
- if you only want to match one site at a time, you can just add
"site:xxx" to the query. the "site" field exists in the index by default
- if you want to assign ids to clusters of sites, you can do the
site->id lookup at index time and add a custom field to the index
if you want to do the latter, you need to write custom indexer +
query plugins. it's not hard, but the documentation is somewhat poor.
i'd suggest looking at the source for the plugins "index-more" and
"query-more". those two plugins add support for "type:" and "date:"
queries.
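the first option needs no plugin work at all: restricting a query to one site is just a matter of appending a clause to the query string before it reaches the searcher. a tiny sketch (the "site" field name is the one nutch indexes by default; the helper itself is made up):

```java
public class SiteQuery {
    // restrict a user query to a single site via the built-in "site" field
    public static String restrictToSite(String userQuery, String site) {
        return userQuery.trim() + " site:" + site;
    }

    public static void main(String[] args) {
        System.out.println(restrictToSite("apache lucene", "foo.com"));
        // prints: apache lucene site:foo.com
    }
}
```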
--matt
On Nov 29, 2005, at 7:16 PM, Kristan Uccello wrote:
Hi Matt,
thank you for the reply. I will play with what you suggested, as it
sounds pretty close to what I am after. The only caveat is that all
the websites I will be hosting run from one source, so I am after a
way to dynamically inject a filter based on which site is calling the
searcher. I plan to have nutch running on one web server and access it
through its RSS REST interface, and my thought was to provide an
additional REST parameter like ...&callerID=xxx and have nutch load
the appropriate class that would filter the results down to a list of
domains (stored in a file or database or something) keyed on the
callerID value.
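the callerID idea can be sketched with nothing but standard Java: a map from callerID to allowed hosts (loaded from a file or database in practice), plus a post-search pass that drops results whose host is not on the caller's list. all names below are hypothetical, just to show the shape of it:

```java
import java.net.URI;
import java.util.*;

public class CallerIdFilter {
    // hypothetical callerID -> allowed-domain mapping; in a real setup this
    // would be loaded from a file or database, keyed by the REST parameter
    private final Map<String, Set<String>> domainsByCaller = new HashMap<>();

    public void allow(String callerId, String... domains) {
        domainsByCaller.computeIfAbsent(callerId, c -> new HashSet<>())
                       .addAll(Arrays.asList(domains));
    }

    // keep only results whose host is on the caller's list
    public List<String> filter(String callerId, List<String> resultUrls) {
        Set<String> allowed = domainsByCaller.getOrDefault(callerId, Collections.emptySet());
        List<String> kept = new ArrayList<>();
        for (String url : resultUrls) {
            String host = URI.create(url).getHost();
            if (host != null && allowed.contains(host)) kept.add(url);
        }
        return kept;
    }

    public static void main(String[] args) {
        CallerIdFilter f = new CallerIdFilter();
        f.allow("xxx", "a.com", "b.com");
        List<String> hits = Arrays.asList("http://a.com/page", "http://evil.com/page");
        System.out.println(f.filter("xxx", hits)); // prints [http://a.com/page]
    }
}
```

filtering after the search does change result counts and paging, which is one reason an index-time field (Matt's second option) tends to scale better.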
I'm pretty new to nutch (only been working with it for a few days), so
any guidance you can give me will be greatly appreciated.
Cheers,
Kristan Uccello
Matt Kangas wrote:
(note: this is probably more relevant to nutch-user. please send
replies there.)
This question seems to come up periodically.
Personally, I accomplish this via a custom URLFilter that uses a
MapFile of regex pattern-lists, e.g. one set of regexes per
website. You can find the code in
http://issues.apache.org/jira/browse/NUTCH-87
All this does is allow you to keep track of a large set of
regexes, partitioned by site. It's useful if you want an
extremely-focused crawl, possibly burrowing through CGIs.
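The NUTCH-87 approach boils down to: look up the URL's site, then run only that site's regex list. A self-contained imitation of the URLFilter contract (return the URL to keep it, null to drop it), with plain Java collections in place of the MapFile -- class and method names are illustrative, not the actual patch:

```java
import java.net.URI;
import java.util.*;
import java.util.regex.Pattern;

public class PerSiteRegexFilter {
    // one regex list per site, standing in for the MapFile in NUTCH-87
    private final Map<String, List<Pattern>> patternsBySite = new HashMap<>();

    public void addPattern(String site, String regex) {
        patternsBySite.computeIfAbsent(site, s -> new ArrayList<>())
                      .add(Pattern.compile(regex));
    }

    // URLFilter contract: return the url to accept it, null to reject it
    public String filter(String url) {
        String host = URI.create(url).getHost();
        List<Pattern> patterns = patternsBySite.get(host);
        if (patterns == null) return null;          // unknown site: reject
        for (Pattern p : patterns)
            if (p.matcher(url).find()) return url;  // matched this site's list
        return null;
    }

    public static void main(String[] args) {
        PerSiteRegexFilter f = new PerSiteRegexFilter();
        f.addPattern("a.com", "^http://a\\.com/docs/");
        System.out.println(f.filter("http://a.com/docs/intro.html")); // accepted: prints the url
        System.out.println(f.filter("http://a.com/cgi-bin/x"));       // rejected: prints null
    }
}
```

The per-site partitioning is what keeps this fast with thousands of patterns: each URL is only ever tested against its own site's short list.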
If instead you want to crawl entire sites, but ignoring CGIs is
OK, then PrefixURLFilter is the easiest answer. Create a
newline-delimited text file with the site URLs and use it both as the
seed urls (nutch inject -urlfile) and as the prefixurlfilter config
file (set "urlfilter.prefix.file" in nutch-site.xml).
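Concretely, that setup looks something like this. The prefix file is just one URL prefix per line (file name and path here are examples, not requirements):

```
http://a.com/
http://b.com/
http://c.com/
```

And nutch-site.xml points the filter at it via the property Matt names:

```xml
<!-- nutch-site.xml: point PrefixURLFilter at the site list -->
<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urls.txt</value>
</property>
```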
HTH,
--Matt
On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
Hello all,
I am attempting to modify the RegexUrlFilter and/or the
NutchConfig so that I may dynamically apply a set of domain names
to the fetcher.
In the FAQ:
>> Is it possible to fetch only pages from some specific domains?
>> Please have a look at PrefixURLFilter. Adding some regular
expressions to the urlfilter.regex.file might work, but adding a
list with thousands of regular expressions would slow down your
system excessively.
I wish to be able to provide a list of urls that I want fetched, and I
want the fetcher to fetch only from those sites (not follow any links
out of them). I would like to be able to keep adding to this list
without having to modify the nutch-config.xml each time, and instead
just add it to the config (or other object) in memory. All I am after
is a pointer in the right direction. If I am looking in the wrong
files (or am off my rocker!) please let me know where I could/should go.
The reason I am asking is that I am working on a "roll your own
search". I want to be able to crawl specific sites only, and then, in
the search results, get only results pertaining to some subset of
those crawled sites.
Best regards,
Kristan Uccello
--
Matt Kangas / [EMAIL PROTECTED]
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general