Hi Matt,
I can see you and I are thinking along similar lines, as I did consider adding
a custom field during indexing. However, I rejected that idea because it does
not scale well. For example, say I added a custom field during indexing (call
it setID), and for site foo.com I had a set of sites to index (a.com, b.com,
c.com [setID=1]). That would work great, but I would also have bar.com, which
needed its own set of sites indexed (a.com, d.com, e.com [setID=2]). Now in
this case a.com would be indexed twice (once for each setID), or, if duplicate
removal were used, it might lose its original identity (setID=1) in place of
its new identity (setID=2), which would mean that setID=1 loses its index of
site a.com.
This is very close to the many-to-many relationship problem encountered
regularly in database design. I'm looking for a way to achieve this
relationship within the rules of nutch.
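For what it's worth, the underlying index (Lucene) allows a document to carry several values for the same field, which is one way out of the many-to-many trap: index a.com once, but tag it with every setID it belongs to, so neither identity is lost to duplicate removal. Here's a minimal, self-contained sketch of just the lookup side, with plain Java collections standing in for the real index (all class and method names here are made up for illustration):

```java
import java.util.*;

public class SetIdLookup {
    // hypothetical mapping: each indexed site may belong to several sets
    private final Map<String, Set<Integer>> setIdsBySite = new HashMap<>();

    public void addSite(String site, int setId) {
        setIdsBySite.computeIfAbsent(site, s -> new TreeSet<>()).add(setId);
    }

    // all setIDs a site belongs to; a multi-valued "setID" field would hold
    // exactly this collection on the site's single index entry
    public Set<Integer> setIdsFor(String site) {
        return setIdsBySite.getOrDefault(site, Collections.emptySet());
    }

    public static void main(String[] args) {
        SetIdLookup lookup = new SetIdLookup();
        // foo.com's set: a.com, b.com, c.com [setID=1]
        for (String s : new String[] {"a.com", "b.com", "c.com"}) lookup.addSite(s, 1);
        // bar.com's set: a.com, d.com, e.com [setID=2]
        for (String s : new String[] {"a.com", "d.com", "e.com"}) lookup.addSite(s, 2);
        // a.com keeps both identities instead of one overwriting the other
        System.out.println("a.com -> " + lookup.setIdsFor("a.com")); // prints a.com -> [1, 2]
    }
}
```

The query side would then match on any one setID value, hitting a.com from either set.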
Cheers,
Kristan Uccello
Matt Kangas wrote:
(i'm moving this to nutch-user, so we don't piss off the nutch-dev
folks.)
a few ideas:
- if you only want to match one site at a time, you can just add
"site:xxx" to the query. the "site" field exists in the index by default
- if you want to assign ids to clusters of sites, you can do the
site->id lookup at index time and add a custom field to the index
if you want to do the latter, you need to write custom indexer +
query plugins. it's not hard, but the documentation is somewhat poor.
i'd suggest looking at the source for the plugins "index-more" and
"query-more". those two plugins add support for "type:" and "date:"
queries.
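the first option needs no plugin work at all: restricting a query to one site is just a matter of appending a clause to the query string before it reaches the searcher. a tiny sketch (the "site" field name is the one nutch indexes by default; the helper itself is made up):

```java
public class SiteQuery {
    // restrict a user query to a single site via the built-in "site" field
    public static String restrictToSite(String userQuery, String site) {
        return userQuery.trim() + " site:" + site;
    }

    public static void main(String[] args) {
        System.out.println(restrictToSite("apache lucene", "foo.com"));
        // prints: apache lucene site:foo.com
    }
}
```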
--matt
On Nov 29, 2005, at 7:16 PM, Kristan Uccello wrote:
Hi Matt,
thank you for the reply. I will play with what you suggested, as it
sounds pretty close to what I am after. The only caveat is that all
the websites I will be hosting run from one source, so I am after a
way to dynamically inject a filter based on which site is calling the
searcher. I plan to have nutch running on one web server and access it
through its RSS REST interface, and my thought was to provide an
additional REST parameter like ...&callerID=xxx and have nutch load
the appropriate class that would filter the results down to a list of
domains (stored in a file or database or something) keyed on the
callerID value.
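the callerID idea can be sketched with nothing but standard Java: a map from callerID to allowed hosts (loaded from a file or database in practice), plus a post-search pass that drops results whose host is not on the caller's list. all names below are hypothetical, just to show the shape of it:

```java
import java.net.URI;
import java.util.*;

public class CallerIdFilter {
    // hypothetical callerID -> allowed-domain mapping; in a real setup this
    // would be loaded from a file or database, keyed by the REST parameter
    private final Map<String, Set<String>> domainsByCaller = new HashMap<>();

    public void allow(String callerId, String... domains) {
        domainsByCaller.computeIfAbsent(callerId, c -> new HashSet<>())
                       .addAll(Arrays.asList(domains));
    }

    // keep only results whose host is on the caller's list
    public List<String> filter(String callerId, List<String> resultUrls) {
        Set<String> allowed = domainsByCaller.getOrDefault(callerId, Collections.emptySet());
        List<String> kept = new ArrayList<>();
        for (String url : resultUrls) {
            String host = URI.create(url).getHost();
            if (host != null && allowed.contains(host)) kept.add(url);
        }
        return kept;
    }

    public static void main(String[] args) {
        CallerIdFilter f = new CallerIdFilter();
        f.allow("xxx", "a.com", "b.com");
        List<String> hits = Arrays.asList("http://a.com/page", "http://evil.com/page");
        System.out.println(f.filter("xxx", hits)); // prints [http://a.com/page]
    }
}
```

filtering after the search does change result counts and paging, which is one reason an index-time field (Matt's second option) tends to scale better.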
I'm pretty new to nutch (only been working with it for a few days), so
any guidance you can give me will be greatly appreciated.
Cheers,
Kristan Uccello
Matt Kangas wrote:
(note: this is probably more relevant to nutch-user. please send
replies there.)
This question seems to come up periodically.
Personally, I accomplish this via a custom URLFilter that uses a
MapFile of regex pattern-lists, e.g. one set of regexes per
website. You can find the code in
http://issues.apache.org/jira/browse/NUTCH-87
All this does is allow you to keep track of a large set of
regexes, partitioned by site. It's useful if you want an
extremely-focused crawl, possibly burrowing through CGIs.
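The NUTCH-87 approach boils down to: look up the URL's site, then run only that site's regex list. A self-contained imitation of the URLFilter contract (return the URL to keep it, null to drop it), with plain Java collections in place of the MapFile -- class and method names are illustrative, not the actual patch:

```java
import java.net.URI;
import java.util.*;
import java.util.regex.Pattern;

public class PerSiteRegexFilter {
    // one regex list per site, standing in for the MapFile in NUTCH-87
    private final Map<String, List<Pattern>> patternsBySite = new HashMap<>();

    public void addPattern(String site, String regex) {
        patternsBySite.computeIfAbsent(site, s -> new ArrayList<>())
                      .add(Pattern.compile(regex));
    }

    // URLFilter contract: return the url to accept it, null to reject it
    public String filter(String url) {
        String host = URI.create(url).getHost();
        List<Pattern> patterns = patternsBySite.get(host);
        if (patterns == null) return null;          // unknown site: reject
        for (Pattern p : patterns)
            if (p.matcher(url).find()) return url;  // matched this site's list
        return null;
    }

    public static void main(String[] args) {
        PerSiteRegexFilter f = new PerSiteRegexFilter();
        f.addPattern("a.com", "^http://a\\.com/docs/");
        System.out.println(f.filter("http://a.com/docs/intro.html")); // accepted: prints the url
        System.out.println(f.filter("http://a.com/cgi-bin/x"));       // rejected: prints null
    }
}
```

The per-site partitioning is what keeps this fast with thousands of patterns: each URL is only ever tested against its own site's short list.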
If instead you want to crawl entire sites, but ignoring CGIs is
OK, then PrefixURLFilter is the easiest answer. Create a
newline-delimited text file with the site URLs and use it both as the
seed urls (nutch inject -urlfile) and as the prefixurlfilter config
file (set "urlfilter.prefix.file" in nutch-site.xml).
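Concretely, that setup looks something like this. The prefix file is just one URL prefix per line (file name and path here are examples, not requirements):

```
http://a.com/
http://b.com/
http://c.com/
```

And nutch-site.xml points the filter at it via the property Matt names:

```xml
<!-- nutch-site.xml: point PrefixURLFilter at the site list -->
<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urls.txt</value>
</property>
```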
HTH,
--Matt
On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
Hello all,
I am attempting to modify the RegexUrlFilter and/or the
NutchConfig so that I may dynamically apply a set of domain names
to the fetcher.
In the FAQ:
>> Is it possible to fetch only pages from some specific domains?
>> Please have a look at PrefixURLFilter. Adding some regular
expressions to the urlfilter.regex.file might work, but adding a
list with thousands of regular expressions would slow down your
system excessively.
I wish to be able to provide a list of urls that I want fetched, and I
want the fetcher to fetch only from those sites (not follow any links
out of them). I would like to be able to keep adding to this list
without having to modify the nutch-config.xml each time, and instead
just add it to the config (or other object) in memory. All I am after
is a pointer in the right direction. If I am looking in the wrong
files (or am off my rocker!) please let me know where I could/should go.
The reason I am asking is that I am working on a "roll your own
search". I want to be able to crawl specific sites only, and then, in
the search results, get only results pertaining to some subset of
those crawled sites.
Best regards,
Kristan Uccello
--
Matt Kangas / [EMAIL PROTECTED]
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general