Hi,
I think it is an interesting idea, but from a technical perspective, the
decision to use HiveMind or Spring should, in my opinion, be made for the
whole project. The same goes for JDK 5.0. So right now it is not the best
match for Nutch.
On the functionality side I am not the best person to judge it, as I am
doing rather big crawls across many hosts, but it sounds interesting.
Regards,
Piotr
Erik Hatcher wrote:
Kelvin,
Big +1!!! I'm working on focused crawling as well, and your work fits
well with my needs.
An implementation detail - have you considered using HiveMind rather
than Spring? This would be much more compatible license-wise with
Nutch and be easier to integrate into the ASF repository. Further - I
wonder if the existing plugin mechanism would work well as a
HiveMind-based system too.
Erik
On Aug 23, 2005, at 12:02 AM, Kelvin Tan wrote:
I've been working on some changes to crawling to facilitate its use
as a non-whole-web crawler, and would like to gauge interest on this
list about including it somewhere in the Nutch repo, hopefully before
the map-red branch gets merged in.
It is basically a partial re-write of the whole fetching mechanism,
borrowing large chunks of code here and there.
Features include (a rough sketch of these extension points follows the list):
- Customizable seed inputs, i.e. seed a crawl from a file, database,
Nutch FetchList, etc
- Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs
within their domains (this can already be accomplished manually with
RegexURLFilter, but what if there are 200,000 seed URLs?), or crawl the
seed URL domains plus one external link (not possible with the current
filter mechanism)
- Online fetchlist building (as opposed to Nutch’s offline method),
and customizable strategies for building a fetchlist. The default
implementation gives priority to hosts with a larger number of pages
to crawl. Note that offline fetchlist building is ok too.
- Runs continuously until all links are crawled
- Customizable fetch output mechanisms, like output to file, to WebDB, or
even not at all (if we’re just implementing a link-checker, for example)
- Fully utilizes HTTP 1.1 connection persistence and request pipelining
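To make the extension points above concrete, here is a minimal Java
sketch of what they might look like. The names (SeedSource, CrawlScope,
FetchListPrioritizer, FetchOutput) are hypothetical stand-ins for
illustration, not the actual classes in the code; imagine each interface
in its own source file:

    import java.net.URL;
    import java.util.Iterator;

    /** Where a crawl's seeds come from: a file, a database, a Nutch
     *  fetchlist, etc. */
    public interface SeedSource {
        Iterator<URL> getSeeds();   // JDK 5 generics
    }

    /** Decides whether a discovered link falls inside the crawl,
     *  e.g. same-domain-only, or seed domains plus one external hop. */
    public interface CrawlScope {
        boolean inScope(URL seed, URL candidate);
    }

    /** Orders the online fetchlist; the default described above would
     *  favour hosts with more pages queued. */
    public interface FetchListPrioritizer {
        int priority(String host, int pagesQueuedForHost);
    }

    /** What to do with fetched content: write a Nutch segment, update
     *  the WebDB, or discard it (e.g. for a link checker). */
    public interface FetchOutput {
        void store(URL url, byte[] content);
    }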
It is fully compatible with Nutch as it is, i.e. given a Nutch
fetchlist, the new crawler can produce a Nutch segment. However, if
you don’t need that at all, and are just interested in Nutch as a
crawler, then that’s ok too!
It is a drop-in replacement for the Nutch crawler, and compiles against
the recently released 0.7 jar.
Some disclaimers:
It was never designed to be a superset replacement for the Nutch
crawler. Rather, it is tailored to fairly specific requirements of
what I believe is called constrained crawling. It uses the Spring
Framework (for easy customization of implementation classes) and JDK 5
features (the new for-each loop syntax, autoboxing, generics, etc.).
These two choices sped up development, but probably make it a less
palatable Nutch acquisition. ;-) It shouldn't be tough to do something
about that, though.
One of the areas where the Nutch crawler could use improvement is that
it's really difficult to extend and customize. With the addition of
interfaces and beans, developers can supply their own mechanism for
fetchlist prioritization, or use a B-Tree as the backing implementation
of the database of crawled URLs. I'm using Spring to make implementations
easy to swap and to keep the coupling loose.
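For concreteness, here is the kind of Spring XML wiring this enables.
This is a minimal sketch with made-up bean ids and class names
(org.example.crawler.*), not the actual configuration shipped with the
code:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN"
        "http://www.springframework.org/dtd/spring-beans.dtd">
    <beans>
      <!-- Swap this for a B-Tree-backed class by editing the XML;
           no crawler code needs to change. -->
      <bean id="crawlDb" class="org.example.crawler.InMemoryCrawlDb"/>

      <!-- The fetchlist prioritization strategy is just another bean. -->
      <bean id="prioritizer"
            class="org.example.crawler.HostPageCountPrioritizer"/>

      <bean id="crawler" class="org.example.crawler.Crawler">
        <property name="crawlDb" ref="crawlDb"/>
        <property name="prioritizer" ref="prioritizer"/>
      </bean>
    </beans>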
There are some places where existing Nutch functionality is duplicated
in some way to allow for slight modifications, as opposed to patching the
Nutch classes. The rationale behind this approach was to simplify
integration: it is much easier to have Our Crawler as a separate jar
which depends on the Nutch jar. Furthermore, if it doesn't get accepted
into Nutch, no rewriting or patching of Nutch sources needs to be done.
It's my belief that if you're using Nutch for anything but whole-web
crawling and need to make even small changes to the way the crawling is
performed, you'll find Our Crawler helpful.
I consider the current code beta quality. I've run it on smallish crawls
(200k+ URLs) and things seem to be working OK, but it is nowhere near
production quality.
Some related blog entries:
Improving Nutch for constrained crawls
http://www.supermind.org/index.php?p=274
Reflections on modifying the Nutch crawler
http://www.supermind.org/index.php?p=283
Limitations of OC
http://www.supermind.org/index.php?p=284
Even if we decide not to include it in the Nutch repo, the code will
still be released under the APL. I'm in the process of adding a bit more
documentation and a shell script for running it, and will release the
files over the next couple of days.
Cheers,
Kelvin
http://www.supermind.org