Hi,
I think it is an interesting idea, but from a technical perspective, the
decision to use HiveMind or Spring should, in my opinion, be made for the
whole project. The same goes for JDK 5.0. So right now it is not the best
match for Nutch.
On the functionality side I am not the best person to judge it, as I am
doing rather big crawls across many hosts, but it sounds interesting.
Regards,
Piotr
Erik Hatcher wrote:
Kelvin,
Big +1!!! I'm working on focused crawling as well, and your work fits
well with my needs.
An implementation detail - have you considered using HiveMind rather
than Spring? This would be much more compatible license-wise with
Nutch and be easier to integrate into the ASF repository. Further - I
wonder if the existing plugin mechanism would work well as a
HiveMind-based system too.
Erik
On Aug 23, 2005, at 12:02 AM, Kelvin Tan wrote:
I've been working on some changes to crawling to facilitate its use
as a non-whole-web crawler, and would like to gauge interest on this
list about including it somewhere in the Nutch repo, hopefully before
the map-red branch gets merged in.
It is basically a partial re-write of the whole fetching mechanism,
borrowing large chunks of code here and there.
Features include (a rough sketch of these extension points follows the list):
- Customizable seed inputs, i.e. seed a crawl from a file, database,
Nutch FetchList, etc
- Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs
within their domains (this can already be accomplished manually with
RegexURLFilter, but what if there are 200,000 seed URLs?), or crawl the
seed URL domains plus one external link (not possible with the current
filter mechanism)
- Online fetchlist building (as opposed to Nutch’s offline method),
and customizable strategies for building a fetchlist. The default
implementation gives priority to hosts with a larger number of pages
to crawl. Note that offline fetchlist building is ok too.
- Runs continuously until all links are crawled
- Customizable fetch output mechanisms, like output to file, to WebDB, or
even not at all (if we’re just implementing a link-checker, for example)
- Fully utilizes HTTP 1.1 connection persistence and request pipelining
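To make the extension points above concrete, here is a minimal Java
sketch of what they might look like. The names (SeedSource, CrawlScope,
FetchListPrioritizer, FetchOutput) are hypothetical stand-ins for
illustration, not the actual classes in the code; imagine each interface
in its own source file:

    import java.net.URL;
    import java.util.Iterator;

    /** Where a crawl's seeds come from: a file, a database, a Nutch
     *  fetchlist, etc. */
    public interface SeedSource {
        Iterator<URL> getSeeds();   // JDK 5 generics
    }

    /** Decides whether a discovered link falls inside the crawl,
     *  e.g. same-domain-only, or seed domains plus one external hop. */
    public interface CrawlScope {
        boolean inScope(URL seed, URL candidate);
    }

    /** Orders the online fetchlist; the default described above would
     *  favour hosts with more pages queued. */
    public interface FetchListPrioritizer {
        int priority(String host, int pagesQueuedForHost);
    }

    /** What to do with fetched content: write a Nutch segment, update
     *  the WebDB, or discard it (e.g. for a link checker). */
    public interface FetchOutput {
        void store(URL url, byte[] content);
    }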
It is fully compatible with Nutch as it is, i.e. given a Nutch
fetchlist, the new crawler can produce a Nutch segment. However, if
you don’t need that at all, and are just interested in Nutch as a
crawler, then that’s ok too!
It is a drop-in replacement for the Nutch crawler, and compiles against
the recently released 0.7 jar.
Some disclaimers:
It was never designed to be a superset replacement for the Nutch
crawler. Rather, it is tailored to fairly specific requirements of
what I believe is called constrained crawling. It uses the Spring
Framework (for easy customization of implementation classes) and JDK 5
features (the new for-each loop syntax, autoboxing, generics, etc.).
These two choices sped up development, but probably make it a less
palatable Nutch acquisition. ;-) It shouldn't be tough to do something
about that, though.
One of the areas where the Nutch crawler could use improvement is that
it's really difficult to extend and customize. With the addition of
interfaces and beans, developers can supply their own mechanism for
fetchlist prioritization, or use a B-Tree as the backing implementation
of the database of crawled URLs. I'm using Spring to make implementations
easy to swap and to keep the coupling loose.
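For concreteness, here is the kind of Spring XML wiring this enables.
This is a minimal sketch with made-up bean ids and class names
(org.example.crawler.*), not the actual configuration shipped with the
code:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN"
        "http://www.springframework.org/dtd/spring-beans.dtd">
    <beans>
      <!-- Swap this for a B-Tree-backed class by editing the XML;
           no crawler code needs to change. -->
      <bean id="crawlDb" class="org.example.crawler.InMemoryCrawlDb"/>

      <!-- The fetchlist prioritization strategy is just another bean. -->
      <bean id="prioritizer"
            class="org.example.crawler.HostPageCountPrioritizer"/>

      <bean id="crawler" class="org.example.crawler.Crawler">
        <property name="crawlDb" ref="crawlDb"/>
        <property name="prioritizer" ref="prioritizer"/>
      </bean>
    </beans>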
There are some places where existing Nutch functionality is duplicated
in some way to allow for slight modifications, as opposed to patching the
Nutch classes. The rationale behind this approach was to simplify
integration: it is much easier to have Our Crawler as a separate jar
which depends on the Nutch jar. Furthermore, if it doesn't get accepted
into Nutch, no rewriting or patching of Nutch sources needs to be done.
It's my belief that if you're using Nutch for anything but whole-web
crawling and need to make even small changes to the way the crawling is
performed, you'll find Our Crawler helpful.
I consider the current code beta quality. I've run it on smallish crawls
(200k+ URLs) and things seem to be working OK, but it is nowhere near
production quality.
Some related blog entries:
Improving Nutch for constrained crawls
http://www.supermind.org/index.php?p=274
Reflections on modifying the Nutch crawler
http://www.supermind.org/index.php?p=283
Limitations of OC
http://www.supermind.org/index.php?p=284
Even if we decide not to include it in the Nutch repo, the code will
still be released under the APL. I'm in the process of adding a bit more
documentation and a shell script for running it, and will release the
files over the next couple of days.
Cheers,
Kelvin
http://www.supermind.org