[Nutch-general] 0.8.x Crawler compared to 0.7.2 Crawler

Gaurav Agarwal Tue, 27 Mar 2007 12:12:13 -0800

Hi everyone,

I am part of an academic research project that involves mining web structure
to identify social links between organizations. I have been evaluating Nutch
for the web-crawling task in our application stack. I have a few questions
about it, and I would greatly appreciate it if someone could answer them or
point me to the answers.


I read a few tutorials on the net and found that Nutch's (0.7.x)
IWebDBReader provides APIs to get all the crawled pages (by URL/MD5) and to
get the incoming links to and outgoing links from a particular URL. This is
great, as it is precisely the functionality I was looking for to build a
web-link graph and mine information out of it. In addition, the very simple
plugin architecture used in Nutch made it look very attractive. However,
after working for a couple of hours with release 0.8.1, I realized that
these APIs are no longer supported by the WebDBReader (only the lookup of
incoming links to a URL is supported in 0.8.1). This has left me wondering
which version I should be using for my project.

The advantage of 0.8.x is definitely that it models almost every operation
as a Map-Reduce job (which is amazing!) and is therefore much more scalable;
but in the absence of the APIs mentioned above, it does not help me much in
building the web-link graph from the crawler output.
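
To make the requirement concrete, here is a rough sketch of the kind of
link graph I would like to build from the crawl output. It is plain Java and
uses no Nutch classes; the class and method names (LinkGraph, addLink,
outlinks, inlinks) are made up for illustration, and it assumes the crawler
can hand me (source URL, target URL) pairs in some form.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch: an in-memory link graph
// built from (source URL, target URL) pairs.
class LinkGraph {
    // source URL -> URLs it links to
    private final Map<String, Set<String>> out = new HashMap<>();
    // target URL -> URLs that link to it
    private final Map<String, Set<String>> in = new HashMap<>();

    // Record one hyperlink in both directions, so that incoming links
    // are obtained by inverting the outgoing links.
    void addLink(String fromUrl, String toUrl) {
        out.computeIfAbsent(fromUrl, k -> new TreeSet<>()).add(toUrl);
        in.computeIfAbsent(toUrl, k -> new TreeSet<>()).add(fromUrl);
    }

    // Outgoing links from a URL (empty set if the URL is unknown).
    Set<String> outlinks(String url) {
        return out.getOrDefault(url, Collections.emptySet());
    }

    // Incoming links to a URL (empty set if the URL is unknown).
    Set<String> inlinks(String url) {
        return in.getOrDefault(url, Collections.emptySet());
    }
}
```

In other words, if I can get at the outgoing links of every crawled page, I
can derive the incoming links myself; what I cannot find in 0.8.x is the
first half.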

I may be completely wrong here, and please correct me if I am, but it looks
like since the 0.8.0 release the thrust has been to develop the Nutch
project purely as an indexing library/application, with the crawl module
itself losing its independence. With 0.8.x, the crawl output by itself does
not give much useful information (or at least I failed to locate such
APIs).

I'll rephrase my concerns as concrete questions:

1) Is there a way (an API) in the 0.8.x/0.9 releases of Nutch to access
information about the crawled data, such as: get a page's content given its
URL/MD5, get the outgoing links from a URL, and get all incoming links to a
URL (this last API is provided; I mention it for the sake of completeness)?
Or is there an easy way I can improvise these APIs?

2) If the answer to 1 is NO, are there any plans to add this functionality
back in forthcoming releases?

3) If the answer to both 1 and 2 is NO, can someone point me to the
discussions that explain the rationale behind these changes to the
interface, which (in my opinion) leave the crawler module slightly weakened?
(I tried scanning the forum posts back to the time when 0.7.2 was released
but failed to locate any such discussion.)

As I mentioned earlier, I have only very recently started using Nutch, and
many of my thoughts might be irrelevant or even completely wrong; please
excuse me for them.

Thanks in advance!

Regards,
Gaurav
-- 
View this message in context: 
http://www.nabble.com/0.8.x-Crawler-compared-to-0.7.2-Crawler-tf3475330.html#a9700124
Sent from the Nutch - User mailing list archive at Nabble.com.

