Gaurav Agarwal wrote:
> Hi everyone,
>
> Definitely the advantage with 0.8.x is that it models almost every
> operation as a Map-Reduce call (which is amazing!), and therefore is
> much more scalable; but in the absence of the APIs mentioned above it
> does not provide me much help to build the web-link graph from the
> crawler output.
There is a similar API for reading from the DB, called CrawlDbReader. It
is relatively simple compared to WebDBReader, because most of the support
is already provided by Hadoop (i.e. the map-reduce framework). In 0.8 and
later the information about pages and the information about links are
split into two different DBs - crawldb and linkdb - but exactly the same
information can be obtained from them as before.

> I may be completely wrong here and please correct me if I am, but it
> looks like post the 0.8.0 release the thrust has been to develop the
> Nutch project completely as an indexing library/application, with the
> crawl module itself losing its independence or decoupling. With 0.8.x,
> the crawl output in itself does not give much useful information (or at
> least I failed to locate such APIs).

That's not the case - if anything, the amount of useful information you
can retrieve has tremendously increased. Please see all the tools
available through the bin/nutch script and prefixed with read* - and
then look at their implementation for inspiration.

> I'll rephrase my concerns as concrete questions:
>
> 1) Is there a way (APIs) in the 0.8.x/0.9 release of Nutch to access
> information about crawled data, like: get all pages (contents) given a

Fetched pages are stored in segments. Please see the SegmentReader tool,
which allows you to retrieve the segment content.

> URL/md5, get outgoing links from a URL, and get all incoming links to a

SegmentReader as above. For incoming links use the linkdb, and LinkDbReader.

> URL (this last API is provided; I mentioned it for the sake of
> completeness). Or an easy way I can improvise these APIs.
>
> 2) If the answer to 1 is NO, are there any plans to add this
> functionality back in the forthcoming releases?
>
> 3) If the answer to both 1 and 2 is NO, can someone point me to the
> discussions that explain the rationale behind making these changes to
> the interface, which (in my opinion) leaves the crawler module slightly
> weakened? (I tried scanning the forum posts back to the era when 0.7.2
> was released but failed to locate any such discussion.)

Please see above. The answer is yes. ;)
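In case it helps, here is roughly how those tools are invoked from the
command line - a quick sketch, not a reference. The crawl/... paths and
the segment name below are just placeholders for your own crawl output,
and running any of the commands without arguments prints its exact usage:

  # CrawlDbReader: global stats, or the stored status of a single URL
  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -url http://www.example.com/

  # LinkDbReader: all incoming links recorded for a URL
  bin/nutch readlinkdb crawl/linkdb -url http://www.example.com/

  # SegmentReader: page content (outlinks are in the parse data)
  bin/nutch readseg -dump crawl/segments/20070101000000 dumpdir
  bin/nutch readseg -get crawl/segments/20070101000000 http://www.example.com/

Each of these is a thin wrapper around the corresponding reader class, so
the same results can be obtained programmatically by calling those
classes from your own code.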
--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com