Gaurav Agarwal wrote:
> Hi everyone,
>
> Definitely the advantage with 0.8.x is that it models almost every
> operation as a Map-Reduce call (which is amazing!), and therefore is
> much more scalable; but in the absence of the APIs mentioned above it
> does not provide me much help to build the web-link graph from the
> crawler output.
There is a similar API for reading from the DB, called CrawlDbReader. It
is relatively simple compared to WebDBReader, because most of the support
is already provided by Hadoop (i.e. the map-reduce framework). In 0.8 and
later the information about pages and the information about links are
split into two different DBs - crawldb and linkdb - but exactly the same
information can be obtained from them as before.

> I may be completely wrong here and please correct me if I am, but it
> looks like post the 0.8.0 release the thrust has been to develop the
> Nutch project completely as an indexing library/application, with the
> crawl module itself losing its independence or decoupling. With 0.8.x,
> the crawl output in itself does not give much useful information (or at
> least I failed to locate such APIs).

That's not the case - if anything, the amount of useful information you
can retrieve has tremendously increased. Please see all the tools
available through the bin/nutch script and prefixed with read* - and
then look at their implementation for inspiration.

> I'll rephrase my concerns as concrete questions:
>
> 1) Is there a way (APIs) in the 0.8.x/0.9 release of Nutch to access
> information about crawled data, like: get all pages (contents) given a

Fetched pages are stored in segments. Please see the SegmentReader tool,
which allows you to retrieve the segment content.

> URL/md5, get outgoing links from a URL, and get all incoming links to a

SegmentReader as above. For incoming links use the linkdb, and LinkDbReader.

> URL (this last API is provided; I mentioned it for the sake of
> completeness). Or an easy way I can improvise these APIs.
>
> 2) If the answer to 1 is NO, are there any plans to add this
> functionality back in the forthcoming releases?
>
> 3) If the answer to both 1 and 2 is NO, can someone point me to the
> discussions that explain the rationale behind making these changes to
> the interface, which (in my opinion) leaves the crawler module slightly
> weakened? (I tried scanning the forum posts back to the era when 0.7.2
> was released but failed to locate any such discussion.)

Please see above. The answer is yes. ;)
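In case it helps, here is roughly how those tools are invoked from the
command line - a quick sketch, not a reference. The crawl/... paths and
the segment name below are just placeholders for your own crawl output,
and running any of the commands without arguments prints its exact usage:

  # CrawlDbReader: global stats, or the stored status of a single URL
  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -url http://www.example.com/

  # LinkDbReader: all incoming links recorded for a URL
  bin/nutch readlinkdb crawl/linkdb -url http://www.example.com/

  # SegmentReader: page content (outlinks are in the parse data)
  bin/nutch readseg -dump crawl/segments/20070101000000 dumpdir
  bin/nutch readseg -get crawl/segments/20070101000000 http://www.example.com/

Each of these is a thin wrapper around the corresponding reader class, so
the same results can be obtained programmatically by calling those
classes from your own code.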
--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com