Hi everyone, I am part of an academic research project that involves mining web structure to identify social links between organizations. I have been evaluating Nutch for the web-crawling task in our application stack. I have a few questions about it, and I would greatly appreciate it if someone could answer them or point me to the answers.
I read a few tutorials on the net and found that Nutch's (0.7.x) IWebDBReader provides APIs to get all the crawled pages (by URL/MD5) and to get the incoming links to and outgoing links from a particular URL. This is great, as it is precisely the functionality I was looking for to build a web-link graph and mine information out of it. In addition, the very simple plugin architecture used in Nutch made it look very attractive.

However, after working for a couple of hours with release 0.8.1, I realized that these APIs are no longer supported by the WebDBReader (only the lookup of incoming links is still supported in 0.8.1). This has left me wondering which version I should be using for my project. The definite advantage of 0.8.x is that it models almost every operation as a Map-Reduce call (which is amazing!) and is therefore much more scalable; but in the absence of the APIs mentioned above, it does not help me much in building the web-link graph from the crawler output.

I may be completely wrong here, and please correct me if I am, but it looks like since the 0.8.0 release the thrust has been to develop the Nutch project purely as an indexing library/application, with the crawl module itself losing its independence or decoupling. With 0.8.x, the crawl output by itself does not give much useful information (or at least I failed to locate such APIs).

I'll rephrase my concerns as concrete questions:

1) Is there a way (APIs) in the 0.8.x/0.9 releases of Nutch to access information about the crawled data, such as: get a page (contents) given a URL/MD5, get the outgoing links from a URL, and get all incoming links to a URL (this last API is provided; I mention it for the sake of completeness)? Or is there an easy way I can improvise these APIs?

2) If the answer to 1 is NO, are there any plans to add this functionality back in forthcoming releases?
3) If the answer to both 1 and 2 is NO, can someone point me to the discussions that explain the rationale behind these changes to the interface, which (in my opinion) leave the crawler module slightly weakened? (I tried scanning the forum posts back to the time when 0.7.2 was released but failed to locate any such discussion.)

As I mentioned earlier, I have only very recently started using Nutch, and many of my thoughts might be irrelevant or even completely wrong; please excuse me for them.

Thanks in advance!

Regards,
Gaurav
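P.S. To make question 1 more concrete, here is a minimal sketch of the kind of link-graph access I have in mind. It is plain Java and does not use any Nutch classes; the class name LinkGraph and its methods addLink, getOutlinks and getInlinks are hypothetical names I made up for illustration, and the sketch assumes the (source URL, target URL) pairs could somehow be extracted from the crawl output, which is exactly the part I do not know how to do in 0.8.x.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory link graph; NOT part of Nutch. */
final class LinkGraph {
    // url -> URLs that this page links to
    private final Map<String, Set<String>> outlinks = new HashMap<>();
    // url -> URLs of pages that link to this page (the inverted view)
    private final Map<String, Set<String>> inlinks = new HashMap<>();

    /** Records one hyperlink from a crawled page to a target URL. */
    void addLink(String fromUrl, String toUrl) {
        outlinks.computeIfAbsent(fromUrl, k -> new LinkedHashSet<>()).add(toUrl);
        inlinks.computeIfAbsent(toUrl, k -> new LinkedHashSet<>()).add(fromUrl);
    }

    /** Outgoing links of a URL; empty if the URL has none or is unknown. */
    Set<String> getOutlinks(String url) {
        return Collections.unmodifiableSet(
                outlinks.getOrDefault(url, Collections.emptySet()));
    }

    /** Incoming links of a URL; empty if the URL has none or is unknown. */
    Set<String> getInlinks(String url) {
        return Collections.unmodifiableSet(
                inlinks.getOrDefault(url, Collections.emptySet()));
    }

    public static void main(String[] args) {
        // Made-up example URLs, standing in for pairs read from crawl output.
        LinkGraph graph = new LinkGraph();
        graph.addLink("http://a.example/", "http://b.example/");
        graph.addLink("http://a.example/", "http://c.example/");
        graph.addLink("http://c.example/", "http://b.example/");

        System.out.println("out(a) = " + graph.getOutlinks("http://a.example/"));
        System.out.println("in(b)  = " + graph.getInlinks("http://b.example/"));
    }
}
```

Keeping both maps means the incoming-link view is simply the inversion of the outgoing-link view, so if only outgoing links can be read from the crawl output, the incoming ones can still be derived.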
