Simon Detri wrote:
Hello,
After the crawl is done, I would like to query the webdb for pages (by URL), and I would like to access the content of these pages.
I see that there is a method WebDBReader.getUrl(String url) which returns a Page. Is there a way to get the recno of this Page so that I can retrieve the Content by doing something like this:
    // code from net.nutch.protocol.Content.java
    File file = new File(segment, DIR_NAME);
    ArrayFile.Reader contents = new ArrayFile.Reader(file.toString());
    Content content = new Content();
    contents.get(recno, content);
The quickest way is to build an index with the IndexSegment tool; you can then find the recno by searching the index for the URL. That is how NutchBean retrieves copies of Content. Note, however, that the same URL can appear in many segments, pointing to the same or different content (e.g. different versions). Ordinarily, after creating the segment indexes, you would run DeleteDuplicates to prune the segments' data.
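To make the point about duplicates concrete, here is a minimal, self-contained Java sketch of the idea. It does not use the Nutch API: SegmentUrlIndex, Hit, add, lookup and deleteDuplicates are hypothetical names invented for illustration, and the rule "keep the most recently fetched entry per URL" is an assumption, not a description of what the real DeleteDuplicates tool does internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a per-segment index: maps a URL to every
// (segment, recno) pair under which that URL was stored.
class SegmentUrlIndex {

    // One occurrence of a URL inside one segment (hypothetical type).
    static final class Hit {
        final String segment;  // segment directory name
        final int recno;       // record number inside that segment
        final long fetchTime;  // used here only to pick the newest copy

        Hit(String segment, int recno, long fetchTime) {
            this.segment = segment;
            this.recno = recno;
            this.fetchTime = fetchTime;
        }
    }

    private final Map<String, List<Hit>> hitsByUrl = new HashMap<>();

    // Record that 'url' is stored at 'recno' in 'segment'.
    void add(String url, String segment, int recno, long fetchTime) {
        hitsByUrl.computeIfAbsent(url, k -> new ArrayList<>())
                 .add(new Hit(segment, recno, fetchTime));
    }

    // All known locations of 'url'; there may be several, one per segment.
    List<Hit> lookup(String url) {
        List<Hit> hits = hitsByUrl.get(url);
        return hits == null ? Collections.emptyList()
                            : Collections.unmodifiableList(hits);
    }

    // Assumed pruning rule: keep only the most recently fetched hit per URL.
    void deleteDuplicates() {
        for (Map.Entry<String, List<Hit>> e : hitsByUrl.entrySet()) {
            Hit newest = e.getValue().get(0);
            for (Hit h : e.getValue()) {
                if (h.fetchTime > newest.fetchTime) {
                    newest = h;
                }
            }
            List<Hit> kept = new ArrayList<>();
            kept.add(newest);
            e.setValue(kept);
        }
    }
}
```

With two segments containing the same URL, lookup returns two hits before deleteDuplicates and one afterwards; the surviving hit's segment and recno are what you would pass to a reader such as the ArrayFile.Reader in the quoted snippet.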
-- Best regards, Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
