Hi, I am using Nutch for indexing websites and it is working well (most of the time).
I've checked that Nutch extracts the outlinks from the raw HTML of each parsed page in order to expand the crawl. I would like to keep this behaviour, but I would also like to extract the outlinks from a specific part of the web page (for example, only from the body of a news article) and use them to build an alternative LinkDB. The goal is to know how news articles link to, and are linked from, other news articles within their content.

Can anybody give me an idea of where and how I could add this new feature?

Thanks in advance from a newbie ;)

--
View this message in context: http://old.nabble.com/Creating-an-alternative-Linkdb-with-part-of-the-outlinks-tp26842352p26842352.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
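
P.S. To make the idea concrete, here is a rough, standalone Java sketch of the extraction step I have in mind: find one container element in the parsed page and collect only the links inside it. It uses only the JDK DOM classes and no Nutch API at all. The class name ContentOutlinks, the "story" element id and the example URLs are made up for illustration, and the sample page is well-formed XHTML so the JDK parser accepts it. My guess is that in Nutch the same walk would belong in a parse filter plugin that receives the parsed DOM, but I have not verified that.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ContentOutlinks {

    /** Depth-first search for the first element whose "id" attribute equals id. */
    public static Element findById(Node node, String id) {
        if (node instanceof Element) {
            Element e = (Element) node;
            if (id.equals(e.getAttribute("id"))) {
                return e;
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Element found = findById(children.item(i), id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Appends the href of every <a> element at or below node, in document order. */
    public static void collectLinks(Node node, List<String> out) {
        if (node instanceof Element) {
            Element e = (Element) node;
            if ("a".equalsIgnoreCase(e.getTagName()) && e.hasAttribute("href")) {
                out.add(e.getAttribute("href"));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectLinks(children.item(i), out);
        }
    }

    /** Returns only the links inside the element with the given id (empty if it is absent). */
    public static List<String> contentOutlinks(Node root, String contentId) {
        List<String> links = new ArrayList<>();
        Element content = findById(root, contentId);
        if (content != null) {
            collectLinks(content, links);
        }
        return links;
    }

    /** Parses a well-formed XHTML string into a DOM document. */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical page: one navigation link, two links inside the article body.
        String page = "<html><body>"
                + "<div id=\"nav\"><a href=\"/home\">Home</a></div>"
                + "<div id=\"story\"><p>See <a href=\"http://example.com/news/1\">this</a>"
                + " and <a href=\"http://example.com/news/2\">that</a>.</p></div>"
                + "</body></html>";
        // Only the links under the "story" element are collected.
        System.out.println(contentOutlinks(parse(page), "story"));
    }
}
```

The normal outlinks would still be extracted from the whole page for crawling; this restricted list would only feed the second, alternative LinkDB.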