Brian Higgins wrote:
> Hi,
> I'm pretty new to Nutch and I'm trying to modify the code so it stores
> the words before and after a hyperlink as well as the anchor text.
> I've been looking through the Nutch code for a couple of days and I'm
> still a little unclear as to the layout...
> Nutch parses incoming webpages in HTMLParser.java, right? I can't seem to
> find the code in here for URL processing though - where exactly does it
> parse the anchor text and write it to the database?

The outlinks are collected in DOMContentUtils.getOutlinks. To capture
more of the surrounding text, you will need to look at the sibling nodes
that precede (and, for the words after the link, follow) the anchor
element, or go up to its parent node.
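
As a rough illustration of the sibling-walking idea, here is a minimal
sketch that uses only the standard org.w3c.dom API. It is not Nutch code:
the class LinkContext and the methods textBefore, textAfter and parse are
hypothetical names made up for this example, and it assumes the page is
already available as a DOM tree (here built from a small well-formed
snippet with the JDK's own XML parser). It only looks at siblings that
share the anchor's immediate parent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects the text found in the
// sibling nodes on either side of an anchor element.
public class LinkContext {

    // Only text and element nodes contribute; comments etc. are skipped.
    private static boolean carriesText(Node n) {
        return n.getNodeType() == Node.TEXT_NODE
            || n.getNodeType() == Node.ELEMENT_NODE;
    }

    // Collapse runs of whitespace and trim the ends.
    private static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Text of all siblings preceding the anchor, in document order.
    public static String textBefore(Node anchor) {
        StringBuilder sb = new StringBuilder();
        for (Node n = anchor.getPreviousSibling(); n != null; n = n.getPreviousSibling()) {
            if (carriesText(n)) {
                sb.insert(0, n.getTextContent()); // walking backwards, so prepend
            }
        }
        return normalize(sb.toString());
    }

    // Text of all siblings following the anchor, in document order.
    public static String textAfter(Node anchor) {
        StringBuilder sb = new StringBuilder();
        for (Node n = anchor.getNextSibling(); n != null; n = n.getNextSibling()) {
            if (carriesText(n)) {
                sb.append(n.getTextContent());
            }
        }
        return normalize(sb.toString());
    }

    // Parse a well-formed snippet into a DOM tree (stand-in for a parsed page).
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<p>Read the <b>official</b> docs at "
            + "<a href=\"http://example.org/\">this site</a> before you start.</p>");
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Node a = anchors.item(i);
            System.out.println("before: " + textBefore(a));
            System.out.println("anchor: " + normalize(a.getTextContent()));
            System.out.println("after:  " + textAfter(a));
        }
    }
}
```

If the link sits inside an inline element such as a <b> or <span>, the
siblings alone may be empty, which is where climbing to the parent node
and repeating the walk from there comes in.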

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
