Erik Hatcher wrote:
Is it possible to inject outlinks into the ParseData stored within a segment of an already fetched/parsed page? After a bit of digging into the code, I'm not seeing anything to make this possible yet.


I'm assuming you mean "during parsing" - segment data written to disk is treated as immutable, i.e. there are no tools to modify it on-disk.

My use case is that I want to crawl sites with RDF metadata behind the scenes. Some of the sites I'll crawl will have the link to the metadata physically in the HTML, but I'd like to provide a back door for sites that have metadata but have not yet added links to it from the HTML.

For the moment, I'll create a custom HTML parser that will automatically inject them - but I was wondering if there was a more direct way to affect the ParseData outlinks so that the next fetch will pick them up.

There is a plugin hook in the HTML parser, at the point where it calls the HTML filters (HtmlParser.java:207). These filters can add to or modify anything collected so far. You could implement an HTMLFilter plugin similar to the creativecommons plugin; it would be called automatically at this point and could add the outlinks.
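
To make the idea concrete, here is a rough, self-contained sketch of the filter pattern. It does not use the real Nutch classes: Link, LinkFilter, RdfBackDoorFilter and runFilters are hypothetical stand-ins, and the rule "metadata lives at the page URL plus a suffix" is only an assumption for illustration. A real plugin would implement Nutch's own filter interface and be registered through the plugin system, as the creativecommons plugin is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: every type here is a simplified stand-in, NOT the Nutch API.
public class OutlinkInjectionSketch {

    /** Stand-in for an outlink: target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    /** Hypothetical hook: gets the links collected so far, returns the links to keep. */
    interface LinkFilter {
        List<Link> filter(String pageUrl, List<Link> collected);
    }

    /** Appends a metadata link unless the page already links to it. */
    static final class RdfBackDoorFilter implements LinkFilter {
        private final String suffix; // assumed convention, e.g. ".rdf"

        RdfBackDoorFilter(String suffix) {
            this.suffix = suffix;
        }

        @Override
        public List<Link> filter(String pageUrl, List<Link> collected) {
            String metadataUrl = pageUrl + suffix;
            List<Link> result = new ArrayList<>(collected);
            for (Link link : collected) {
                if (link.toUrl.equals(metadataUrl)) {
                    return result; // already linked from the HTML, nothing to add
                }
            }
            result.add(new Link(metadataUrl, "RDF metadata"));
            return result;
        }
    }

    /** Mimics the parser side: pass the collected links through each filter in order. */
    static List<Link> runFilters(String pageUrl, List<Link> parsed, List<LinkFilter> filters) {
        List<Link> current = parsed;
        for (LinkFilter filter : filters) {
            current = filter.filter(pageUrl, current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Link> parsed = Arrays.asList(new Link("http://example.org/about.html", "About"));
        List<LinkFilter> filters = Arrays.<LinkFilter>asList(new RdfBackDoorFilter(".rdf"));
        for (Link link : runFilters("http://example.org/index.html", parsed, filters)) {
            System.out.println(link.toUrl + " [" + link.anchor + "]");
        }
    }
}
```

Because the filter runs while the page is being parsed, the extra link ends up among the outlinks written for that page, so nothing already stored on disk needs to be modified.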


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general