Is anyone here using Nutch for crawling digital scholarly archives? If so, are you also harvesting and indexing additional metadata?

My group (http://www.patacriticism.org) is considering using Nutch to crawl a specific set of sites, index the HTML as full text, and retrieve any associated RDF data (referenced by a hyperlink, perhaps in a <meta> tag, as on this page: http://www.rossettiarchive.org/docs/1-1847.s244.raw.html). Most likely the RDF could simply be indexed as additional fields, but it might also be loaded into an RDF engine (such as Kowari) and queried from the search interface in conjunction with full-text searching.
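To make the RDF-discovery step concrete, here is a rough sketch of what I have in mind, written in Python for brevity rather than as a Nutch plugin. It scans a page's tags for a pointer to an RDF document and resolves it against the page URL. The markup conventions it looks for are assumptions on my part: a <link> element with type "application/rdf+xml", or a hypothetical <meta name="rdf" content="..."> tag. The names RdfLinkFinder and find_rdf_links are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RdfLinkFinder(HTMLParser):
    """Collects URLs of RDF documents referenced from <link> or <meta> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        target = None
        if tag == "link" and attrs.get("type") == "application/rdf+xml":
            # Assumed convention: RDF advertised through a typed <link> element.
            target = attrs.get("href")
        elif tag == "meta" and attrs.get("name") == "rdf":
            # Hypothetical convention: <meta name="rdf" content="URL">.
            target = attrs.get("content")
        if target:
            # Resolve relative references against the URL of the crawled page.
            self.rdf_urls.append(urljoin(self.base_url, target))


def find_rdf_links(html, base_url):
    """Return absolute URLs of RDF documents referenced by the given HTML."""
    finder = RdfLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.rdf_urls
```

In a real crawl the returned URLs would be fetched and their statements turned into extra index fields alongside the page's full text; that part is left out here because it depends on how the RDF is modelled.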

The Ontology and Creative Commons plugins are great starting places, for sure. I'm wondering what others have done along these lines.

Thanks,
    Erik



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
