Is anyone here using Nutch for crawling digital scholarly archives?
If so, are you also harvesting and indexing additional metadata?
My group (http://www.patacriticism.org) is considering using Nutch to
crawl a specific set of sites, index the HTML as full text, and also
retrieve any associated RDF data (perhaps specified with a hyperlink
in a <meta> tag, as on this page:
http://www.rossettiarchive.org/docs/1-1847.s244.raw.html). Most likely
the RDF could simply be indexed as additional fields, but it might
also be added to an RDF engine (such as Kowari) and queried from the
search interface in conjunction with full-text searching.
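
To make the RDF-discovery step concrete, here is a minimal sketch in
Python (standard library only, not Nutch code) of pulling an RDF
hyperlink out of a page's <meta> tag and resolving it against the page
URL. The meta name "rdf", the function name find_rdf_links, and the
fallback to <link type="application/rdf+xml"> are all assumptions for
illustration; the real pages may mark up the link differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RdfLinkParser(HTMLParser):
    """Collect URLs of RDF documents referenced from <meta> or <link> tags."""

    def __init__(self, base_url, meta_name="rdf"):
        super().__init__()
        self.base_url = base_url
        self.meta_name = meta_name  # hypothetical meta name, adjust per site
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == self.meta_name:
            # Assumed form: <meta name="rdf" content="relative/or/absolute/url">
            target = attrs.get("content")
        elif tag == "link" and attrs.get("type") == "application/rdf+xml":
            # Assumed alternative: <link type="application/rdf+xml" href="...">
            target = attrs.get("href")
        else:
            return
        if target:
            # Resolve relative links against the URL of the crawled page.
            self.rdf_urls.append(urljoin(self.base_url, target))


def find_rdf_links(html, base_url, meta_name="rdf"):
    """Return absolute URLs of RDF documents referenced by the page."""
    parser = RdfLinkParser(base_url, meta_name)
    parser.feed(html)
    return parser.rdf_urls
```

In a Nutch setup the same logic would presumably live in a parse-time
plugin, with the discovered URL fetched and its contents added to the
document as extra index fields.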
The Ontology and Creative Commons plugins are great starting places,
for sure. I'm wondering what others have done along these lines.
Thanks,
Erik