Hello,
I'm implementing something similar at the moment. I'm feeding Nutch
a URL list in which each URL is annotated with an ID. This ID must go
into the Lucene index so that I can model a 1:many relation between a
database and the crawled pages.
I've added the custom data to the metadata field of the CrawlDatum; see
InjectMapper:
// add myID to the crawlDatum as metaData
MapWritable meta = new MapWritable();
meta.put(new Text("myID"), new Text(myID));
datum.setMetaData(meta);
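
How myID gets read from the URL list is not shown above. A minimal sketch of one way to do it, assuming (purely for illustration) that every line of the list has the form "<url>TAB<id>"; the class SeedLine and its parse method are hypothetical names and not part of Nutch:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

class SeedLine {
    // Splits one line of the URL list into (url, id).
    // ASSUMPTION: the line format is "<url>\t<id>"; adapt to the real list format.
    static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            throw new IllegalArgumentException("expected <url>TAB<id>: " + line);
        }
        String url = line.substring(0, tab).trim();
        String id = line.substring(tab + 1).trim();
        if (url.length() == 0 || id.length() == 0) {
            throw new IllegalArgumentException("empty url or id: " + line);
        }
        return new SimpleImmutableEntry<String, String>(url, id);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> seed = parse("http://example.com/page\t42");
        // The key would be injected as the URL, the value stored as "myID".
        System.out.println(seed.getKey() + " -> " + seed.getValue());
    }
}
```

The returned value would then be used as myID in the InjectMapper snippet above.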
Now the ID is saved in the CrawlDatum object. On the indexing side
I've written a new plugin, index-id, but it's simply a modified
index-basic ;) The essence is:
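MapWritable meta = datum.getMetaData();
// pages that were not injected with an ID have no "myID" entry
Text idText = (meta == null) ? null : (Text) meta.get(new Text("myID"));
String id = (idText == null) ? "" : idText.toString();
// compare the string's content; != would only compare references
if (id.length() > 0) {
    Field myid = new Field("myid", id, Field.Store.YES,
                           Field.Index.UN_TOKENIZED);
    myid.setBoost(5.0f);
    doc.add(myid);
    LOG.info("The following ID was added to the index: " + id);
}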
So that's where I stand at the moment. Next I have to build a custom
query interface so that I can search my MySQL database and enrich
the results with my crawled sites.
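
A rough sketch of the 1:many join I have in mind, not a finished query interface. It assumes (hypothetically) that each search hit has already been reduced to a {myid, url} pair read from the stored "myid" field; PageJoin and groupByMyId are made-up names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PageJoin {
    // Groups crawled page URLs under the database ID they were injected with.
    // ASSUMPTION: each hit is a two-element array {myid, url}.
    static Map<String, List<String>> groupByMyId(List<String[]> hits) {
        Map<String, List<String>> byId = new LinkedHashMap<String, List<String>>();
        for (String[] hit : hits) {
            List<String> urls = byId.get(hit[0]);
            if (urls == null) {
                urls = new ArrayList<String>();
                byId.put(hit[0], urls);
            }
            urls.add(hit[1]);
        }
        return byId;
    }

    public static void main(String[] args) {
        List<String[]> hits = Arrays.asList(
            new String[] {"42", "http://example.com/a"},
            new String[] {"7", "http://example.org/"},
            new String[] {"42", "http://example.com/b"});
        // One database row (ID 42) maps to two crawled pages.
        System.out.println(groupByMyId(hits).get("42"));
    }
}
```

A database row with a given ID could then be enriched by looking up its list of pages in the returned map.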
Maybe we can join forces. Feel free to contact me :) Greetings,
Sebastian Steinmetz