Re: Creating different binary databases for indexing

Andrzej Bialecki Tue, 09 May 2006 14:46:32 -0700

Dennis Kubes wrote:

I am working on a boosting solution where I have to create more binary databases than just the linkdb, crawldb, etc. For example, I create one for uncommon words in a page. Then I want to use these database objects inside the indexing process, in the filters, by key, along with the linkdb, parse text, parse data, and so on. The link database and the parse text and data are passed into the filters directly through the filter interface. I can't pass other databases alongside because I would have to change the interface, which means I would have to refactor all existing indexing filters. The easiest way I have found so far is modifying the parse interface to also hold the database objects that I need, but that doesn't feel like a good long-term solution.
Is there a better way to pass other keyed value (database) objects into the indexing filters? Should we start a discussion about whether we need this functionality in Nutch and how best to implement it? I would be happy to implement it, but I want some discussion and opinions first.

I'm not sure I understood all your requirements. Anyway, you can pass arbitrary Writable objects to the Indexer's map() and reduce(), because they will be wrapped into ObjectWritable. In particular, you could pass some data retrieved from an input file (using SequenceFileInputFormat), if you previously stored your database values in such a file. Or you could stick the primary key of the DB record inside CrawlDatum.metaData, and then retrieve the record data from the DB during reduce ...

All of the above you can accomplish without changing any of the interfaces, just by adding properly formatted data to the input and then using an indexing plugin that can consume this particular type of input data.
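
To illustrate the idea of consuming an extra type of input without touching the interfaces, here is a rough sketch in plain Java. It is not Nutch or Hadoop code: ParseTextRecord, UncommonWordsRecord and IndexReduceSketch are hypothetical stand-ins, and the List<Object> plays the role of the values that reduce() would see for one key after unwrapping them from ObjectWritable. The only point is the dispatch on the value's type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an existing per-URL record (e.g. parse text).
final class ParseTextRecord {
    final String text;
    ParseTextRecord(String text) { this.text = text; }
}

// Hypothetical stand-in for a record from the extra "uncommon words" database.
final class UncommonWordsRecord {
    final List<String> words;
    UncommonWordsRecord(List<String> words) { this.words = words; }
}

final class IndexReduceSketch {
    // Mimics one reduce() call: every value stored under the same key (the URL)
    // arrives together, and each one is handled according to its runtime type.
    static Map<String, String> reduce(String url, List<Object> values) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        for (Object value : values) {
            if (value instanceof ParseTextRecord) {
                doc.put("content", ((ParseTextRecord) value).text);
            } else if (value instanceof UncommonWordsRecord) {
                // The new input type is picked up here; nothing else had to change.
                doc.put("uncommon", String.join(" ", ((UncommonWordsRecord) value).words));
            }
            // Values of any other type are skipped in this sketch.
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Object> values = new ArrayList<>();
        values.add(new ParseTextRecord("a page about zymurgy"));
        values.add(new UncommonWordsRecord(Arrays.asList("zymurgy")));
        System.out.println(reduce("http://example.com/", values));
    }
}
```

In a real job the extra records would come from their own input file keyed by URL, and the type check would sit in the indexer or in a plugin that knows about that record type.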


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
