Re: IndexingChain and TermHash

Renaud Delbru Fri, 11 Dec 2009 09:19:57 -0800

Hi Michael,

I am reporting my experience with the codec interface. I have successfully implemented my own encoding, which is a kind of simplified tree-based encoding (similar to what you can find in XML IR). You can find more information about my project (siren) at [1]. The basic idea is to encode a term with three different identifiers (doc id, tuple id, and cell id) instead of only the doc id. Each term therefore belongs to a tree leaf and is tagged with the leaf path (doc id, tuple id, cell id).

I have converted the siren project to use my new encoding, and all the unit tests are passing, which is good news (it means no problems with the skip lists, term enumeration, or posting list reading).

For my use case, I had to "hijack" the normal use of the payload interface. Indeed, the codec receives only the following information: doc id, position, and payload. In order to pass the tuple id and cell id to my codec, I had to encode them into a payload in my analyzer, then decode them in my codec (in the StandardPositionsConsumer) in order to encode them into the index (rather than storing them as a payload, as the standard codec does). Then, in the StandardPositionProducer, I had to decode them from the index and re-encode them into the payload interface so that the segment merger works properly.

So, my remark here is about a potential improvement to the codec interface. I don't know if it can be done easily or if it is worth it, but an interface (an optional parameter) that allows passing additional information from the analyzers (e.g., certain attributes) directly into the codec could be handy (without having to pass it through the payload feature as I have done).
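To illustrate the payload round trip described above (tuple id and cell id encoded in the analyzer, then decoded in the codec), here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: the class name TupleCellPayload and the fixed 8-byte big-endian layout are hypothetical and are not the actual siren format, and the sketch deliberately does not touch any Lucene class.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not part of Lucene or siren): packs a tuple id and a
 * cell id into a fixed 8-byte payload and unpacks them again.
 * Assumed layout: [tupleId: 4 bytes][cellId: 4 bytes], big-endian.
 */
final class TupleCellPayload {
    private static final int LENGTH = 2 * Integer.BYTES;

    private TupleCellPayload() {}

    /** Analyzer side: turn the two ids into the bytes carried as payload. */
    static byte[] encode(int tupleId, int cellId) {
        return ByteBuffer.allocate(LENGTH).putInt(tupleId).putInt(cellId).array();
    }

    /** Codec side: read the tuple id back from the payload bytes. */
    static int tupleId(byte[] payload) {
        return ByteBuffer.wrap(check(payload)).getInt(0);
    }

    /** Codec side: read the cell id back from the payload bytes. */
    static int cellId(byte[] payload) {
        return ByteBuffer.wrap(check(payload)).getInt(Integer.BYTES);
    }

    /** Rejects payloads that do not match the assumed 8-byte layout. */
    private static byte[] check(byte[] payload) {
        if (payload == null || payload.length != LENGTH) {
            throw new IllegalArgumentException("expected a " + LENGTH + "-byte payload");
        }
        return payload;
    }

    public static void main(String[] args) {
        byte[] payload = encode(3, 7);
        System.out.println(tupleId(payload) + "," + cellId(payload)); // prints "3,7"
    }
}
```

A variable-length encoding would likely be more compact for small ids; the fixed layout is used here only because it keeps the decode side trivial.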

Another minor problem is that in the current 1458 branch, the IndexReader.open method that accepts the Codecs object is private. So, for the moment, I am obliged to first open an IndexWriter with my codecs, and then use IndexWriter.getReader to get an IndexReader.

Otherwise, congratulations on this very nice feature and piece of work. This is something I have wanted for a long time (I am doing research in the domain of inverted index data structures), and this feature opens up a wide range of new possibilities.

I am planning to implement variants of my current codecs in the short term, and more complex ones (with other skip list methods) in the medium term. I will continue to follow the progress of 1458, test it, and report my feedback and experience with it.


Thanks,
Best Regards

[1] http://siren.sindice.com
--
Renaud Delbru

On 16/11/09 13:01, Michael McCandless wrote:

Yes, the branch is here:

     https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458

Mark (Miller) periodically re-syncs it to trunk.

All tests should pass, and if you create a new Codec, please share the
experience!

There are not yet many Codecs in existence... the branch has the
"standard" codec (closest to Lucene's current index format, but makes
some compelling improvements to the terms dict), a "pulsing" codec
(which inlines low-freq terms into the terms dict), and an intblock codec
(an abstract base for building int-block codecs).  There's also the
PForDelta codec, attached to LUCENE-1410, which subclasses the
intblock codec and uses PForDelta encoding.  It's probably best to
peek at these example codecs for inspiration on how to build yours.

Mike

On Mon, Nov 16, 2009 at 7:28 AM, Renaud Delbru <renaud.del...@deri.org> wrote:

Hi Michael,

I see there is a huge amount of work already done in LUCENE-1458. Is
there a way to check out the corresponding branch and start to use it? At
least, to see if I can extend it and create my own Codec.
I have started on my side to abstract the indexing chain of Lucene 2.9, in
order to be able to plug in my own chain, but I have the impression that
you've done something similar already (with the codec abstraction). It would
be a pity to lose my time doing something less convenient than your approach.

Thanks.
--
Renaud Delbru

On 14/11/09 13:22, Michael McCandless wrote:

On Fri, Nov 6, 2009 at 1:34 PM, Renaud Delbru <renaud.del...@deri.org> wrote:

Hi Michael,

Thanks for the quick fix. I have tested it (indexing multiple documents +
searching), and it seems to work.

On 06/11/09 18:09, Michael McCandless wrote:

To be honest, you are sort of forging new territory here :)

I think so too, not an easy task ;o). I have seen that you have tried to
make the indexing chain of Lucene (DocumentsWriter) modular. I am still
trying to get a good understanding of the default indexing, but I would
like to see how easy (or difficult) it is to modify the format of the
postings. From my current understanding, it seems that only the consumer
at the end of this chain (FreqProxTermsWriter and its consumer
FormatPostingsFieldsWriter) has to be changed to a certain extent.

Right, those two classes do the writing of the postings, currently.

But with flexible indexing (LUCENE-1458), still in progress, we hope
to make the codec that actually reads & writes the postings more
easily pluggable.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

