Dear Lucene developers,

As part of a school project, I've proposed to implement a gap encoding
scheme and use it to replace the current byte-aligned (VInt) scheme in
Lucene. The main purpose of modifying the Lucene source is the learning
experience: dealing with open source code, conforming to an existing system,
and handling the related design issues.

The gap encoding I'm implementing is the Fixed Binary Coding scheme, in which
each gap in a group is coded using the same number of bits. The scheme is
described in this paper: http://crpit.com/confpapers/CRPITV27Anh.pdf. Its
main advantage is extremely fast decoding. To index efficiently, the list of
gaps needs to be decomposed into groups so that the total number of bits used
is minimized.
For example, a list of gaps might be broken into 4 groups: (38, 17, 13, 34),
(6), (4, 1, 3, 1), (2, 3, 1). A list of 4 selectors, 9, 1, 6, 9, is stored,
one per group. From a selector we know how many bits are used to code each
gap in the associated group, and how many gaps that group contains.
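To make the packing concrete, here is a minimal sketch (the class and method
names are my own, and the selector table from the paper is left out; each
group is simply packed and unpacked at a fixed bit width, which works as long
as count * bits <= 64):

// Minimal illustration of fixed binary packing; not the actual patch code.
class FixedBinaryCoder {

    // Pack 'count' gaps, each using 'bits' bits, into a long
    // (first gap ends up in the highest bits).
    static long pack(int[] gaps, int offset, int count, int bits) {
        long packed = 0L;
        for (int i = 0; i < count; i++) {
            packed = (packed << bits) | (gaps[offset + i] & ((1L << bits) - 1));
        }
        return packed;
    }

    // Unpack 'count' gaps of 'bits' bits each.
    static int[] unpack(long packed, int count, int bits) {
        int[] gaps = new int[count];
        long mask = (1L << bits) - 1;
        for (int i = count - 1; i >= 0; i--) {
            gaps[i] = (int) (packed & mask);
            packed >>>= bits;
        }
        return gaps;
    }

    public static void main(String[] args) {
        int[] group = {38, 17, 13, 34};   // first group from the example above
        int bits = 6;                     // 6 bits are enough for the largest gap, 38
        long packed = pack(group, 0, group.length, bits);
        int[] decoded = unpack(packed, group.length, bits);
        System.out.println(java.util.Arrays.toString(decoded));  // [38, 17, 13, 34]
    }
}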

Having described the new coding scheme, here are the changes I have made to
the Lucene source code:
+ I mostly touched classes that deal with the proximity stream (.prx). In
DocumentWriter.writePostings() and SegmentMerger.appendPostings(), instead of
calling writeVInt() for each gap, I walk the whole gap list, decompose it
into groups, encode the whole list as described above, and write the result
with writeByte() to the .prx stream (a rough sketch of this is given after
this list).
+ Decoding is handled gap by gap in SegmentTermPositions.
+ By the nature of the scheme, the selector list is needed to decode the
gaps, so I added a new .sel stream to store the selectors. Since gaps and
selectors go together, everywhere the .prx stream and proxPointer appear I
added a corresponding .sel stream and selPointer. In detail, I made additions
to DocumentWriter, SegmentTermEnum, SegmentMerger, SegmentReader,
SegmentTermDocs, SegmentTermPositions, TermInfo, and TermInfoWriter.
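In case it helps to see the writer side concretely, here is a rough,
simplified sketch (the class, method, and variable names are mine, the
selector byte layout is made up for illustration, and the grouping is a naive
fixed group size rather than the optimal decomposition from the paper):

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

class ProxSelWriterSketch {

    // Instead of writeVInt() per gap: group the gaps, write one selector byte
    // per group to the .sel stream, and the packed gap bits to the .prx stream.
    static void writePositionGaps(IndexOutput proxOut, IndexOutput selOut, int[] gaps)
            throws IOException {
        final int groupSize = 4;
        for (int start = 0; start < gaps.length; start += groupSize) {
            int count = Math.min(groupSize, gaps.length - start);

            // Width = bits needed for the largest gap in this group
            // (gaps are positive ints, so bits <= 31).
            int bits = 1;
            for (int i = start; i < start + count; i++) {
                bits = Math.max(bits, 32 - Integer.numberOfLeadingZeros(Math.max(gaps[i], 1)));
            }

            // Illustrative selector layout: group size in the high bits,
            // bit width in the low 5 bits.
            selOut.writeByte((byte) ((count << 5) | bits));

            // Pack the group at 'bits' per gap and flush complete bytes to .prx.
            long packed = 0L;
            int pending = 0;
            for (int i = start; i < start + count; i++) {
                packed = (packed << bits) | (gaps[i] & ((1L << bits) - 1));
                pending += bits;
                while (pending >= 8) {
                    pending -= 8;
                    proxOut.writeByte((byte) (packed >>> pending));
                }
            }
            if (pending > 0) {                      // pad the group's last partial byte
                proxOut.writeByte((byte) (packed << (8 - pending)));
            }
        }
    }
}

The reader side would mirror this: SegmentTermPositions seeks both streams
using proxPointer and selPointer, reads a selector from .sel, and then
decodes that many gaps from .prx.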

My questions:
+ Would adding an extra .sel stream alongside each .prx stream cause too much
overhead? What should I watch out for?
+ The newly added streams seem to be causing a Java "out of heap space" error
during the merging process, around the 3rd or 4th document in my collection.
Any comments on that?
+ Besides the classes listed above, are there any other classes or
dependencies I should be aware of that might cause big problems when indexing
a large collection?

I would really appreciate any comments or advice. Thank you.

PS: Is this message too long for this mailing list? Should I break it down
into specific issues and make it shorter for people to address? :)

Regards,
Luong Minh Thang
