Implementing indexing of Versioned Document Collections

Alex vB Tue, 09 Nov 2010 14:30:51 -0800

Hello everybody,

I would like to implement the paper "Compact Full-Text Indexing of Versioned
Document Collections" [1] from Torsten Suel for my diploma thesis in Lucene.
The basic idea is to create a two-level index structure. On the first level
a document is identified by document ID with a posting list entry if the
term exists at least in one version. For every posting on the first level
with term t we have a bitvector on the second one. These bitvectors contain
as many bits as there are versions for one document, and bit i is set to 1
if version i contains term t or otherwise it remains 0.

http://lucene.472066.n3.nabble.com/file/n1872701/Unbenannt_1.jpg

This little picture is just for demonstration purposes. It shows a posting
list for the term car and is composed of 4 document IDs. If a hit is found
in document 6 another look-up is needed on the second level to get the
corresponding versions (version 1, 5, 7, 8, 9, 10 from 10 versions at all).

At the moment I am using wikipedia (simplewiki dump) as source with a
SAXParser and can resolve each document with all its versions from the XML
file (Fields are Title, ID, Content(seperated for each version)). My problem
is that I am unsure how to connect the second level with the first one and
how to store it. The key points that are needed:
- Information from posting list creation to create the bitvector (term ->
doc -> versions)
- Storing the bitvectors
- Implementing search on second level

For the first steps I disabled term frequencies and positions because the
paper isn't handling them. I would be happy to get any running version at
all. :)
At the moment I can create bitvectors for the documents. I realized this
with a HashMap<String, BitSet> in TermsHashPerField where I grab the current
term in add() (I hope this is the correct location for retrieving the
inverted lists terms). Anyway I can create the corret bitvectors and write
them into a text file.
Excerpt of bitVectors from article "April":
april :
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101110111111111111111111
never :
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000
ayriway :
0000000000000000000000000000000000000111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101110111111111111111111
inclusive :
1111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Next step would be storing all bitvecors in the index. At first glance I
like to use an extra field to store the created bitvectors permanent in the
index. It seems to be the easiest way for a first implementation without
accessing the low level functions of Lucene. Can I add a field after I
already started writing the document through IndexWriter? How would I do
this? Or are there any other suggestions for storing? Another idea is to
expand the index format of Lucene but this seems a little bit to difficult
for me. Maybe I could write these information into my own file. Could
anybody point me to the right direction? :)

Currently I am focusing on storing and try to extend Lucenes search after
the former step.

THX in advance & best regards
Alex

[1] http://cis.poly.edu/suel/
--
View this message in context:
http://lucene.472066.n3.nabble.com/Implementing-indexing-of-Versioned-Document-Collections-tp1872701p1872701.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Implementing indexing of Versioned Document Collections

Reply via email to