Hello everybody,

I would like to implement the paper "Compact Full-Text Indexing of Versioned Document Collections" [1] by Torsten Suel in Lucene for my diploma thesis. The basic idea is to create a two-level index structure. On the first level, a document is identified by its document ID and gets a posting list entry for term t if t occurs in at least one of its versions. For every first-level posting for term t there is a bitvector on the second level. Each bitvector contains as many bits as the document has versions; bit i is set to 1 if version i contains term t and stays 0 otherwise.
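
To make the structure concrete, here is a minimal in-memory sketch in plain Java without any Lucene classes. The class name TwoLevelIndex and its methods are made up for illustration, versions are counted from 0 (bit positions), and nothing here is compressed the way the paper describes. It only shows the first-level lookup (term -> documents) followed by the second-level lookup (document -> versions).

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the two-level structure; all names are hypothetical. */
class TwoLevelIndex {
    // First level: term -> (docId -> second-level bitvector), doc IDs kept sorted.
    private final Map<String, TreeMap<Integer, BitSet>> postings = new HashMap<>();

    /** Records that the given version (0-based) of docId contains term. */
    void add(String term, int docId, int version) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(docId, d -> new BitSet())
                .set(version);
    }

    /** First-level lookup: documents with at least one version containing term. */
    List<Integer> documents(String term) {
        TreeMap<Integer, BitSet> docs = postings.get(term);
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs.keySet());
    }

    /** Second-level lookup: 0-based versions of docId that contain term. */
    List<Integer> versions(String term, int docId) {
        List<Integer> result = new ArrayList<>();
        TreeMap<Integer, BitSet> docs = postings.get(term);
        BitSet bits = docs == null ? null : docs.get(docId);
        if (bits != null) {
            // Walk only the set bits of the bitvector.
            for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TwoLevelIndex index = new TwoLevelIndex();
        index.add("car", 6, 0);
        index.add("car", 6, 4);
        index.add("car", 3, 1);
        System.out.println(index.documents("car"));   // first level
        System.out.println(index.versions("car", 6)); // second level
    }
}
```

In a real index both levels would live on disk; the nested map only stands in for the term -> doc -> versions relationship.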

http://lucene.472066.n3.nabble.com/file/n1872701/Unbenannt_1.jpg

This little picture is just for demonstration purposes. It shows a posting list for the term "car" that consists of 4 document IDs. If a hit is found in document 6, another look-up on the second level is needed to get the corresponding versions (versions 1, 5, 7, 8, 9 and 10 out of 10 versions in total).

At the moment I am using Wikipedia (the simplewiki dump) as the source. With a SAXParser I can extract each document with all its versions from the XML file (the fields are Title, ID and Content, with the content separated for each version).

My problem is that I am unsure how to connect the second level with the first one and how to store it. The key points that are needed:

- Information from posting list creation to create the bitvector (term -> doc -> versions)
- Storing the bitvectors
- Implementing search on the second level

For the first steps I disabled term frequencies and positions because the paper doesn't handle them. I would be happy to get any running version at all. :)

At the moment I can create bitvectors for the documents. I did this with a HashMap<String, BitSet> in TermsHashPerField, where I grab the current term in add() (I hope this is the correct location for retrieving the terms of the inverted lists). Anyway, I can create the correct bitvectors and write them into a text file.
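
Stripped of the Lucene internals, the bitvector creation boils down to something like the following sketch. It is not the code inside TermsHashPerField: the class VersionBitVectors and its methods are hypothetical, and the simple lowercase/split tokenization is an assumption standing in for whatever Analyzer is actually used. The render() helper produces one 0/1 character per version.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: per-document bitvectors built from the texts of all versions (hypothetical names). */
class VersionBitVectors {
    /** Builds term -> bitvector for one document; bit i is set if version i contains the term. */
    static Map<String, BitSet> build(List<String> versions) {
        Map<String, BitSet> vectors = new HashMap<>();
        for (int v = 0; v < versions.size(); v++) {
            // Assumption: lowercase and split on non-letters instead of using a Lucene Analyzer.
            for (String token : versions.get(v).toLowerCase().split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    vectors.computeIfAbsent(token, t -> new BitSet(versions.size())).set(v);
                }
            }
        }
        return vectors;
    }

    /** Renders a bitvector as a 0/1 string with one character per version. */
    static String render(BitSet bits, int versionCount) {
        StringBuilder sb = new StringBuilder(versionCount);
        for (int i = 0; i < versionCount; i++) {
            sb.append(bits.get(i) ? '1' : '0');
        }
        return sb.toString();
    }
}
```

Calling build() once per article and render() once per term gives one line of output per term.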

Excerpt of the bitvectors from the article "April":

april : 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101110111111111111111111
never : 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000
ayriway : 0000000000000000000000000000000000000111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101110111111111111111111
inclusive : 1111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

The next step would be storing all bitvectors in the index. At first glance I would like to use an extra field to store the created bitvectors permanently in the index. This seems to be the easiest way for a first implementation without touching the low-level functions of Lucene. Can I add a field after I have already started writing the document through IndexWriter? How would I do this? Or are there any other suggestions for storing them?

Another idea is to extend Lucene's index format, but this seems a little too difficult for me. Maybe I could write this information into my own file. Could anybody point me in the right direction? :)

Currently I am focusing on storing, and I will try to extend Lucene's search once that step is done.

Thanks in advance & best regards
Alex

[1] http://cis.poly.edu/suel/

--
View this message in context: http://lucene.472066.n3.nabble.com/Implementing-indexing-of-Versioned-Document-Collections-tp1872701p1872701.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.