Hello Pulkit,

Thank you for your answer, and excuse my late reply. I am currently
working on the payload part and have implemented my own Analyzer and
TokenFilter for adding custom payloads. As far as I understand, I can add
a payload for every term occurrence, and it is written into the posting
list. My posting list now looks like this:

car -> DocID1 [Payload 1], DocID2 [Payload 2], ..., DocIDN [Payload N]

where each payload is a BitSet encoding in which versions of the document
the term occurs. I must admit the index is getting really big at the
moment, because each payload adds around 8 to 16 bytes, so I still have to
find a good compression for the bitvectors.
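
To make this concrete, here is a stripped-down sketch of the kind of
TokenFilter I mean, written against the Lucene 3.x TermAttribute /
PayloadAttribute API (the map name and the byte conversion are simplified
placeholders rather than my exact code):

import java.io.IOException;
import java.util.BitSet;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

public final class VersionPayloadFilter extends TokenFilter {

  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final Map<String, BitSet> versionBits;  // term -> versions containing it

  public VersionPayloadFilter(TokenStream input, Map<String, BitSet> versionBits) {
    super(input);
    this.versionBits = versionBits;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    BitSet bits = versionBits.get(termAtt.term());
    if (bits != null) {
      // Attach the version bitvector of this term to the current occurrence.
      payloadAtt.setPayload(new Payload(toBytes(bits)));
    }
    return true;
  }

  // Naive serialization: one byte per 8 versions, trailing zero bytes dropped
  // (BitSet.length() is the index of the highest set bit + 1).
  private static byte[] toBytes(BitSet bits) {
    byte[] out = new byte[(bits.length() + 7) / 8];
    for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
      out[i / 8] |= (byte) (1 << (i % 8));
    }
    return out;
  }
}

Even this naive trimming of trailing zero bytes only helps for terms that
drop out in later versions, so I am still looking for a proper bitvector
compression (gaps or run-length encoding, for example).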

Furthermore, I always get the error
org.apache.lucene.index.CorruptIndexException: checksum mismatch in
segments file when I use my own Analyzer. After I disable the checksum
test, everything works fine, and even Luke does not report an error. Any
ideas?

Another problem is the bitvector creation during tokenization. I build my
bitvectors (stored in a HashMap) while running through all versions in the
tokenizing step, so they are only completely built after the last field
has been analyzed (I added every Wikipedia version as its own field).
Therefore I would need to add the payloads after the tokenizing step. Is
this possible? And what happens if I add a payload for a term now and add
another payload for the same term later? Is the first one overwritten, or
are the payloads appended?
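
The only workaround I can think of so far is a two-pass approach, roughly
like the sketch below (again with placeholder names, not my actual code):
first run a plain tokenization over all versions just to fill the
term -> BitSet map, and only then index the document, so the payload
filter above can look up the finished bitvectors. I would like to avoid
tokenizing everything twice, though.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class VersionedArticleIndexer {

  // Pass 1: tokenize every revision once, only to record per term in which
  // versions it occurs.
  static Map<String, BitSet> buildVersionBits(List<String> versions) throws IOException {
    Map<String, BitSet> versionBits = new HashMap<String, BitSet>();
    for (int v = 0; v < versions.size(); v++) {
      TokenStream ts = new WhitespaceTokenizer(new StringReader(versions.get(v)));
      TermAttribute term = ts.addAttribute(TermAttribute.class);
      while (ts.incrementToken()) {
        BitSet bits = versionBits.get(term.term());
        if (bits == null) {
          bits = new BitSet(versions.size());
          versionBits.put(term.term(), bits);
        }
        bits.set(v);
      }
      ts.close();
    }
    return versionBits;
  }

  // Pass 2: the bitvectors are now complete, so the document (one field per
  // version) can be indexed with an analyzer that attaches them as payloads.
  static void indexArticle(List<String> versions, IndexWriter writer) throws IOException {
    final Map<String, BitSet> versionBits = buildVersionBits(versions);

    Analyzer payloadAnalyzer = new Analyzer() {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new VersionPayloadFilter(new WhitespaceTokenizer(reader), versionBits);
      }
    };

    Document doc = new Document();
    for (int v = 0; v < versions.size(); v++) {
      doc.add(new Field("version" + v, versions.get(v), Field.Store.NO, Field.Index.ANALYZED));
    }
    // Per-document analyzer, so the finished map reaches the filter.
    writer.addDocument(doc, payloadAnalyzer);
  }
}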

Greetings
Alex