Re: Implementing indexing of Versioned Document Collections

Pulkit Singhal Wed, 10 Nov 2010 06:30:41 -0800

1) You can attach byte array "Payloads" for every occurrence of a term
during indexing. It will be stored at each term position, during indexing,
and then
can be retrieved during searching. You may want to consider taking this
approach rather than writing bitvectors to a text file. If you feel that I
should have read your thesis summary more closely and don't know what the
heck I'm talking about ... then I politely yield :)


2) Can I add a field after I already started writing the document through
IndexWriter?
Yes, you can. You can add new documents that have more fields than the ones
you added in the past. You can also "update" (internally it does a delete
then add) your document to have more fields at a later point in time.
Although in your use-case I simply didn't see why you would need to go back
and add more fields.

3) How would I do this?
a) Well if you are simply adding documents, keep adding new ones with more
fields and lucene won't complain. If you use stored fields then you may
notice inconsistencies when you pull documents in your search results
because some documents will have additional stored fields while others will
not... I am not sure how you would want to handle that.
b) If you are updating, then my personal approach would be to fetch the
document via a search that is unique enough to just get you that one doc
back. In order to do this, people usually make sure to store a unique id
when indexing the document. Then using that document, create a new one with
all the values of the old one plus the new fields you want to add. Deleted
the old doc. Add the new doc.

Good luck!

- Pulkit

On Tue, Nov 9, 2010 at 5:30 PM, Alex vB <m...@avomberg.de> wrote:

>
> Hello everybody,
>
> I would like to implement the paper "Compact Full-Text Indexing of
> Versioned
> Document Collections" [1] from Torsten Suel for my diploma thesis in
> Lucene.
> The basic idea is to create a two-level index structure. On the first level
> a document is identified by document ID with a posting list entry if the
> term exists at least in one version. For every posting on the first level
> with term t we have a bitvector on the second one. These bitvectors contain
> as many bits as there are versions for one document, and bit i is set to 1
> if version i contains term t or otherwise it remains 0.
>
> http://lucene.472066.n3.nabble.com/file/n1872701/Unbenannt_1.jpg
>
> This little picture is just for demonstration purposes. It shows a posting
> list for the term car and is composed of 4 document IDs. If a hit is found
> in document 6 another look-up is needed on the second level to get the
> corresponding versions (version 1, 5, 7, 8, 9, 10 from 10 versions at all).
>
> At the moment I am using wikipedia (simplewiki dump) as source with a
> SAXParser and can resolve each document with all its versions from the XML
> file (Fields are Title, ID, Content(seperated for each version)). My
> problem
> is that I am unsure how to connect the second level with the first one and
> how to store it. The key points that are needed:
> - Information from posting list creation to create the bitvector (term ->
> doc -> versions)
> - Storing the bitvectors
> - Implementing search on second level
>
> For the first steps I disabled term frequencies and positions because the
> paper isn't handling them. I would be happy to get any running version at
> all. :)
> At the moment I can create bitvectors for the documents. I realized this
> with a HashMap<String, BitSet> in TermsHashPerField where I grab the
> current
> term in add() (I hope this is the correct location for retrieving the
> inverted lists terms). Anyway I can create the corret bitvectors and write
> them into a text file.
> Excerpt of bitVectors from article "April":
> april :
>
> 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101110111111111111111111
> never :
>
> 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000
> ayriway :
>
> 0000000000000000000000000000000000000111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101110111111111111111111
> inclusive :
>
> 1111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
>
> Next step would be storing all bitvecors in the index. At first glance I
> like to use an extra field to store the created bitvectors permanent in the
> index. It seems to be the easiest way for a first implementation without
> accessing the low level functions of Lucene. Can I add a field after I
> already started writing the document through IndexWriter? How would I do
> this? Or are there any other suggestions for storing? Another idea is to
> expand the index format of Lucene but this seems a little bit to difficult
> for me. Maybe I could write these information into my own file. Could
> anybody point me to the right direction? :)
>
> Currently I am focusing on storing and try to extend Lucenes search after
> the former step.
>
> THX in advance & best regards
> Alex
>
> [1] http://cis.poly.edu/suel/
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Implementing-indexing-of-Versioned-Document-Collections-tp1872701p1872701.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

Re: Implementing indexing of Versioned Document Collections

Reply via email to