Re: Lucene index update

Erick Erickson Thu, 04 Jan 2007 07:40:27 -0800

for approach <2>, I *think* you can extract information about unstored data
by playing with TermDocs/TermEnums. Conceptually, the idea is to go through
all the terms and, for document (lucene ID) 1 find the terms that appear in
document 1 and order them by their termpositions. Repeat for document 2
etc.. WARNING: I haven't done this personally, so I may be way off base.
Also, if the Luke source code is available, that program reconstructs
unstored fields, so you might want to take a look at that. But this will
inevitably be a lossy process.


I have to ask, though, why you're rejecting option <4> so early. To modify a
document in an index, you have to delete the document and re-add it. Unless
the time to assemble the document to be re-indexed is prohibitive, I'd think
about this option more seriously.

Remember that not all documents in an index have to have the same fields.
So, if you're only adding meta-data to *some* documents, you can freely just
change those docs and ignore all the others. Your searcher does have to deal
with null fields in some documents in this case.

If you're going to re-index all documents, I'd question whether playing
around with re-constructing them from the index is really a speed savings.
Again, it will be if the time to assemble each document is large (say lots
of DB queries or web crawling). But even in this case, is it the difference
between 1 hour and 3 hours? or 1 day and two weeks? If you can get it built
in a night, I'd do it the simple way.

How long did it take to create the index originally anyway?

Best
Erick

On 1/4/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:


Hi All,



I want to update some documents in existing indexes by adding a new field
to
each of their documents. The documents contained in the indexes have some
fields that are indexed and NOT stored. The new field that will be added
will contain some metadata and will be Stored and not indexed.



To fulfill the task I think I have 4 possible approaches:



1. To use Lucene API to do this. According to my research I found that
Lucene does not have methods for updating documents directly.



2. To extract documents including their fields (including the most
important
fields - those that are indexed but not stored). Then to add the new field
with relevant value to these documents, delete them from the index and
finally add them again in the index (including their new field).

    This is the approach recommended in the book "Lucene in Action" and in
the FAQ page on the Lucene's site - delete documents and add them again.

    The problem here is that according to Lucene's documentation "fields
which are not stored are not available in documents retrieved from the
index, e.g. with Hits.doc(int), Searcher.doc(int) or
IndexReader.document(int)". (The tests showed the same :) ). So if this
approach could work the problem here is: How to extract information of the
fields that are indexed but not stored?



3. To create a tool that changes the data stored in the indexes. This
approach seems to be too "dangerous" but if anyone knows the format that
Lucene uses to keep these data please tell me.



4. To reindex - means to delete existing indexes and create them in new by
adding the necessary fields. As the indexes could be very large creating
them in new will be too slow process and so this approach is least
desirable. But I am afraid it is the only one that is practically
possible.



If someone could help me with the approaches 1-3 or tell me some other
please help me :)



Thanks in advance,

Ivan




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Lucene index update

Reply via email to