Re: Lucene index update

Erick Erickson Fri, 05 Jan 2007 05:26:20 -0800

Try http://www.getopt.org/luke/ or just google lucene luke.


If you're going to be playing with the indexes, I really, really, really
recommend that you get a copy of Luke and get familiar with it. It allows
you to examine your indexes, explains queries, shows the effects of
different analyzers on queries, etc. It's invaluable. And it's free <G>.

Best
Erick

On 1/5/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:


Thanks a lot for your help Erick,

I wrote a simple class that uses your idea of extracting data that is
indexed but not stored by using the TermDocs/TermEnums classes. It works
properly.
Yes, I also think that in some cases re-indexing the documents will be
faster. It depends on the documents being updated: if, for example, they
are short and consist almost entirely of distinct words, re-indexing will
be faster, but if they are large and contain relatively few distinct
words, the TermDocs/TermEnums trick will be faster. This is also the case
when the documents have to be gathered over a network, as you mentioned.
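
A minimal sketch of the reconstruction idea discussed above, written in plain Java rather than against the Lucene API. It assumes the postings have already been read into a map of term -> (document ID -> positions); in Lucene of this era that data would presumably come from walking the term enumeration and each term's positions. The class name, the map layout, the sample data and the "_" gap marker are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: not part of Lucene and not Ivan's actual class.
class UnstoredFieldSketch {

    // Rebuilds the token sequence of one document from term-centric postings:
    // term -> (docId -> positions of that term in the document).
    // Positions with no surviving term (e.g. dropped stop words) become "_".
    static List<String> reconstruct(
            Map<String, Map<Integer, List<Integer>>> postings, int docId) {
        // position -> term, kept sorted by position
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, List<Integer>>> e : postings.entrySet()) {
            List<Integer> positions = e.getValue().get(docId);
            if (positions == null) {
                continue; // this term does not occur in the document
            }
            for (int pos : positions) {
                // If two terms share a position, the later one wins here.
                byPosition.put(pos, e.getKey());
            }
        }
        List<String> tokens = new ArrayList<>();
        int expected = 0;
        for (Map.Entry<Integer, String> e : byPosition.entrySet()) {
            while (expected < e.getKey()) {
                tokens.add("_"); // gap: the original token was not indexed
                expected++;
            }
            tokens.add(e.getValue());
            expected++;
        }
        return tokens;
    }

    // Made-up postings for two tiny documents (IDs 0 and 1).
    // Doc 0 was "the quick fox" with "the" removed as a stop word.
    // Doc 1 was "fox hunts fox".
    static Map<String, Map<Integer, List<Integer>>> samplePostings() {
        Map<String, Map<Integer, List<Integer>>> p = new HashMap<>();
        p.put("quick", Map.of(0, List.of(1)));
        p.put("fox", Map.of(0, List.of(2), 1, List.of(0, 2)));
        p.put("hunts", Map.of(1, List.of(1)));
        return p;
    }

    public static void main(String[] args) {
        System.out.println(reconstruct(samplePostings(), 0));
        System.out.println(reconstruct(samplePostings(), 1));
    }
}
```

The sketch also shows why the process is lossy: only the analyzed tokens come back, so removed stop words leave gaps and the original casing and punctuation are gone.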

My task is not to update particular indexes but to create a tool that
does it. In this tool I can include both classes and let the user choose
which approach to use, with advice on which approach is better in which
cases.

Thank you again for your help :).

By the way, could you give me a link where I can read about the Luke
program you mentioned, the one that reconstructs unstored fields?

Best Regards,
Ivan

----- Original Message -----
From: "Erick Erickson" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Thursday, January 04, 2007 5:40 PM
Subject: Re: Lucene index update


> For approach <2>, I *think* you can extract information about unstored
> data by playing with TermDocs/TermEnums. Conceptually, the idea is to go
> through all the terms and, for document (Lucene ID) 1, find the terms
> that appear in document 1 and order them by their term positions. Repeat
> for document 2, etc. WARNING: I haven't done this personally, so I may be
> way off base. Also, if the Luke source code is available, that program
> reconstructs unstored fields, so you might want to take a look at that.
> But this will inevitably be a lossy process.
>
> I have to ask, though, why you're rejecting option <4> so early. To
> modify a document in an index, you have to delete the document and
> re-add it. Unless the time to assemble the document to be re-indexed is
> prohibitive, I'd think about this option more seriously.
>
> Remember that not all documents in an index have to have the same
> fields. So, if you're only adding meta-data to *some* documents, you can
> freely just change those docs and ignore all the others. Your searcher
> does have to deal with null fields in some documents in this case.
>
> If you're going to re-index all documents, I'd question whether playing
> around with reconstructing them from the index is really a speed
> savings. Again, it will be if the time to assemble each document is large
> (say lots of DB queries or web crawling). But even in this case, is it
> the difference between 1 hour and 3 hours? Or 1 day and two weeks? If you
> can get it built in a night, I'd do it the simple way.
>
> How long did it take to create the index originally anyway?
>
> Best
> Erick
>
> On 1/4/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:
>>
>> Hi All,
>>
>>
>>
>> I want to update some documents in existing indexes by adding a new
>> field to each of their documents. The documents contained in the
>> indexes have some fields that are indexed and NOT stored. The new field
>> that will be added will contain some metadata and will be stored and not
>> indexed.
>>
>>
>>
>> To fulfill the task I think I have 4 possible approaches:
>>
>>
>>
>> 1. To use the Lucene API to do this. My research suggests that Lucene
>> does not have methods for updating documents directly.
>>
>>
>>
>> 2. To extract the documents including their fields (including the most
>> important fields - those that are indexed but not stored). Then to add
>> the new field with the relevant value to these documents, delete them
>> from the index and finally add them to the index again (including their
>> new field).
>>
>>     This is the approach recommended in the book "Lucene in Action" and
>> in the FAQ page on Lucene's site - delete documents and add them again.
>>
>>     The problem here is that according to Lucene's documentation "fields
>> which are not stored are not available in documents retrieved from the
>> index, e.g. with Hits.doc(int), Searcher.doc(int) or
>> IndexReader.document(int)". (The tests showed the same :) ). So if this
>> approach is to work, the problem is: how to extract the information in
>> the fields that are indexed but not stored?
>>
>>
>>
>> 3. To create a tool that changes the data stored in the indexes. This
>> approach seems to be too "dangerous" but if anyone knows the format that
>> Lucene uses to keep these data please tell me.
>>
>>
>>
>> 4. To reindex - that is, to delete the existing indexes and create them
>> anew with the necessary fields added. As the indexes could be very
>> large, creating them anew would be too slow a process, so this approach
>> is the least desirable. But I am afraid it is the only one that is
>> practically possible.
>>
>>
>>
>> If someone could help me with approaches 1-3 or suggest some other
>> approach, please do :)
>>
>>
>>
>> Thanks in advance,
>>
>> Ivan
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>

