Re: Rewriting an index without losing 'hidden' data

Michael McCandless Fri, 08 Apr 2011 09:09:22 -0700

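Unfortunately, updateDocument replaces the *entire* previous document
with the new one: it deletes every document matching the term and then
adds the document you pass in.  Any field that is missing from that
document, such as an indexed-but-unstored field that the searcher
cannot hand back, is gone afterwards.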
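
To make the effect concrete, here is a small toy model in plain Java.
It is only a sketch and does not use Lucene's API: ToyIndex, Field,
matches and rewriteDemo are hypothetical names invented for this
illustration.  It assumes an index that keeps postings for every field
but retains the original text only for stored fields, and it repeats
the fetch / add field / update loop from the original post on a single
document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only. None of these names are Lucene's API.
class ToyIndex {
    // A field: name, text, and whether the original text is kept for retrieval.
    static final class Field {
        final String name;
        final String value;
        final boolean stored;

        Field(String name, String value, boolean stored) {
            this.name = name;
            this.value = value;
            this.stored = stored;
        }
    }

    // "field:token" -> ids of documents containing that token (the searchable part).
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc id -> stored fields only (the retrievable part).
    private final Map<Integer, List<Field>> storedFields = new HashMap<>();
    private int nextId = 0;

    // Every field is indexed; only fields flagged as stored are kept for retrieval.
    int addDocument(List<Field> doc) {
        int id = nextId++;
        List<Field> kept = new ArrayList<>();
        for (Field f : doc) {
            for (String token : f.value.toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(f.name + ":" + token, k -> new HashSet<>()).add(id);
            }
            if (f.stored) {
                kept.add(f);
            }
        }
        storedFields.put(id, kept);
        return id;
    }

    // Retrieval can only return what was stored.
    List<Field> document(int id) {
        return new ArrayList<>(storedFields.get(id));
    }

    // Delete the old document completely, then add the replacement.
    int updateDocument(int oldId, List<Field> replacement) {
        storedFields.remove(Integer.valueOf(oldId));
        for (Set<Integer> ids : postings.values()) {
            ids.remove(Integer.valueOf(oldId));
        }
        return addDocument(replacement);
    }

    // True if any live document contains the token in the given field.
    boolean matches(String field, String token) {
        Set<Integer> ids = postings.get(field + ":" + token);
        return ids != null && !ids.isEmpty();
    }

    // Repeats the rewrite loop from the original post on one document.
    static ToyIndex rewriteDemo() {
        ToyIndex index = new ToyIndex();
        List<Field> original = new ArrayList<>();
        original.add(new Field("unique-id", "42", true));
        original.add(new Field("body", "Lucene in Action", false)); // indexed, not stored
        int id = index.addDocument(original);

        List<Field> doc = index.document(id); // only "unique-id" comes back
        doc.add(new Field("newField", "x", true));
        index.updateDocument(id, doc);
        return index;
    }

    public static void main(String[] args) {
        ToyIndex index = rewriteDemo();
        System.out.println("unique-id:42 matches: " + index.matches("unique-id", "42"));
        System.out.println("body:lucene matches:  " + index.matches("body", "lucene"));
        System.out.println("newField:x matches:   " + index.matches("newField", "x"));
    }
}
```

After the rewrite the document is still found by unique-id and by
newField, but no longer by anything in body, because body never came
back from retrieval and so was never re-added.  A real index loses the
postings of its unstored fields in the same way, which would explain
the drop in index size.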


The ability to update a single indexed field (either replacing that
field entirely, or changing only certain token occurrences within it),
while leaving all other indexed fields in the document unaffected, has
been a long-requested, major missing feature in Lucene.  We call it
"incremental field updates".

There have been some healthy discussions on the dev list that have
worked out a good rough design (e.g., see
http://markmail.org/thread/lsfjhpiblzymkfcn).  Also, recent
improvements in how buffered deletes are handled should make it a lot
easier for updates to "piggyback" using that same packet stream
approach.  So... I think there is hope that some day we'll get this
into Lucene.

Mike

http://blog.mikemccandless.com

On Fri, Apr 8, 2011 at 11:00 AM, Ian Lea <ian....@gmail.com> wrote:
> Unfortunately you just can't do this.  Might be possible if all fields
> were stored but evidently they are not in your index.  For unstored
> fields, the Document object will not contain the data that was passed
> in when the doc was originally added.
>
> I believe there might be a way of recreating some of the missing data
> via TermFreqVector but that has always sounded dodgy and lossy to me.
>
> The safest way is to reindex, however painful it might be.  Maybe you
> could take the opportunity to upgrade lucene at the same time!
>
>
> --
> Ian.
>
>
> On Fri, Apr 8, 2011 at 3:44 PM, Chris Bamford
> <chris.bamf...@talktalk.net> wrote:
>> Hi,
>>
>> I recently discovered that I need to add a single field to every document in
>> an existing (very large) index.  Reindexing from scratch is not an option I
>> want to consider right now, so I wrote a utility to add the field by
>> rewriting the index - but this seemed to lose some of the fields (indexed,
>> but not stored?).  In fact, it shrank a 12GB index down to 4.2GB - clearly
>> not what I wanted.  :-)
>> What am I doing wrong?
>>
>> My technique was:
>>
>>  Analyzer analyser = new StandardAnalyzer();
>>  IndexSearcher searcher = new IndexSearcher(indexPath);
>>  IndexWriter indexWriter = new IndexWriter(indexPath, analyser);
>>  Hits hits = matchAllDocumentsFromIndex(searcher);
>>
>>  for (int i=0; i < hits.length(); i++) {
>>          Document doc = hits.doc(i);
>>          String id = doc.get("unique-id");
>>          doc.add(new Field("newField", newValue, Field.Store.YES,
>>                  Field.Index.UN_TOKENIZED));
>>          indexWriter.updateDocument(new Term("unique-id", id), doc);
>>  }
>>
>>  searcher.close();
>>  indexWriter.optimize();
>>  indexWriter.close();
>>
>> Note that my matchAllDocumentsFromIndex() does get the right number of hits 
>> from the index - i.e. the same number as held in the index.
>>
>>
>>  Thanks for any ideas!
>> BTW I am using Lucene 2.3.2.
>>
>> - Chris
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

