Hi,

thanks for the link - indeed, https://issues.apache.org/jira/browse/LUCENE-7171 /
https://github.com/apache/lucene/issues/8226 seems to be the issue here.

> Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))` query
> during update and see if it returns the correct ans?

That was exactly it - the query didn't return anything, which is why the
update was failing.
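For the record, here's a tiny plain-Java illustration (no Lucene involved; the
splitting rule below is a made-up stand-in for an analyzer, not Lucene's actual
tokenizer) of why an exact term lookup finds nothing once the id value has been
re-indexed as a tokenized field:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizedIdMismatch {
    // Hypothetical stand-in for an analyzer that splits on non-alphanumerics.
    static List<String> tokenize(String value) {
        return Arrays.asList(value.split("[^A-Za-z0-9]+"));
    }

    public static void main(String[] args) {
        // The composite id "flags-1-1" gets broken into separate tokens...
        List<String> indexedTokens = tokenize("flags-1-1");
        System.out.println(indexedTokens); // [flags, 1, 1]
        // ...so an exact, untokenized lookup of the full value matches nothing.
        System.out.println(indexedTokens.contains("flags-1-1")); // false
    }
}
```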
However, in the end I took a step back: instead of matching on the string ID
field (constructed from the mailbox-id and message-id, which are stored
anyway), I reused the query that is originally used to find the Document to
update, and switched from the term-based update call
(`org.apache.lucene.index.IndexWriter#updateDocument()`) to the query-based
one (`org.apache.lucene.index.IndexWriter#updateDocuments()`), so the same
query is used in both places... and it works :) Another benefit is one less
field stored in the Document :)
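Roughly, the change boils down to something like this - a sketch only, not the
exact James code (it assumes a Lucene 9.x `IndexWriter#updateDocuments(Query,
Iterable)` overload and elides the flags/uid field rebuilding):

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

class QueryBasedUpdate {
    void update(IndexWriter writer, Query query) throws IOException {
        try (IndexReader reader = DirectoryReader.open(writer)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc sDoc : searcher.search(query, 100000).scoreDocs) {
                Document doc = searcher.doc(sDoc.doc);
                // ... rebuild the flags and uid fields as before ...
                // before: delete-by-term on the stored id value, which failed
                // once the re-added field came back tokenized:
                //   writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
                // after: delete by the very query that found the document:
                writer.updateDocuments(query, List.of(doc));
            }
        }
    }
}
```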
On 2024-08-11T02:20:38.000+02:00, Gautam Worah
<[email protected]> wrote:
> I'm confused as to what could be happening.
>
> Google led me to this StackOverflow link:
> https://stackoverflow.com/questions/36402235/lucene-stringfield-gets-tokenized-when-doc-is-retrieved-and-stored-again
> which references some longstanding old issues about fields changing their
> "types" and so on.
>
> The docs mention: `NOTE: only the content of a field is returned if that
> field was stored during indexing. Metadata like boost, omitNorm,
> IndexOptions, tokenized, etc., are not preserved.`
>
> Can you check what `doc.get(ID_FIELD)` returns, and if it looks right?
> Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))` query
> during update and see if it returns the correct ans?
>
> If the value is not right, perhaps you may have to use the original stored
> value:
> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/search/IndexSearcher.html#storedFields()
> for crafting the `updateDocument()` call.
>
> Best,
> Gautam Worah.
>
> On Sat, Aug 10, 2024 at 3:12 PM Wojtek <[email protected]> wrote:
>
>
>> Hi,
>>
>> thank you for the reply, and apologies for being somewhat "all over the
>> place".
>>
>> Regarding "tokenization" - should it happen if I use StringField? When
>> the document is created (before writing) I see in the debugger that it's
>> not tokenized and is of type StringField:
>>
>> ```
>> doc = {Document@4830} "Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
>>   fields = {ArrayList@5920} size = 1
>>     0 = {StringField@5922} "stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>> ```
>>
>> But once in the update method (with the document retrieved from the
>> index) I see it has changed to StoredField and is already "tokenized":
>>
>> ```
>> doc = {Document@6526} "Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG> docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
>>   fields = {ArrayList@6548} size = 6
>>     0 = {StoredField@6550} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>     1 = {StoredField@6551} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
>>     2 = {StringField@6552} "stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
>>     3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
>>     4 = {LongPoint@6554} "LongPoint <uid:1>"
>>     5 = {StoredField@6555} "stored<uid:1>"
>> ```
>>
>> The code that adds the documents is a method implemented in James,
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#add`
>> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240),
>> and it looks fairly straightforward:
>>
>> ```
>> public Mono<Void> add(MailboxSession session, Mailbox mailbox, MailboxMessage membership) {
>>     return Mono.fromRunnable(Throwing.runnable(() -> {
>>         Document doc = createMessageDocument(session, membership);
>>         Document flagsDoc = createFlagsDocument(membership);
>>         writer.addDocument(doc);
>>         writer.addDocument(flagsDoc);
>>     }));
>> }
>> ```
>>
>> as does the method that actually creates the flags document
>> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290):
>>
>> ```
>> private Document createFlagsDocument(MailboxMessage message) {
>>     Document doc = new Document();
>>     doc.add(new StringField(ID_FIELD,
>>         "flags-" + message.getMailboxId().serialize() + "-" + Long.toString(message.getUid().asLong()),
>>         Store.YES));
>>     doc.add(new StringField(MAILBOX_ID_FIELD, message.getMailboxId().serialize(), Store.YES));
>>     doc.add(new NumericDocValuesField(UID_FIELD, message.getUid().asLong()));
>>     doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
>>     doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
>>     indexFlags(doc, message.createFlags());
>>     return doc;
>> }
>> ```
>>
>> As you can see, `StringField` is used when creating the document, and to
>> the best of my knowledge (and based on what I was told) it _should_ not
>> be tokenized (?).
>>
>> The update (in which the document can't be updated, because the Term
>> doesn't seem to find it) is done in
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#update()`
>> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259):
>>
>> ```
>> private void update(MailboxId mailboxId, MessageUid uid, Flags f) throws IOException {
>>     try (IndexReader reader = DirectoryReader.open(writer)) {
>>         IndexSearcher searcher = new IndexSearcher(reader);
>>         BooleanQuery.Builder queryBuilder = new BooleanQuery.Builder();
>>         queryBuilder.add(new TermQuery(new Term(MAILBOX_ID_FIELD, mailboxId.serialize())), BooleanClause.Occur.MUST);
>>         queryBuilder.add(createQuery(MessageRange.one(uid)), BooleanClause.Occur.MUST);
>>         queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD, "")), BooleanClause.Occur.MUST);
>>         TopDocs docs = searcher.search(queryBuilder.build(), 100000);
>>         ScoreDoc[] sDocs = docs.scoreDocs;
>>         for (ScoreDoc sDoc : sDocs) {
>>             Document doc = searcher.doc(sDoc.doc);
>>             doc.removeFields(FLAGS_FIELD);
>>             indexFlags(doc, f);
>>             // somehow the document retrieved from the search lost the DocValues
>>             // data for the uid field; we need to re-define the field with proper DocValues
>>             long uidValue = doc.getField("uid").numericValue().longValue();
>>             doc.removeField("uid");
>>             doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
>>             doc.add(new LongPoint(UID_FIELD, uidValue));
>>             doc.add(new StoredField(UID_FIELD, uidValue));
>>             writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
>>         }
>>     }
>> }
>> ```
>>
>> I was wondering if the Lucene/writer configuration could be the culprit
>> (which would explain tokenizing even a StringField), but it looks fairly
>> straightforward:
>>
>> ```
>> this.directory = directory;
>> this.writer = new IndexWriter(this.directory, createConfig(createAnalyzer(lenient), dropIndexOnStart));
>> ```
>>
>> where createConfig looks like this:
>>
>> ```
>> protected IndexWriterConfig createConfig(Analyzer analyzer, boolean dropIndexOnStart) {
>>     IndexWriterConfig config = new IndexWriterConfig(analyzer);
>>     if (dropIndexOnStart) {
>>         config.setOpenMode(OpenMode.CREATE);
>>     } else {
>>         config.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>     }
>>     return config;
>> }
>> ```
>>
>> and createAnalyzer like this:
>>
>> ```
>> protected Analyzer createAnalyzer(boolean lenient) {
>>     if (lenient) {
>>         return new LenientImapSearchAnalyzer();
>>     } else {
>>         return new StrictImapSearchAnalyzer();
>>     }
>> }
>> ```
>>
>> On 2024-08-10T21:04:15.000+02:00, Gautam Worah
>> <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I don't think I understand the email well but I'll try my best.
>
> I'm confused as to what could be happening.
>
> Google led me to this StackOverflow link:
>
>
>https://stackoverflow.com/questions/36402235/lucene-stringfield-gets-tokenized-when-doc-is-retrieved-and-stored-again
>
> which references some longstanding old issues about fields changing
> their
>
> "types" and so on.
>
> The docs mention: `NOTE: only the content of a field is returned if
> that
>
> field was stored during indexing. Metadata like boost, omitNorm,
>
> IndexOptions, tokenized, etc., are not preserved.`
>
> Can you check what `doc.get(ID_FIELD)` returns, and if it looks
> right?
>
> Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))`
> query
>
> during update and see if it returns the correct ans?
>
> If the value is not right, perhaps you may have to use the original
> stored
>
> value:
>
>
>https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/search/IndexSearcher.html#storedFields()
>
> for crafting the `updateDocument()` call..
>
> Best,
>
> Gautam Worah.
>
> On Sat, Aug 10, 2024 at 3:12 PM Wojtek <[email protected]> wrote:
>
>> Hi,
>>
>> thank you for reply and apologies for being somewhat "all over
>> the
>>
>> place".
>>
>> Regarding "tokenization" - should it happen if I use StringField?
>>
>> When the document is created (before writing) i see in the
>> debugger
>>
>> it's not tokenized and is of type StringField:
>>
>> ```
>>
>> doc = {Document@4830}
>>
>>>> "Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
>>
>> fields = {ArrayList@5920} size = 1
>>
>> 0 = {StringField@5922}
>>
>>> "stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>
>> ```
>>
>> But once in the update method (document being retrieved) I see it
>>
>> changes to StoredField and is already "tokenized":
>>
>> ```
>>
>> doc = {Document@6526}
>>
>>>
>>>"Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>>>
>>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>>>
>>> stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>
>>>
>>> docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
>>
>> fields = {ArrayList@6548} size = 6
>>
>> 0 = {StoredField@6550}
>>
>>> "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>
>> 1 = {StoredField@6551}
>>
>>> "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
>>
>> 2 = {StringField@6552}
>>
>>> "stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
>>
>> 3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
>>
>> 4 = {LongPoint@6554} "LongPoint <uid:1>"
>>
>> 5 = {StoredField@6555} "stored<uid:1>"
>>
>> ```
>>
>> The code that adds the documents - it's a method implemented in
>> James:
>>
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex
>>
>>[http://org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex]#add`
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240
>>
>> ) that looks fairly straightforward:
>>
>> ```
>>
>> public Mono<Void> add(MailboxSession session, Mailbox mailbox,
>>
>> MailboxMessage membership) {
>>
>> return Mono.fromRunnable(Throwing.runnable(() -> {
>>
>> Document doc = createMessageDocument(session,
>>
>> membership);
>>
>> Document flagsDoc = createFlagsDocument(membership);
>>
>> writer.addDocument(doc);
>>
>> writer.addDocument(flagsDoc);
>>
>> }));
>>
>> }
>>
>> ```
>>
>> similarly to actual method that creates the flags
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290
>>
>> ):
>>
>> ```
>>
>> private Document createFlagsDocument(MailboxMessage message) {
>>
>> Document doc = new Document();
>>
>> doc.add(new StringField(ID_FIELD, "flags-" +
>>
>> message.getMailboxId().serialize() + "-" +
>>
>> Long.toString(message.getUid().asLong()), Store.YES));
>>
>> doc.add(new StringField(MAILBOX_ID_FIELD,
>>
>> message.getMailboxId().serialize(), Store.YES));
>>
>> doc.add(new NumericDocValuesField(UID_FIELD,
>>
>> message.getUid().asLong()));
>>
>> doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
>>
>> doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
>>
>> indexFlags(doc, message.createFlags());
>>
>> return doc;
>>
>> }
>>
>> ```
>>
>> As you can see `StringField` is used when creating the document
>> and to
>>
>> the best of my knowledge and based on what I was told - it
>> _should_
>>
>> not be tokenized (?).
>>
>> Update (in which the document can't be updated because Term seems
>> to
>>
>> be not finding it) is done in
>>
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex
>>
>>[http://org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex]#update()`
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259
>>
>> ):
>>
>> ```
>>
>> private void update(MailboxId mailboxId, MessageUid uid, Flags f)
>>
>> throws IOException {
>>
>> try (IndexReader reader = DirectoryReader.open(writer)
>> [http://DirectoryReader.open(writer)]) {
>>
>> IndexSearcher searcher = new IndexSearcher(reader);
>>
>> BooleanQuery.Builder queryBuilder = new
>>
>> BooleanQuery.Builder();
>>
>> queryBuilder.add(new TermQuery(new
>>
>> Term(MAILBOX_ID_FIELD, mailboxId.serialize())),
>>
>> BooleanClause.Occur.MUST);
>>
>> queryBuilder.add(createQuery(MessageRange.one(uid)
>> [http://MessageRange.one(uid)]),
>>
>> BooleanClause.Occur.MUST);
>>
>> queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD,
>>
>> "")), BooleanClause.Occur.MUST);
>>
>> TopDocs docs = searcher.search(queryBuilder.build
>> [http://searcher.search(queryBuilder.build](),
>>
>> 100000);
>>
>> ScoreDoc[] sDocs = docs.scoreDocs;
>>
>> for (ScoreDoc sDoc : sDocs) {
>>
>> Document doc = searcher.doc(sDoc.doc);
>>
>> doc.removeFields(FLAGS_FIELD);
>>
>> indexFlags(doc, f);
>>
>> // somehow the document getting from the search
>>
>> lost DocValues data for the uid field, we need to re-define the
>> field
>>
>> with proper DocValues.
>>
>> long uidValue =
>>
>> doc.getField("uid").numericValue().longValue();
>>
>> doc.removeField("uid");
>>
>> doc.add(new NumericDocValuesField(UID_FIELD,
>>
>> uidValue));
>>
>> doc.add(new LongPoint(UID_FIELD, uidValue));
>>
>> doc.add(new StoredField(UID_FIELD, uidValue));
>>
>> writer.updateDocument(new Term(ID_FIELD,
>>
>> doc.get(ID_FIELD)), doc);
>>
>> }
>>
>> }
>>
>> }
>>
>> ```
>>
>> I was wondering if Lucene/writer configuration is not a culprit
>> (that
>>
>> would result in tokenizing even StringField) but it looks fairly
>>
>> straightforward:
>>
>> ```
>>
>> this.directory [http://this.directory] = directory;
>>
>> this.writer = new IndexWriter(this.directory
>> [http://this.directory],
>>
>> createConfig(createAnalyzer(lenient), dropIndexOnStart));
>>
>> ```
>>
>> where createConfig looks like this:
>>
>> ```
>>
>> protected IndexWriterConfig createConfig(Analyzer analyzer,
>> boolean
>>
>> dropIndexOnStart) {
>>
>> IndexWriterConfig config = new IndexWriterConfig(analyzer);
>>
>> if (dropIndexOnStart) {
>>
>> config.setOpenMode(OpenMode.CREATE);
>>
>> } else {
>>
>> config.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>
>> }
>>
>> return config;
>>
>> }
>>
>> ```
>>
>> and createAnalyzer like this:
>>
>> ```
>>
>> protected Analyzer createAnalyzer(boolean lenient) {
>>
>> if (lenient) {
>>
>> return new LenientImapSearchAnalyzer();
>>
>> } else {
>>
>> return new StrictImapSearchAnalyzer();
>>
>> }
>>
>> }
>>
>> ```
>>
>> On 2024-08-10T21:04:15.000+02:00, Gautam Worah
>>
>> <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I don't think I understand the email well but I'll try my best.
>>>
>>> I'm confused as to what could be happening.
>
> Google led me to this StackOverflow link:
>
>
>https://stackoverflow.com/questions/36402235/lucene-stringfield-gets-tokenized-when-doc-is-retrieved-and-stored-again
>
> which references some longstanding old issues about fields changing
> their
>
> "types" and so on.
>
> The docs mention: `NOTE: only the content of a field is returned if
> that
>
> field was stored during indexing. Metadata like boost, omitNorm,
>
> IndexOptions, tokenized, etc., are not preserved.`
>
> Can you check what `doc.get(ID_FIELD)` returns, and if it looks
> right?
>
> Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))`
> query
>
> during update and see if it returns the correct ans?
>
> If the value is not right, perhaps you may have to use the original
> stored
>
> value:
>
>
>https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/search/IndexSearcher.html#storedFields()
>
> for crafting the `updateDocument()` call..
>
> Best,
>
> Gautam Worah.
>
> On Sat, Aug 10, 2024 at 3:12 PM Wojtek <[email protected]> wrote:
>
>> Hi,
>>
>> thank you for reply and apologies for being somewhat "all over
>> the
>>
>> place".
>>
>> Regarding "tokenization" - should it happen if I use StringField?
>>
>> When the document is created (before writing) i see in the
>> debugger
>>
>> it's not tokenized and is of type StringField:
>>
>> ```
>>
>> doc = {Document@4830}
>>
>>>> "Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
>>
>> fields = {ArrayList@5920} size = 1
>>
>> 0 = {StringField@5922}
>>
>>> "stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>
>> ```
>>
>> But once in the update method (document being retrieved) I see it
>>
>> changes to StoredField and is already "tokenized":
>>
>> ```
>>
>> doc = {Document@6526}
>>
>>>
>>>"Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>>>
>>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>>>
>>> stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>
>>>
>>> docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
>>
>> fields = {ArrayList@6548} size = 6
>>
>> 0 = {StoredField@6550}
>>
>>> "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>
>> 1 = {StoredField@6551}
>>
>>> "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
>>
>> 2 = {StringField@6552}
>>
>>> "stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
>>
>> 3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
>>
>> 4 = {LongPoint@6554} "LongPoint <uid:1>"
>>
>> 5 = {StoredField@6555} "stored<uid:1>"
>>
>> ```
>>
>> The code that adds the documents - it's a method implemented in
>> James:
>>
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex
>>
>>[http://org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex]#add`
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240
>>
>> ) that looks fairly straightforward:
>>
>> ```
>>
>> public Mono<Void> add(MailboxSession session, Mailbox mailbox,
>>
>> MailboxMessage membership) {
>>
>> return Mono.fromRunnable(Throwing.runnable(() -> {
>>
>> Document doc = createMessageDocument(session,
>>
>> membership);
>>
>> Document flagsDoc = createFlagsDocument(membership);
>>
>> writer.addDocument(doc);
>>
>> writer.addDocument(flagsDoc);
>>
>> }));
>>
>> }
>>
>> ```
>>
>> similarly to actual method that creates the flags
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290
>>
>> ):
>>
>> ```
>>
>> private Document createFlagsDocument(MailboxMessage message) {
>>
>> Document doc = new Document();
>>
>> doc.add(new StringField(ID_FIELD, "flags-" +
>>
>> message.getMailboxId().serialize() + "-" +
>>
>> Long.toString(message.getUid().asLong()), Store.YES));
>>
>> doc.add(new StringField(MAILBOX_ID_FIELD,
>>
>> message.getMailboxId().serialize(), Store.YES));
>>
>> doc.add(new NumericDocValuesField(UID_FIELD,
>>
>> message.getUid().asLong()));
>>
>> doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
>>
>> doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
>>
>> indexFlags(doc, message.createFlags());
>>
>> return doc;
>>
>> }
>>
>> ```
>>
>> As you can see `StringField` is used when creating the document
>> and to
>>
>> the best of my knowledge and based on what I was told - it
>> _should_
>>
>> not be tokenized (?).
>>
>> Update (in which the document can't be updated because Term seems
>> to
>>
>> be not finding it) is done in
>>
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex
>>
>>[http://org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex]#update()`
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259
>>
>> ):
>>
>> ```
>>
>> private void update(MailboxId mailboxId, MessageUid uid, Flags f)
>>
>> throws IOException {
>>
>> try (IndexReader reader = DirectoryReader.open(writer)
>> [http://DirectoryReader.open(writer)]) {
>>
>> IndexSearcher searcher = new IndexSearcher(reader);
>>
>> BooleanQuery.Builder queryBuilder = new
>>
>> BooleanQuery.Builder();
>>
>> queryBuilder.add(new TermQuery(new
>>
>> Term(MAILBOX_ID_FIELD, mailboxId.serialize())),
>>
>> BooleanClause.Occur.MUST);
>>
>> queryBuilder.add(createQuery(MessageRange.one(uid)
>> [http://MessageRange.one(uid)]),
>>
>> BooleanClause.Occur.MUST);
>>
>> queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD,
>>
>> "")), BooleanClause.Occur.MUST);
>>
>> TopDocs docs = searcher.search(queryBuilder.build
>> [http://searcher.search(queryBuilder.build](),
>>
>> 100000);
>>
>> ScoreDoc[] sDocs = docs.scoreDocs;
>>
>> for (ScoreDoc sDoc : sDocs) {
>>
>> Document doc = searcher.doc(sDoc.doc);
>>
>> doc.removeFields(FLAGS_FIELD);
>>
>> indexFlags(doc, f);
>>
>> // somehow the document getting from the search
>>
>> lost DocValues data for the uid field, we need to re-define the
>> field
>>
>> with proper DocValues.
>>
>> long uidValue =
>>
>> doc.getField("uid").numericValue().longValue();
>>
>> doc.removeField("uid");
>>
>> doc.add(new NumericDocValuesField(UID_FIELD,
>>
>> uidValue));
>>
>> doc.add(new LongPoint(UID_FIELD, uidValue));
>>
>> doc.add(new StoredField(UID_FIELD, uidValue));
>>
>> writer.updateDocument(new Term(ID_FIELD,
>>
>> doc.get(ID_FIELD)), doc);
>>
>> }
>>
>> }
>>
>> }
>>
>> ```
>>
>> I was wondering if Lucene/writer configuration is not a culprit
>> (that
>>
>> would result in tokenizing even StringField) but it looks fairly
>>
>> straightforward:
>>
>> ```
>>
>> this.directory [http://this.directory] = directory;
>>
>> this.writer = new IndexWriter(this.directory
>> [http://this.directory],
>>
>> createConfig(createAnalyzer(lenient), dropIndexOnStart));
>>
>> ```
>>
>> where createConfig looks like this:
>>
>> ```
>>
>> protected IndexWriterConfig createConfig(Analyzer analyzer,
>> boolean
>>
>> dropIndexOnStart) {
>>
>> IndexWriterConfig config = new IndexWriterConfig(analyzer);
>>
>> if (dropIndexOnStart) {
>>
>> config.setOpenMode(OpenMode.CREATE);
>>
>> } else {
>>
>> config.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>
>> }
>>
>> return config;
>>
>> }
>>
>> ```
>>
>> and createAnalyzer like this:
>>
>> ```
>>
>> protected Analyzer createAnalyzer(boolean lenient) {
>>
>> if (lenient) {
>>
>> return new LenientImapSearchAnalyzer();
>>
>> } else {
>>
>> return new StrictImapSearchAnalyzer();
>>
>> }
>>
>> }
>>
>> ```
>>
>> On 2024-08-10T21:04:15.000+02:00, Gautam Worah
>>
>> <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I don't think I understand the email well but I'll try my best.
>
> &g> >>>>>>
>
>>>>>>>>> Hi all!
>>>>>>>>>
>>>>>>>>> There is an effort in Apache James to update to a more modern
>>>>>>>>> version of Lucene (ref:
>>>>>>>>> https://github.com/apache/james-project/pull/2342). I'm digging
>>>>>>>>> into the issue as others have done, but I'm stumped - it seems
>>>>>>>>> that `org.apache.lucene.index.IndexWriter#updateDocument` doesn't
>>>>>>>>> update the document.
>>>>>>>>>
>>>>>>>>> The documentation
>>>>>>>>> (https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable))
>>>>>>>>> states:
>>>>>>>>>
>>>>>>>>> Updates a document by first deleting the document(s) containing
>>>>>>>>> term and then adding the new document. The delete and then add
>>>>>>>>> are atomic as seen by a reader on the same index (flush may
>>>>>>>>> happen only after the add).
>>>>>>>>>
>>>>>>>>> Here is a simple test with it:
>>>>>>>>> https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java
>>>>>>>>> but it fails.
>>>>>>>>>
>>>>>>>>> Any guidance would be appreciated, because I (and others) have
>>>>>>>>> been hitting a wall with it :)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Wojtek
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>> For additional commands, e-mail: [email protected]