Hi,

thanks for the link - indeed, https://issues.apache.org/jira/browse/LUCENE-7171 /
https://github.com/apache/lucene/issues/8226 seems to be the issue here.

> Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))` query
> during update and see if it returns the correct ans?

That was exactly it - the query didn't return anything, which is why the
update was failing.
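For the record, here's a tiny plain-Java illustration (no Lucene involved; the
splitting rule below is a made-up stand-in for an analyzer, not Lucene's actual
tokenizer) of why an exact term lookup finds nothing once the id value has been
re-indexed as a tokenized field:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizedIdMismatch {
    // Hypothetical stand-in for an analyzer that splits on non-alphanumerics.
    static List<String> tokenize(String value) {
        return Arrays.asList(value.split("[^A-Za-z0-9]+"));
    }

    public static void main(String[] args) {
        // The composite id "flags-1-1" gets broken into separate tokens...
        List<String> indexedTokens = tokenize("flags-1-1");
        System.out.println(indexedTokens); // [flags, 1, 1]
        // ...so an exact, untokenized lookup of the full value matches nothing.
        System.out.println(indexedTokens.contains("flags-1-1")); // false
    }
}
```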
However, in the end I took a step back: instead of matching on the string ID
field (constructed from the mailbox-id and message-id, which are stored
anyway), I reused the query that is originally used to find the Document to
update, and switched from the term-based update call
(`org.apache.lucene.index.IndexWriter#updateDocument()`) to the query-based
one (`org.apache.lucene.index.IndexWriter#updateDocuments()`), so the same
query is used in both places... and it works :) Another benefit is one less
field stored in the Document :)
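Roughly, the change boils down to something like this - a sketch only, not the
exact James code (it assumes a Lucene 9.x `IndexWriter#updateDocuments(Query,
Iterable)` overload and elides the flags/uid field rebuilding):

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

class QueryBasedUpdate {
    void update(IndexWriter writer, Query query) throws IOException {
        try (IndexReader reader = DirectoryReader.open(writer)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc sDoc : searcher.search(query, 100000).scoreDocs) {
                Document doc = searcher.doc(sDoc.doc);
                // ... rebuild the flags and uid fields as before ...
                // before: delete-by-term on the stored id value, which failed
                // once the re-added field came back tokenized:
                //   writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
                // after: delete by the very query that found the document:
                writer.updateDocuments(query, List.of(doc));
            }
        }
    }
}
```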
On 2024-08-11T02:20:38.000+02:00, Gautam Worah
<[email protected]> wrote:
> I'm confused as to what could be happening.
>
> Google led me to this StackOverflow link:
> https://stackoverflow.com/questions/36402235/lucene-stringfield-gets-tokenized-when-doc-is-retrieved-and-stored-again
> which references some longstanding old issues about fields changing their
> "types" and so on.
>
> The docs mention: `NOTE: only the content of a field is returned if that
> field was stored during indexing. Metadata like boost, omitNorm,
> IndexOptions, tokenized, etc., are not preserved.`
>
> Can you check what `doc.get(ID_FIELD)` returns, and if it looks right?
> Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))` query
> during update and see if it returns the correct ans?
>
> If the value is not right, perhaps you may have to use the original stored
> value:
> https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/search/IndexSearcher.html#storedFields()
> for crafting the `updateDocument()` call.
>
> Best,
> Gautam Worah.
>
> On Sat, Aug 10, 2024 at 3:12 PM Wojtek <[email protected]> wrote:
>
>
>> Hi,
>>
>> thank you for the reply, and apologies for being somewhat "all over the
>> place".
>>
>> Regarding "tokenization" - should it happen if I use StringField? When
>> the document is created (before writing) I see in the debugger that it's
>> not tokenized and is of type StringField:
>>
>> ```
>> doc = {Document@4830} "Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
>>   fields = {ArrayList@5920} size = 1
>>     0 = {StringField@5922} "stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>> ```
>>
>> But once in the update method (with the document retrieved from the
>> index) I see it has changed to StoredField and is already "tokenized":
>>
>> ```
>> doc = {Document@6526} "Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1> stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG> docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
>>   fields = {ArrayList@6548} size = 6
>>     0 = {StoredField@6550} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>     1 = {StoredField@6551} "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
>>     2 = {StringField@6552} "stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
>>     3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
>>     4 = {LongPoint@6554} "LongPoint <uid:1>"
>>     5 = {StoredField@6555} "stored<uid:1>"
>> ```
>>
>> The code that adds the documents is a method implemented in James,
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#add`
>> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240),
>> and it looks fairly straightforward:
>>
>> ```
>> public Mono<Void> add(MailboxSession session, Mailbox mailbox, MailboxMessage membership) {
>>     return Mono.fromRunnable(Throwing.runnable(() -> {
>>         Document doc = createMessageDocument(session, membership);
>>         Document flagsDoc = createFlagsDocument(membership);
>>         writer.addDocument(doc);
>>         writer.addDocument(flagsDoc);
>>     }));
>> }
>> ```
>>
>> as does the method that actually creates the flags document
>> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290):
>>
>> ```
>> private Document createFlagsDocument(MailboxMessage message) {
>>     Document doc = new Document();
>>     doc.add(new StringField(ID_FIELD,
>>         "flags-" + message.getMailboxId().serialize() + "-" + Long.toString(message.getUid().asLong()),
>>         Store.YES));
>>     doc.add(new StringField(MAILBOX_ID_FIELD, message.getMailboxId().serialize(), Store.YES));
>>     doc.add(new NumericDocValuesField(UID_FIELD, message.getUid().asLong()));
>>     doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
>>     doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
>>     indexFlags(doc, message.createFlags());
>>     return doc;
>> }
>> ```
>>
>> As you can see, `StringField` is used when creating the document, and to
>> the best of my knowledge (and based on what I was told) it _should_ not
>> be tokenized (?).
>>
>> The update (in which the document can't be updated, because the Term
>> doesn't seem to find it) is done in
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex#update()`
>> (https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259):
>>
>> ```
>> private void update(MailboxId mailboxId, MessageUid uid, Flags f) throws IOException {
>>     try (IndexReader reader = DirectoryReader.open(writer)) {
>>         IndexSearcher searcher = new IndexSearcher(reader);
>>         BooleanQuery.Builder queryBuilder = new BooleanQuery.Builder();
>>         queryBuilder.add(new TermQuery(new Term(MAILBOX_ID_FIELD, mailboxId.serialize())), BooleanClause.Occur.MUST);
>>         queryBuilder.add(createQuery(MessageRange.one(uid)), BooleanClause.Occur.MUST);
>>         queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD, "")), BooleanClause.Occur.MUST);
>>         TopDocs docs = searcher.search(queryBuilder.build(), 100000);
>>         ScoreDoc[] sDocs = docs.scoreDocs;
>>         for (ScoreDoc sDoc : sDocs) {
>>             Document doc = searcher.doc(sDoc.doc);
>>             doc.removeFields(FLAGS_FIELD);
>>             indexFlags(doc, f);
>>             // somehow the document retrieved from the search lost the DocValues
>>             // data for the uid field; we need to re-define the field with proper DocValues
>>             long uidValue = doc.getField("uid").numericValue().longValue();
>>             doc.removeField("uid");
>>             doc.add(new NumericDocValuesField(UID_FIELD, uidValue));
>>             doc.add(new LongPoint(UID_FIELD, uidValue));
>>             doc.add(new StoredField(UID_FIELD, uidValue));
>>             writer.updateDocument(new Term(ID_FIELD, doc.get(ID_FIELD)), doc);
>>         }
>>     }
>> }
>> ```
>>
>> I was wondering if the Lucene/writer configuration could be the culprit
>> (which would explain tokenizing even a StringField), but it looks fairly
>> straightforward:
>>
>> ```
>> this.directory = directory;
>> this.writer = new IndexWriter(this.directory, createConfig(createAnalyzer(lenient), dropIndexOnStart));
>> ```
>>
>> where createConfig looks like this:
>>
>> ```
>> protected IndexWriterConfig createConfig(Analyzer analyzer, boolean dropIndexOnStart) {
>>     IndexWriterConfig config = new IndexWriterConfig(analyzer);
>>     if (dropIndexOnStart) {
>>         config.setOpenMode(OpenMode.CREATE);
>>     } else {
>>         config.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>     }
>>     return config;
>> }
>> ```
>>
>> and createAnalyzer like this:
>>
>> ```
>> protected Analyzer createAnalyzer(boolean lenient) {
>>     if (lenient) {
>>         return new LenientImapSearchAnalyzer();
>>     } else {
>>         return new StrictImapSearchAnalyzer();
>>     }
>> }
>> ```
>>
>> On 2024-08-10T21:04:15.000+02:00, Gautam Worah
>> <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I don't think I understand the email well but I'll try my best.
>
> I'm confused as to what could be happening.
>
> Google led me to this StackOverflow link:
>
>
>https://stackoverflow.com/questions/36402235/lucene-stringfield-gets-tokenized-when-doc-is-retrieved-and-stored-again
>
> which references some longstanding old issues about fields changing
> their
>
> "types" and so on.
>
> The docs mention: `NOTE: only the content of a field is returned if
> that
>
> field was stored during indexing. Metadata like boost, omitNorm,
>
> IndexOptions, tokenized, etc., are not preserved.`
>
> Can you check what `doc.get(ID_FIELD)` returns, and if it looks
> right?
>
> Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))`
> query
>
> during update and see if it returns the correct ans?
>
> If the value is not right, perhaps you may have to use the original
> stored
>
> value:
>
>
>https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/search/IndexSearcher.html#storedFields()
>
> for crafting the `updateDocument()` call..
>
> Best,
>
> Gautam Worah.
>
> On Sat, Aug 10, 2024 at 3:12 PM Wojtek <[email protected]> wrote:
>
>> Hi,
>>
>> thank you for reply and apologies for being somewhat "all over
>> the
>>
>> place".
>>
>> Regarding "tokenization" - should it happen if I use StringField?
>>
>> When the document is created (before writing) i see in the
>> debugger
>>
>> it's not tokenized and is of type StringField:
>>
>> ```
>>
>> doc = {Document@4830}
>>
>>>> "Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
>>
>> fields = {ArrayList@5920} size = 1
>>
>> 0 = {StringField@5922}
>>
>>> "stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>
>> ```
>>
>> But once in the update method (document being retrieved) I see it
>>
>> changes to StoredField and is already "tokenized":
>>
>> ```
>>
>> doc = {Document@6526}
>>
>>>
>>>"Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>>>
>>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>>>
>>> stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>
>>>
>>> docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
>>
>> fields = {ArrayList@6548} size = 6
>>
>> 0 = {StoredField@6550}
>>
>>> "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>
>> 1 = {StoredField@6551}
>>
>>> "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
>>
>> 2 = {StringField@6552}
>>
>>> "stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
>>
>> 3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
>>
>> 4 = {LongPoint@6554} "LongPoint <uid:1>"
>>
>> 5 = {StoredField@6555} "stored<uid:1>"
>>
>> ```
>>
>> The code that adds the documents - it's a method implemented in
>> James:
>>
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex
>>
>>[http://org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex]#add`
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240
>>
>> ) that looks fairly straightforward:
>>
>> ```
>>
>> public Mono<Void> add(MailboxSession session, Mailbox mailbox,
>>
>> MailboxMessage membership) {
>>
>> return Mono.fromRunnable(Throwing.runnable(() -> {
>>
>> Document doc = createMessageDocument(session,
>>
>> membership);
>>
>> Document flagsDoc = createFlagsDocument(membership);
>>
>> writer.addDocument(doc);
>>
>> writer.addDocument(flagsDoc);
>>
>> }));
>>
>> }
>>
>> ```
>>
>> similarly to actual method that creates the flags
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290
>>
>> ):
>>
>> ```
>>
>> private Document createFlagsDocument(MailboxMessage message) {
>>
>> Document doc = new Document();
>>
>> doc.add(new StringField(ID_FIELD, "flags-" +
>>
>> message.getMailboxId().serialize() + "-" +
>>
>> Long.toString(message.getUid().asLong()), Store.YES));
>>
>> doc.add(new StringField(MAILBOX_ID_FIELD,
>>
>> message.getMailboxId().serialize(), Store.YES));
>>
>> doc.add(new NumericDocValuesField(UID_FIELD,
>>
>> message.getUid().asLong()));
>>
>> doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
>>
>> doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
>>
>> indexFlags(doc, message.createFlags());
>>
>> return doc;
>>
>> }
>>
>> ```
>>
>> As you can see `StringField` is used when creating the document
>> and to
>>
>> the best of my knowledge and based on what I was told - it
>> _should_
>>
>> not be tokenized (?).
>>
>> Update (in which the document can't be updated because Term seems
>> to
>>
>> be not finding it) is done in
>>
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex
>>
>>[http://org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex]#update()`
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259
>>
>> ):
>>
>> ```
>>
>> private void update(MailboxId mailboxId, MessageUid uid, Flags f)
>>
>> throws IOException {
>>
>> try (IndexReader reader = DirectoryReader.open(writer)
>> [http://DirectoryReader.open(writer)]) {
>>
>> IndexSearcher searcher = new IndexSearcher(reader);
>>
>> BooleanQuery.Builder queryBuilder = new
>>
>> BooleanQuery.Builder();
>>
>> queryBuilder.add(new TermQuery(new
>>
>> Term(MAILBOX_ID_FIELD, mailboxId.serialize())),
>>
>> BooleanClause.Occur.MUST);
>>
>> queryBuilder.add(createQuery(MessageRange.one(uid)
>> [http://MessageRange.one(uid)]),
>>
>> BooleanClause.Occur.MUST);
>>
>> queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD,
>>
>> "")), BooleanClause.Occur.MUST);
>>
>> TopDocs docs = searcher.search(queryBuilder.build
>> [http://searcher.search(queryBuilder.build](),
>>
>> 100000);
>>
>> ScoreDoc[] sDocs = docs.scoreDocs;
>>
>> for (ScoreDoc sDoc : sDocs) {
>>
>> Document doc = searcher.doc(sDoc.doc);
>>
>> doc.removeFields(FLAGS_FIELD);
>>
>> indexFlags(doc, f);
>>
>> // somehow the document getting from the search
>>
>> lost DocValues data for the uid field, we need to re-define the
>> field
>>
>> with proper DocValues.
>>
>> long uidValue =
>>
>> doc.getField("uid").numericValue().longValue();
>>
>> doc.removeField("uid");
>>
>> doc.add(new NumericDocValuesField(UID_FIELD,
>>
>> uidValue));
>>
>> doc.add(new LongPoint(UID_FIELD, uidValue));
>>
>> doc.add(new StoredField(UID_FIELD, uidValue));
>>
>> writer.updateDocument(new Term(ID_FIELD,
>>
>> doc.get(ID_FIELD)), doc);
>>
>> }
>>
>> }
>>
>> }
>>
>> ```
>>
>> I was wondering if Lucene/writer configuration is not a culprit
>> (that
>>
>> would result in tokenizing even StringField) but it looks fairly
>>
>> straightforward:
>>
>> ```
>>
>> this.directory [http://this.directory] = directory;
>>
>> this.writer = new IndexWriter(this.directory
>> [http://this.directory],
>>
>> createConfig(createAnalyzer(lenient), dropIndexOnStart));
>>
>> ```
>>
>> where createConfig looks like this:
>>
>> ```
>>
>> protected IndexWriterConfig createConfig(Analyzer analyzer,
>> boolean
>>
>> dropIndexOnStart) {
>>
>> IndexWriterConfig config = new IndexWriterConfig(analyzer);
>>
>> if (dropIndexOnStart) {
>>
>> config.setOpenMode(OpenMode.CREATE);
>>
>> } else {
>>
>> config.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>
>> }
>>
>> return config;
>>
>> }
>>
>> ```
>>
>> and createAnalyzer like this:
>>
>> ```
>>
>> protected Analyzer createAnalyzer(boolean lenient) {
>>
>> if (lenient) {
>>
>> return new LenientImapSearchAnalyzer();
>>
>> } else {
>>
>> return new StrictImapSearchAnalyzer();
>>
>> }
>>
>> }
>>
>> ```
>>
>> On 2024-08-10T21:04:15.000+02:00, Gautam Worah
>>
>> <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I don't think I understand the email well but I'll try my best.
>>>
>>> I'm confused as to what could be happening.
>
> Google led me to this StackOverflow link:
>
>
>https://stackoverflow.com/questions/36402235/lucene-stringfield-gets-tokenized-when-doc-is-retrieved-and-stored-again
>
> which references some longstanding old issues about fields changing
> their
>
> "types" and so on.
>
> The docs mention: `NOTE: only the content of a field is returned if
> that
>
> field was stored during indexing. Metadata like boost, omitNorm,
>
> IndexOptions, tokenized, etc., are not preserved.`
>
> Can you check what `doc.get(ID_FIELD)` returns, and if it looks
> right?
>
> Maybe try a simple `new TermQuery(new Term("id", "flags-1-1"))`
> query
>
> during update and see if it returns the correct ans?
>
> If the value is not right, perhaps you may have to use the original
> stored
>
> value:
>
>
>https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/search/IndexSearcher.html#storedFields()
>
> for crafting the `updateDocument()` call..
>
> Best,
>
> Gautam Worah.
>
> On Sat, Aug 10, 2024 at 3:12 PM Wojtek <[email protected]> wrote:
>
>> Hi,
>>
>> thank you for reply and apologies for being somewhat "all over
>> the
>>
>> place".
>>
>> Regarding "tokenization" - should it happen if I use StringField?
>>
>> When the document is created (before writing) i see in the
>> debugger
>>
>> it's not tokenized and is of type StringField:
>>
>> ```
>>
>> doc = {Document@4830}
>>
>>>> "Document<stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>>"
>>
>> fields = {ArrayList@5920} size = 1
>>
>> 0 = {StringField@5922}
>>
>>> "stored,indexed,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>
>> ```
>>
>> But once in the update method (document being retrieved) I see it
>>
>> changes to StoredField and is already "tokenized":
>>
>> ```
>>
>> doc = {Document@6526}
>>
>>>
>>>"Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>
>>>
>>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>
>>>
>>> stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>
>>>
>>> docValuesType=NUMERIC<uid:1> LongPoint <uid:1> stored<uid:1>>"
>>
>> fields = {ArrayList@6548} size = 6
>>
>> 0 = {StoredField@6550}
>>
>>> "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<id:flags-1-1>"
>>
>> 1 = {StoredField@6551}
>>
>>> "stored,indexed,tokenized,omitNorms,indexOptions=DOCS<mailboxid:1>"
>>
>> 2 = {StringField@6552}
>>
>>> "stored,indexed,omitNorms,indexOptions=DOCS<flags:\FLAG>"
>>
>> 3 = {NumericDocValuesField@6553} "docValuesType=NUMERIC<uid:1>"
>>
>> 4 = {LongPoint@6554} "LongPoint <uid:1>"
>>
>> 5 = {StoredField@6555} "stored<uid:1>"
>>
>> ```
>>
>> The code that adds the documents - it's a method implemented in
>> James:
>>
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex
>>
>>[http://org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex]#add`
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1240
>>
>> ) that looks fairly straightforward:
>>
>> ```
>>
>> public Mono<Void> add(MailboxSession session, Mailbox mailbox,
>>
>> MailboxMessage membership) {
>>
>> return Mono.fromRunnable(Throwing.runnable(() -> {
>>
>> Document doc = createMessageDocument(session,
>>
>> membership);
>>
>> Document flagsDoc = createFlagsDocument(membership);
>>
>> writer.addDocument(doc);
>>
>> writer.addDocument(flagsDoc);
>>
>> }));
>>
>> }
>>
>> ```
>>
>> similarly to actual method that creates the flags
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1290
>>
>> ):
>>
>> ```
>>
>> private Document createFlagsDocument(MailboxMessage message) {
>>
>> Document doc = new Document();
>>
>> doc.add(new StringField(ID_FIELD, "flags-" +
>>
>> message.getMailboxId().serialize() + "-" +
>>
>> Long.toString(message.getUid().asLong()), Store.YES));
>>
>> doc.add(new StringField(MAILBOX_ID_FIELD,
>>
>> message.getMailboxId().serialize(), Store.YES));
>>
>> doc.add(new NumericDocValuesField(UID_FIELD,
>>
>> message.getUid().asLong()));
>>
>> doc.add(new LongPoint(UID_FIELD, message.getUid().asLong()));
>>
>> doc.add(new StoredField(UID_FIELD, message.getUid().asLong()));
>>
>> indexFlags(doc, message.createFlags());
>>
>> return doc;
>>
>> }
>>
>> ```
>>
>> As you can see `StringField` is used when creating the document
>> and to
>>
>> the best of my knowledge and based on what I was told - it
>> _should_
>>
>> not be tokenized (?).
>>
>> Update (in which the document can't be updated because Term seems
>> to
>>
>> be not finding it) is done in
>>
>> `org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex
>>
>>[http://org.apache.james.mailbox.lucene.search.LuceneMessageSearchIndex]#update()`
>>
>> (
>>
>>
>>https://github.com/apache/james-project/blob/85ec4fbfe20637ce50b469ccaf394e6a8509ad6b/mailbox/lucene/src/main/java/org/apache/james/mailbox/lucene/search/LuceneMessageSearchIndex.java#L1259
>>
>> ):
>>
>> ```
>>
>> private void update(MailboxId mailboxId, MessageUid uid, Flags f)
>>
>> throws IOException {
>>
>> try (IndexReader reader = DirectoryReader.open(writer)
>> [http://DirectoryReader.open(writer)]) {
>>
>> IndexSearcher searcher = new IndexSearcher(reader);
>>
>> BooleanQuery.Builder queryBuilder = new
>>
>> BooleanQuery.Builder();
>>
>> queryBuilder.add(new TermQuery(new
>>
>> Term(MAILBOX_ID_FIELD, mailboxId.serialize())),
>>
>> BooleanClause.Occur.MUST);
>>
>> queryBuilder.add(createQuery(MessageRange.one(uid)
>> [http://MessageRange.one(uid)]),
>>
>> BooleanClause.Occur.MUST);
>>
>> queryBuilder.add(new PrefixQuery(new Term(FLAGS_FIELD,
>>
>> "")), BooleanClause.Occur.MUST);
>>
>> TopDocs docs = searcher.search(queryBuilder.build
>> [http://searcher.search(queryBuilder.build](),
>>
>> 100000);
>>
>> ScoreDoc[] sDocs = docs.scoreDocs;
>>
>> for (ScoreDoc sDoc : sDocs) {
>>
>> Document doc = searcher.doc(sDoc.doc);
>>
>> doc.removeFields(FLAGS_FIELD);
>>
>> indexFlags(doc, f);
>>
>> // somehow the document getting from the search
>>
>> lost DocValues data for the uid field, we need to re-define the
>> field
>>
>> with proper DocValues.
>>
>> long uidValue =
>>
>> doc.getField("uid").numericValue().longValue();
>>
>> doc.removeField("uid");
>>
>> doc.add(new NumericDocValuesField(UID_FIELD,
>>
>> uidValue));
>>
>> doc.add(new LongPoint(UID_FIELD, uidValue));
>>
>> doc.add(new StoredField(UID_FIELD, uidValue));
>>
>> writer.updateDocument(new Term(ID_FIELD,
>>
>> doc.get(ID_FIELD)), doc);
>>
>> }
>>
>> }
>>
>> }
>>
>> ```
>>
>> I was wondering if Lucene/writer configuration is not a culprit
>> (that
>>
>> would result in tokenizing even StringField) but it looks fairly
>>
>> straightforward:
>>
>> ```
>>
>> this.directory [http://this.directory] = directory;
>>
>> this.writer = new IndexWriter(this.directory
>> [http://this.directory],
>>
>> createConfig(createAnalyzer(lenient), dropIndexOnStart));
>>
>> ```
>>
>> where createConfig looks like this:
>>
>> ```
>>
>> protected IndexWriterConfig createConfig(Analyzer analyzer,
>> boolean
>>
>> dropIndexOnStart) {
>>
>> IndexWriterConfig config = new IndexWriterConfig(analyzer);
>>
>> if (dropIndexOnStart) {
>>
>> config.setOpenMode(OpenMode.CREATE);
>>
>> } else {
>>
>> config.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>
>> }
>>
>> return config;
>>
>> }
>>
>> ```
>>
>> and createAnalyzer like this:
>>
>> ```
>>
>> protected Analyzer createAnalyzer(boolean lenient) {
>>
>> if (lenient) {
>>
>> return new LenientImapSearchAnalyzer();
>>
>> } else {
>>
>> return new StrictImapSearchAnalyzer();
>>
>> }
>>
>> }
>>
>> ```
>>
>> On 2024-08-10T21:04:15.000+02:00, Gautam Worah
>>
>> <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I don't think I understand the email well but I'll try my best.
>
> &g> >>>>>>
>
>>>>>>>>> Hi all!
>>>>>>>>>
>>>>>>>>> There is an effort in Apache James to update to a more modern
>>>>>>>>> version of Lucene (ref:
>>>>>>>>> https://github.com/apache/james-project/pull/2342). I'm digging
>>>>>>>>> into the issue as others have done, but I'm stumped - it seems
>>>>>>>>> that `org.apache.lucene.index.IndexWriter#updateDocument` doesn't
>>>>>>>>> update the document.
>>>>>>>>>
>>>>>>>>> The documentation
>>>>>>>>> (https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,java.lang.Iterable))
>>>>>>>>> states:
>>>>>>>>>
>>>>>>>>> Updates a document by first deleting the document(s) containing
>>>>>>>>> term and then adding the new document. The delete and then add
>>>>>>>>> are atomic as seen by a reader on the same index (flush may
>>>>>>>>> happen only after the add).
>>>>>>>>>
>>>>>>>>> Here is a simple test with it:
>>>>>>>>> https://github.com/woj-tek/lucene-update-test/blob/master/src/test/java/se/unir/AppTest.java
>>>>>>>>> but it fails.
>>>>>>>>>
>>>>>>>>> Any guidance would be appreciated, because I (and others) have
>>>>>>>>> been hitting a wall with it :)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Wojtek
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>> For additional commands, e-mail: [email protected]