Re: [PR] JAMES-4046 Upgrade Lucene [james-project]

via GitHub Mon, 12 Aug 2024 01:07:39 -0700


uschindler commented on PR #2342:
URL: https://github.com/apache/james-project/pull/2342#issuecomment-2283340652


   Hi,
   
   > The issue with StringField (for ID_FIELD) being tokenized may be related 
to https://issues.apache.org/jira/browse/LUCENE-7171 / 
https://github.com/apache/lucene/issues/8226 -- from what I can see it was 
reported for version 5.5 so it would fall just between original lucene version 
and updated version in James
   
   Uwe here from Lucene. Actually, to put it "mildly", the current code (as it 
already exists) in James is not working correctly at all. Your test cases worked 
with previous versions of Lucene because they were only testing some arbitrary 
IDs which luckily analyze correctly.
   
   The issue is the following: https://github.com/apache/lucene/issues/8226 is 
not a valid bug and will never be fixed in Lucene, because there is nothing 
wrong. The issue here is wrong expectations and missing knowledge about how 
Lucene works. My recommendation would be to remove the current version of 
Lucene support completely and rewrite it, ideally with a configurable 
index schema or, better, by using Apache Solr, Elasticsearch or OpenSearch (to 
have the indexing clearly separated and scalable for huge installations). 
E.g., the Dovecot IMAP server has support for Apache Solr and Elasticsearch, so 
indexing e-mail is straightforward, and by adapting the schema files shipped 
with the repo it is possible to customize the text analysis without changing 
the code.
   
   The following is the main issue: You cannot update documents in Lucene with 
the following pattern: Read a document's stored fields with 
IndexReader/IndexSearcher, modify the stored fields and then write the Document 
instance back. This is, unfortunately, possible API-wise, but it won't work. 
IndexReader/IndexSearcher#getDocument returns only stored fields and carries no 
metadata about the fields behind them. Because it only reads stored fields in the 
format in which they were STORED, reindexing with the original settings used during 
INDEXING won't work. Lucene does not know how the document was indexed 
originally; that information is not stored anywhere in the index. When you then 
reindex the document, it will apply default settings to all the fields in the 
Document (that is, analyzed/tokenized with the default analyzer). This will 
transform all StringField instances to TextField. In addition, numeric fields 
and DocValues fields will all change to analyzed text. In Lucene 4.x there were 
several attempts to separate the API of IndexReader#getDocument from 
IndexWriter#updateDocument, but these were not successful. So the API trap is 
still there (you can read stored fields from the index and reindex them, but 
unfortunately only with default settings).
   
   The correct way to update a document in Lucene is the following: Rebuild the 
document using the same code which was used during indexing - don't use any 
information from the index. E.g., read your document from the database/mail 
folder/EML file/... and create a completely new document by applying the 
indexing schema that was configured by the user (languages). It is also important 
to index IDs as StringField, because no other field type is supported for 
IDs. When you do this, the document is easily reachable by its ID (case 
sensitive) and can be updated.
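   
   A minimal sketch of this pattern, assuming a recent Lucene (9.x) on the 
classpath. The field names ("id", "subject", "body"), the sample ID and the 
class name are made up for illustration and are not taken from the James code. 
The point is that every write goes through the same toDocument() method fed 
from the original source, never from a Document read back from the index:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class RebuildOnUpdate {

    // Single place that knows the schema (hypothetical field names).
    static Document toDocument(String id, String subject, String body) {
        Document doc = new Document();
        // ID as StringField: indexed as one exact, case-sensitive term.
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("subject", subject, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        return doc;
    }

    // Used for first-time indexing and for every later update.
    static void index(IndexWriter writer, String id, String subject, String body)
            throws IOException {
        // Deletes any document whose "id" term equals id, then adds the new one.
        writer.updateDocument(new Term("id", id), toDocument(id, subject, body));
    }

    // Writes the same mail twice and returns the number of live documents.
    static int countAfterTwoWrites() throws IOException {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            index(writer, "Msg-42-AbC", "hello", "first version");
            // The "update" rebuilds the document from the source data again.
            index(writer, "Msg-42-AbC", "hello", "second version");
            writer.commit();
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return reader.numDocs();
            }
        }
    }
}
```

   Because the ID stays a StringField on every write, the delete-by-term in 
updateDocument keeps finding the previous version, and the index should hold a 
single document for that ID.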
   
   The additional documents not deleted in your tests come from exactly 
that problem: The field was indexed correctly using StringField as ID, but 
the document was later updated and the ID field therefore became a TextField. 
On the next update, the update was not able to delete the old document, because 
it was not found anymore (as the ID was tokenized and no longer a StringField, 
suitable for an ID). This explains why you see more documents. So your tests are 
failing correctly!
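   
   The effect can be reproduced with a toy model in plain Java, without Lucene. 
Everything here is a deliberate simplification: analyzed() is only a crude 
stand-in for a real analyzer (lowercase, split on non-alphanumerics), and 
update() mimics only the delete-by-exact-term-then-add behaviour of an update. 
It is meant to show the mechanism, not Lucene's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class IdTrapModel {

    // Each "document" is reduced to the terms its id field produced.
    private final List<List<String>> docs = new ArrayList<>();

    // Like a StringField: the whole value is one exact, case-sensitive term.
    static List<String> exact(String id) {
        return List.of(id);
    }

    // Crude analyzer stand-in: lowercase, split on non-alphanumerics.
    static List<String> analyzed(String id) {
        List<String> terms = new ArrayList<>();
        for (String t : id.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Model of an update: delete documents containing the exact term, then add.
    void update(String idTerm, List<String> newIdTerms) {
        docs.removeIf(terms -> terms.contains(idTerm));
        docs.add(newIdTerms);
    }

    // Returns the number of documents left after the read-modify-write cycle.
    static int simulate(String id) {
        IdTrapModel index = new IdTrapModel();
        index.update(id, exact(id));     // initial indexing with an exact ID
        index.update(id, analyzed(id));  // round trip: ID is now tokenized
        index.update(id, analyzed(id));  // delete only works if a token equals id
        return index.docs.size();
    }
}
```

   In this model simulate("abc123") leaves one document, because the single 
token happens to equal the ID, while simulate("Msg-42-AbC") leaves two: the 
tokens "msg", "42" and "abc" never match the term "Msg-42-AbC", so the stale 
document is not deleted. That is the same shape as tests passing with IDs that 
luckily analyze to themselves and failing with others.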
   
   In earlier versions of Lucene the problems caused by reindexing stored 
fields were not so bad, because the Analyzers were different and possibly UUIDs 
were not tokenized. In Lucene 3.x there was also the possibility that a 
little more information was saved in the document returned by 
IndexReader#getDocument, so the IndexWriter code was better at restoring the 
original field type. Since Lucene 4 this is no longer possible. Technically, it 
would be better if Lucene threw an UnsupportedOperationException when you try to 
reindex a document that was retrieved via IndexReader#getDocument. Maybe we 
should work on this again to make this wrong use of Lucene's APIs impossible in 
the future.
   
   Some more ideas:
   - If your documents never change in the index except for some flags (I imagine 
the IMAP flags like read/unread), there is a new DocValues field type which 
can be updated in place, without reindexing the document. You still need to reopen 
the IndexReader/NRT reader, but you can tell IndexWriter to just update a docvalues 
field. Keep in mind that those fields can't be indexed or stored; 
in-place updates only work when the field is solely a docvalues field (a numeric 
one). So this is ideal for flags, or for click counts in online shops. To query 
docvalues-only fields there are special queries which do a linear scan, but 
work perfectly as an additional filter, like "find all documents containing text XY 
that are unread". The main query is the full-text query and the flag "unread" 
is just a filter applied on top.
   - Make sure not to reopen IndexReader manually, but use SearcherManager, 
which takes care of this automatically. This feature uses NRT (Near Realtime 
Search), so it is ideal for searches on recently indexed updates. SearcherManager is 
new in Lucene 4+ and is the recommended and most performant way to make 
documents visible for search without even committing the IndexWriter.
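   
   Both ideas can be combined in a short sketch, again assuming Lucene 9.x on 
the classpath. The field names ("id", "body", "unread"), the sample data and 
the class name are hypothetical. The "unread" flag is a docvalues-only numeric 
field that is updated in place, the query applies it as a filter on top of a 
full-text clause, and a SearcherManager makes the change visible without a 
commit:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FlagUpdateSketch {

    // Returns {hits before the flag update, hits after the flag update}.
    static int[] run() throws IOException {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "Msg-42-AbC", Field.Store.YES));
            doc.add(new TextField("body", "quarterly report attached", Field.Store.NO));
            // Docvalues-only field: neither indexed nor stored, 1 = unread.
            doc.add(new NumericDocValuesField("unread", 1L));
            writer.addDocument(doc);

            // NRT: searchers are opened from the writer, no commit required.
            SearcherManager manager = new SearcherManager(writer, null);
            try {
                // Full-text clause as the main query, flag as a filter.
                Query unreadReports = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("body", "report")),
                         BooleanClause.Occur.MUST)
                    .add(NumericDocValuesField.newSlowExactQuery("unread", 1L),
                         BooleanClause.Occur.FILTER)
                    .build();

                int before = count(manager, unreadReports);
                // In-place update: mark as read without reindexing the document.
                writer.updateNumericDocValue(new Term("id", "Msg-42-AbC"), "unread", 0L);
                // Let the manager reopen the searcher instead of doing it by hand.
                manager.maybeRefreshBlocking();
                int after = count(manager, unreadReports);
                return new int[] {before, after};
            } finally {
                manager.close();
            }
        }
    }

    // Always pair acquire() with release() so old readers can be closed.
    static int count(SearcherManager manager, Query query) throws IOException {
        IndexSearcher searcher = manager.acquire();
        try {
            return searcher.count(query);
        } finally {
            manager.release(searcher);
        }
    }
}
```

   In a real application the refresh would typically be triggered periodically 
or after a batch of changes rather than after every single update; that choice 
is left open here.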
   
   Uwe

