uschindler commented on PR #2342: URL: https://github.com/apache/james-project/pull/2342#issuecomment-2283340652
Hi,

> The issue with StringField (for ID_FIELD) being tokenized may be related to https://issues.apache.org/jira/browse/LUCENE-7171 / https://github.com/apache/lucene/issues/8226 -- from what I can see it was reported for version 5.5 so it would fall just between original lucene version and updated version in James

Uwe here from Lucene.

To put it mildly, the current code (as it already exists) in James is not working correctly at all. Your test cases worked with previous versions of Lucene because they were just testing some arbitrary IDs which luckily analyzed correctly.

The issue is the following: https://github.com/apache/lucene/issues/8226 is not a valid bug and will never be fixed in Lucene, because there's nothing wrong. The issue here is wrong expectations and missing knowledge about how Lucene works.

My recommendation would be to remove the current Lucene support completely and rewrite it, ideally with a configurable index schema or, better, by using Apache Solr, Elasticsearch or OpenSearch (to have the indexing clearly separated and scalable for huge installations). E.g., the Dovecot IMAP server has support for Apache Solr or Elasticsearch, so indexing e-mail is straightforward, and by adapting the schema files shipped with the repo it is possible to customize the text analysis without changing the code.

The following is the main issue: You cannot update documents in Lucene with the following pattern: read a document's stored fields with IndexReader/IndexSearcher, modify the stored fields, and then write the Document instance back. This is unfortunately possible API-wise, but won't work. IndexReader/IndexSearcher#getDocument returns only stored fields and has no metadata about the fields behind them. Because it only reads stored fields in the format they were STORED, reindexing with the original settings used during INDEXING won't work.
Lucene does not know how the document was indexed originally; that information is not stored anywhere in the index. When you then reindex the document, it will apply default settings to all the fields in the Document (which is analyzed/tokenized with the default analyzer). This will transform all StringField instances to TextField. In addition, numeric fields and DocValues fields will all change to analyzed text.

In Lucene 4.x there were several approaches to separate the API of IndexReader#getDocument from IndexWriter#updateDocument, but this was not successful. So the API trap is still there (you can read stored fields from the index and reindex them, but unfortunately with default settings).

The correct way to update a document in Lucene is the following: Rebuild the document using the same code which was used during indexing - don't use any information from the index. E.g., read your document from the database/mail folder/EML file/... and create a completely new document, applying the indexing schema that was configured by the user (languages). It is also important to index IDs as StringField, because any other field type is not supported for IDs. When you do this, the document is easily reachable using its ID (case sensitive) and can be updated.

The additional documents not deleted in your tests come from exactly that problem: The field was indexed correctly using StringField as ID, but was later updated, and therefore the ID field became a TextField. On the next update, the updateDocument call wasn't able to delete the old document, because it wasn't found anymore (as the ID was tokenized and no longer a StringField, suitable for an ID). This explains why you see more documents. So your tests are failing correctly!
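
To make the mechanism above concrete, here is a minimal toy model in plain Java. It is deliberately not Lucene code and does not use Lucene's API: `ToyIdIndex`, `asSingleTerm`, `tokenized` and the sample ID are all hypothetical names made up for this sketch, and the "analyzer" is just a crude lowercase-and-split stand-in. It only assumes what is described above: an update deletes by exact ID term and then adds the new document, and a round-tripped ID ends up tokenized instead of being a single term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model (NOT Lucene): why delete-by-exact-term misses a re-tokenized ID. */
final class ToyIdIndex {
    /** One indexed document: the terms its ID field was indexed under, plus a body. */
    static final class Doc {
        final List<String> idTerms;
        final String body;
        Doc(List<String> idTerms, String body) {
            this.idTerms = idTerms;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** ID kept as one untouched term (stand-in for a StringField-like field). */
    static List<String> asSingleTerm(String id) {
        return List.of(id);
    }

    /** Crude stand-in for a default analyzer: lowercase, split on non-alphanumerics. */
    static List<String> tokenized(String id) {
        List<String> terms = new ArrayList<>();
        for (String t : id.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Mimics update-by-term: delete every doc containing the exact term, then add. */
    void update(String exactIdTerm, List<String> newIdTerms, String body) {
        docs.removeIf(d -> d.idTerms.contains(exactIdTerm));
        docs.add(new Doc(newIdTerms, body));
    }

    int size() {
        return docs.size();
    }

    /** Rebuilding from the original source each time keeps the ID a single term. */
    static int rebuildFromSource() {
        ToyIdIndex idx = new ToyIdIndex();
        String id = "Mailbox-42:UID-7"; // hypothetical ID
        idx.update(id, asSingleTerm(id), "v1");
        idx.update(id, asSingleTerm(id), "v2");
        idx.update(id, asSingleTerm(id), "v3");
        return idx.size();
    }

    /** Round-tripping the stored value loses the field type: the ID gets tokenized. */
    static int roundTripStoredFields() {
        ToyIdIndex idx = new ToyIdIndex();
        String id = "Mailbox-42:UID-7"; // hypothetical ID
        idx.update(id, asSingleTerm(id), "v1"); // initial indexing is correct
        idx.update(id, tokenized(id), "v2");    // deletes v1, re-adds with tokenized ID
        idx.update(id, tokenized(id), "v3");    // exact term no longer matches v2
        return idx.size();
    }

    public static void main(String[] args) {
        System.out.println("rebuild from source: " + rebuildFromSource());
        System.out.println("round-trip stored fields: " + roundTripStoredFields());
    }
}
```

In this sketch `rebuildFromSource()` leaves one document, while `roundTripStoredFields()` leaves two: the second update still finds the original single-term ID, but the third one cannot find the tokenized copy, so a stale document stays behind. That is the same shape as the extra documents seen in the failing tests.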
In earlier versions of Lucene the problems caused by reindexing stored fields were not so bad, because the Analyzers were different and possibly UUIDs were not tokenized. In Lucene 3.x there was also the possibility that a little bit more information was saved in the document returned by IndexReader#getDocument, so the IndexWriter code was better at restoring the original field type. Since Lucene 4 this is no longer possible.

Technically, it would be better if Lucene threw an UnsupportedOperationException (UOE) when you try to reindex a document that was retrieved via IndexReader#getDocument. Maybe we should work on this again to make this wrong use of Lucene's APIs impossible in the future.

Some more ideas:

- If your documents never change in the index except for some flags (I imagine the IMAP flags like read/unread), there is a new DocValues field type which can be updated in place, without reindexing the document. You still need to reopen the IndexReader/NRT reader, but you can tell IndexWriter to just update a docvalues field. Keep in mind that those fields can't be indexed or stored; in-place updates only work when the field is solely a docvalues field (a numeric one). So this is ideal for flags or click counts in online shops. To query docvalues-only fields there are special queries which do a linear scan, but work perfectly as an additional filter. Like "find all documents containing text XY that are unread". The main query is the full text query and the flag "unread" is just a filter applied.
- Make sure not to reopen IndexReader manually, but use SearcherManager, which does this automatically. This feature uses NRT (Near Realtime Search), so it is ideal for searches on recently indexed updates. SearcherManager is new in Lucene 4+ and is the recommended way to make documents visible for search most efficiently, without even committing the IndexWriter.

Uwe

--
This is an automated message from the Apache Git Service.
