Hi guys,

We have several hundred thousand indexes that were written in Lucene 3.x format. I am trying to migrate to Lucene 4.10 without reindexing, and the process should be transparent to our customers. Reindexing all our legacy data is not an option.
The predominant analyzer we currently use is ClassicAnalyzer, as we needed to be backwards compatible with the old StandardAnalyzer from pre-Lucene 3.x days. Our latest application uses the Lucene 4.10 jars, and we force it to use the Lucene 3.x format. When we create IndexWriters, we do this:

    IndexWriterConfig idxCfg = new IndexWriterConfig(Version.LUCENE_3_6, new ClassicAnalyzer());

New indexes could be written in Lucene 4.10 format, and we aim to apply newer analyzers to these new indexes, so all new index reading and writing should be fine. We need to query Lucene 4.10 indexes with Lucene 4.10 analyzers, and our architecture is such that we can query Lucene 3.x indexes with Lucene 3.x analyzers (using Lucene Versions etc.).

However, there is a difference between how Lucene 3.x and Lucene 4.10 write indexes, and it breaks phrase queries.

Lucene 3.x

Lucene 3.x seems to write tokens to the index adjacent to each other (and I assume the positions are stored elsewhere). This means that if we index "Thanks for coming", it gets indexed as:

    "thanks", "coming"

after stop-word removal. If we use a phrase query and pass it to QueryParser as:

    content:"Thanks for coming"

QueryParser will (using lowercasing and stop-word removal via ClassicAnalyzer) apply the phrase query content:"thanks coming" and find the document correctly.

Lucene 4.10

My understanding is that Lucene 4.10 keeps the position increments in the index as placeholders in the data. I believe this is due to a change in how StopFilter works. So if we index our "Thanks for coming" data in Lucene 4.10, it appears to be stored as:

    "thanks", <pim>, "coming"

where <pim> is some kind of position increment marker (excuse my ignorance, I don't know the low-level details). If we now send the query through QueryParser, then after lowercasing and stop-word removal it is the same as before: content:"thanks coming". This query fails because there is a <pim> between the two terms.
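To make the two behaviours described above concrete, here is a minimal sketch in plain Java. It is a toy model and not Lucene code: the class PositionGapSketch, its methods, and the two-word stop list are all made up for illustration. It only assumes that a removed stop word either leaves a hole in the position numbering (the 4.10 behaviour as I understand it) or does not (our 3.x indexes), and that an exact phrase query requires the same relative positions on both sides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of token positions and exact phrase matching. NOT Lucene code. */
final class PositionGapSketch {

    // Hypothetical stop list, just large enough for the examples in this post.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("for", "the"));

    /** A term together with the absolute position it was assigned. */
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Lowercases, splits on whitespace and drops stop words.
     * keepGaps = true  : a removed stop word still consumes a position.
     * keepGaps = false : surviving tokens are numbered consecutively.
     */
    static List<Token> analyze(String text, boolean keepGaps) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        int skipped = 0;
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (STOP_WORDS.contains(word)) {
                skipped++;
                continue;
            }
            position += keepGaps ? 1 + skipped : 1;
            skipped = 0;
            tokens.add(new Token(word, position));
        }
        return tokens;
    }

    /** True if the document has the given term at exactly the given position. */
    static boolean hasTermAt(List<Token> doc, String term, int position) {
        for (Token t : doc) {
            if (t.position == position && t.term.equals(term)) {
                return true;
            }
        }
        return false;
    }

    /** Exact phrase match (slop 0): every query term must sit at the same relative position. */
    static boolean exactPhraseMatch(List<Token> doc, List<Token> query) {
        if (query.isEmpty()) {
            return false;
        }
        Token first = query.get(0);
        for (Token start : doc) {
            if (!start.term.equals(first.term)) {
                continue;
            }
            int offset = start.position - first.position;
            boolean allFound = true;
            for (Token q : query) {
                if (!hasTermAt(doc, q.term, q.position + offset)) {
                    allFound = false;
                    break;
                }
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Thanks for coming";
        List<Token> doc3x = analyze(text, false);   // thanks@0, coming@1
        List<Token> doc410 = analyze(text, true);   // thanks@0, coming@2
        System.out.println("3.x doc,  query without gaps: " + exactPhraseMatch(doc3x, analyze(text, false)));
        System.out.println("4.10 doc, query without gaps: " + exactPhraseMatch(doc410, analyze(text, false)));
        System.out.println("4.10 doc, query with gaps:    " + exactPhraseMatch(doc410, analyze(text, true)));
    }
}
```

In this model the phrase only matches when the index side and the query side agree on whether stop words consume a position, which is the mismatch described above.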
If we turn on position increments in QueryParser, the document is returned. A PhraseQuery with a slop of 1 also finds the document, but that is a different query from the one used on Lucene 3.x.

So, knowing which are 3.x indexes and which are 4.x indexes means we can toggle position increments in QueryParser, and our customers should be unaware of any changes while we start our migration. Well, that was the plan!

Lucene IndexUpgrader

If we take our old 3.x index and apply IndexUpgrader to it, we end up with a 4.10 index. Several Lucene 4.x files are created in the index directory and no errors are thrown. However, it appears that the index data is still in the 3.x format, namely it remains:

    "thanks", "coming"

and not:

    "thanks", <pim>, "coming"

This means that although the newly upgraded index is in theory a 4.10 index, we still have to use the 3.x QueryParser setting (position increments disabled) for phrase queries. Not the end of the world, but if a new document "Ivan the Terrible" is added, we end up with the index containing:

    "thanks", "coming"
    "ivan", <pim>, "terrible"

So now we have one record in 3.x format and one in 4.10 format, and having a hybrid index means we cannot meaningfully use phrase queries on it.

The Problem

So we need a way to write documents in 3.x format (no <pim>) to our upgraded indexes; new indexes can use the native 4.10 format.

I have tried turning off positions when writing indexes (DOCS_AND_FREQS only), but I can't see how phrase queries can work without positional information, and Lucene complains about an illegal state when querying such an index.

So, my cry for help here is: how do I write documents in 3.x format to a 3.x index upgraded to a 4.10 index so that phrase queries work?

Any other suggestions are welcome!

Thank you for taking the time to work through this rather lengthy explanation, and please let me know if I have not described the issue clearly. If it's just "Duh, you need to just do this...", I'd be a happy man :-)

Many thanks,
Clive
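
For reference, the hybrid-index problem described above can be reproduced in a small toy model. This is plain Java and not Lucene code: the class HybridIndexSketch, its methods, and the stop list are hypothetical, and the model assumes each term occurs at most once per text. One document is stored without gaps (as in the upgraded 3.x data) and one with gaps (as written by 4.10), and each phrase is then queried with position increments off and on.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model of an index that mixes both position conventions. NOT Lucene code. */
final class HybridIndexSketch {

    // Hypothetical stop list for the two example phrases.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("for", "the"));

    /** Maps each surviving term to its position; assumes every term occurs once. */
    static Map<String, Integer> positions(String text, boolean keepGaps) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int position = -1;
        int skipped = 0;
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (STOP_WORDS.contains(word)) {
                skipped++;
                continue;
            }
            position += keepGaps ? 1 + skipped : 1;
            skipped = 0;
            out.put(word, position);
        }
        return out;
    }

    /** Exact phrase match: all query terms present with one common position offset. */
    static boolean matches(Map<String, Integer> doc, Map<String, Integer> query) {
        Integer offset = null;
        for (Map.Entry<String, Integer> q : query.entrySet()) {
            Integer docPosition = doc.get(q.getKey());
            if (docPosition == null) {
                return false;
            }
            int delta = docPosition - q.getValue();
            if (offset == null) {
                offset = delta;
            } else if (offset != delta) {
                return false;
            }
        }
        return offset != null;
    }

    /** Counts how many of the two phrases find their own document in the hybrid index. */
    static int phrasesFound(boolean queryKeepsGaps) {
        String oldText = "Thanks for coming";   // stored without gaps (upgraded 3.x data)
        String newText = "Ivan the Terrible";   // stored with gaps (written by 4.10)
        Map<String, Integer> oldDoc = positions(oldText, false);
        Map<String, Integer> newDoc = positions(newText, true);
        int found = 0;
        if (matches(oldDoc, positions(oldText, queryKeepsGaps))) {
            found++;
        }
        if (matches(newDoc, positions(newText, queryKeepsGaps))) {
            found++;
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("query without gaps finds " + phrasesFound(false) + " of 2 phrases");
        System.out.println("query with gaps finds    " + phrasesFound(true) + " of 2 phrases");
    }
}
```

In this model neither query setting finds both documents, which is why a single QueryParser setting cannot serve the hybrid index.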