Hi Guys
We have several hundred thousand indexes that have been written in Lucene 3.x 
format. I am trying to migrate to Lucene 4.10 without the need to reindex and  
the process should be transparent to our customers. Reindexing all our legacy 
data is not an option.

The predominant analyzer we currently use is ClassicAnalyzer, as we needed to 
be backwards compatible with the old StandardAnalyzer from pre-Lucene 3.x days. 

Our latest application uses the Lucene 4.10 jars, and we knobble it to use the 
Lucene 3.x format. When we create IndexWriters, we are doing this:

IndexWriterConfig idxCfg =
    new IndexWriterConfig(Version.LUCENE_3_6, new ClassicAnalyzer());

New indexes could be written in Lucene 4.10 format and we aim to apply newer 
analyzers to these new indexes. So all new index reading and writing should be 
fine. We need to query Lucene 4.10 indexes with Lucene 4.10 analyzers, and our 
architecture is such that we can query Lucene 3.x indexes with Lucene 3.x 
analyzers (using Lucene Versions etc).

However, there is a difference between how Lucene 3.x and Lucene 4.10 write 
indexes which breaks phrase queries. 

Lucene 3.x

Lucene 3.x seems to write tokens to the index adjacent to each other (and I 
assume the positions are stored elsewhere). This means if we index "Thanks for 
coming", it gets indexed as:
"thanks", "coming"
after stop-word removal.
If we use a phrase query and pass it to QueryParser as:
content:"Thanks for coming"
QueryParser will (using lowercasing and stop-word removal via ClassicAnalyzer) 
apply the phrase query content:"thanks coming" and find the document correctly.

Lucene 4.10

My understanding is that Lucene 4.10 keeps the position increments in the 
index as placeholders in the data. I believe this is due to a change in how 
StopFilter works. So if we index our "Thanks for coming" data in Lucene 4.10, 
it appears to be stored as:
"thanks", <pim>, "coming"
Where <pim> is some kind of position increment marker (excuse my ignorance, I 
don't know the low level details).
Now if we send the query through QueryParser, after lowercasing and stop-word 
removal it is the same as before:
content:"thanks coming"
This query fails because there is a <pim> between the two terms. 
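
The difference described above can be sketched with a toy model in plain Java. 
This is not Lucene code and makes no claim about how Lucene stores postings. It 
only assumes that each surviving term carries a position and that an exact 
phrase query requires the same relative positions in the document. The class 
PositionSketch, its stop-word list and its method names are all made up for 
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of token positions. This is not Lucene code. */
final class PositionSketch {

    // Hypothetical stop-word list, for this sketch only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "for", "of", "the"));

    /** A term plus the position it occupies in the field. */
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Lowercases, splits on whitespace and drops stop words.
     * keepGaps=true: a removed stop word still uses up a position.
     * keepGaps=false: surviving terms are numbered consecutively.
     */
    static List<Token> analyze(String text, boolean keepGaps) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (STOP_WORDS.contains(word)) {
                if (keepGaps) {
                    position++; // leave a hole where the stop word was
                }
                continue;
            }
            tokens.add(new Token(word, position));
            position++;
        }
        return tokens;
    }

    static boolean hasTermAt(List<Token> doc, String term, int position) {
        for (Token t : doc) {
            if (t.position == position && t.term.equals(term)) {
                return true;
            }
        }
        return false;
    }

    /** Exact phrase match: the query's relative positions must recur in the document. */
    static boolean phraseMatches(List<Token> doc, List<Token> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        Token first = phrase.get(0);
        for (Token candidate : doc) {
            if (!candidate.term.equals(first.term)) {
                continue;
            }
            int shift = candidate.position - first.position;
            boolean allFound = true;
            for (Token q : phrase) {
                if (!hasTermAt(doc, q.term, q.position + shift)) {
                    allFound = false;
                    break;
                }
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> query = analyze("Thanks for coming", false);    // thanks@0, coming@1
        List<Token> adjacent = analyze("Thanks for coming", false); // thanks@0, coming@1
        List<Token> gapped = analyze("Thanks for coming", true);    // thanks@0, coming@2
        System.out.println(phraseMatches(adjacent, query)); // true
        System.out.println(phraseMatches(gapped, query));   // false
    }
}
```

In this model the gap-free query only finds the gap-free document, which is the 
behaviour seen with the 4.10-written data.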

If we turn on position increments in QueryParser, the document is returned. A 
PhraseQuery with a slop of 1 also finds the document, but that is a different 
query to the one used on Lucene 3.x.

So, knowing which are 3.x indexes and which are 4.x indexes means we can toggle 
position increments in QueryParser, and our customers should be unaware of any 
changes while we start our migration. Well, that was the plan!
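
A rough way to picture the toggle, and why slop 1 is not an equivalent 
workaround, is the following plain-Java sketch. It is a simplification that 
only covers two-term phrases with the terms in query order, and it is not 
Lucene code. IndexFormat, usePositionIncrements and the other names are 
hypothetical; the assumption is that we record somewhere which format each 
index was written in.

```java
/** Toy model of two-term, in-order phrase matching. This is not Lucene code. */
final class QueryModeSketch {

    /** Hypothetical tag kept per index to record how it was written. */
    enum IndexFormat { LUCENE_3X, LUCENE_4X }

    /** Honour position increments in the query only for native 4.x indexes. */
    static boolean usePositionIncrements(IndexFormat format) {
        return format == IndexFormat.LUCENE_4X;
    }

    /** Distance between the two query terms after stop-word removal. */
    static int queryDistance(int stopWordsRemovedBetween, boolean positionIncrements) {
        return positionIncrements ? 1 + stopWordsRemovedBetween : 1;
    }

    /** Match if document and query distances differ by at most slop. */
    static boolean matches(int docDistance, int queryDistance, int slop) {
        return Math.abs(docDistance - queryDistance) <= slop;
    }

    public static void main(String[] args) {
        int doc3x = 1; // "thanks for coming" stored as: thanks, coming
        int doc4x = 2; // same text stored as: thanks, <pim>, coming
        int other = 2; // "thanks again coming": "again" is a real term

        int off = queryDistance(1, usePositionIncrements(IndexFormat.LUCENE_3X)); // 1
        int on = queryDistance(1, usePositionIncrements(IndexFormat.LUCENE_4X));  // 2

        System.out.println(matches(doc3x, off, 0)); // true
        System.out.println(matches(doc4x, on, 0));  // true
        System.out.println(matches(doc4x, off, 0)); // false
        System.out.println(matches(doc4x, off, 1)); // true
        System.out.println(matches(other, off, 1)); // true
        System.out.println(matches(other, off, 0)); // false
    }
}
```

The last two lines show why the slop query is a different query: in this model 
it also accepts a text with a real word between the two terms, which the exact 
phrase would reject.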

Lucene IndexUpgrader

If we take our old 3.x index and apply IndexUpgrader to it, we end up with a 
4.10 index. There are several lucene 4.x files created in the index directory 
and no errors are thrown. However, it appears that the index data is still in 
the 3.x format, namely it remains:
"thanks", "coming"
and not:
"thanks", <pim>, "coming"
This means that although the newly upgraded index is in theory a 4.10 index, we 
still have to use the 3.x QueryParser setting (position increments disabled) 
for phrase queries. Not the end of the world, but if a new document, "Ivan the 
Terrible", is added, we end up with the index containing:
"thanks", "coming"
"ivan", <pim>, "terrible"
So now we have one record in 3.x format and one in 4.10 format, and having a 
hybrid index means we cannot meaningfully use phrase queries on it.
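
The hybrid situation can be illustrated with another small plain-Java sketch. 
Again this is a toy model rather than Lucene code: each document is an array of 
position slots, and null stands in for the <pim> hole. HybridIndexSketch and 
its methods are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a hybrid index. This is not Lucene code. */
final class HybridIndexSketch {

    /** One array of position slots per document; null marks a stop-word hole. */
    static final String[][] INDEX = {
        {"thanks", "coming"},       // doc 0: written by 3.x, no hole
        {"ivan", null, "terrible"}, // doc 1: added after the upgrade, hole kept
    };

    /** Returns the ids of all documents that contain the phrase. */
    static List<Integer> search(String[][] index, String[] phrase) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < index.length; doc++) {
            if (containsPhrase(index[doc], phrase)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Slides the phrase over the slots; a null phrase slot matches anything. */
    static boolean containsPhrase(String[] slots, String[] phrase) {
        for (int start = 0; start + phrase.length <= slots.length; start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.length; i++) {
                if (phrase[i] != null && !phrase[i].equals(slots[start + i])) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Position increments disabled in the query: only the old document is found.
        System.out.println(search(INDEX, new String[] {"thanks", "coming"}));       // [0]
        System.out.println(search(INDEX, new String[] {"ivan", "terrible"}));       // []
        // Position increments enabled in the query: only the new document is found.
        System.out.println(search(INDEX, new String[] {"thanks", null, "coming"})); // []
        System.out.println(search(INDEX, new String[] {"ivan", null, "terrible"})); // [1]
    }
}
```

Neither query setting finds both documents in this model, which is the problem 
with mixing the two layouts in one index.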

The Problem

So we need a way to write documents in 3.x format (no <pim>) to our upgraded 
indexes; new indexes can use the native 4.10 format. 

I have tried turning off positions when writing indexes (DOCS_AND_FREQS only), 
but I can't see how phrase queries can work without positional information, and 
Lucene complains about an illegal state when querying such an index.

So, my cry for help here is:
How do I write documents in 3.x format to a 3.x index upgraded to a 4.10 index 
so that phrase queries work?
Any other suggestions welcome!
Thank you for taking the time to work through this rather lengthy explanation 
and please let me know if I have not described the issue clearly.
If it's just "Duh, you need to just do this..", I'd be a happy man :-)
Many thanks,
Clive
