Lucene 2.4 introduced a change not documented on the File Formats page

*LUCENE-510: The index now stores strings as true UTF-8 bytes (previously it
was Java's modified UTF-8). If any text, either stored fields or a token,
has illegal UTF-16 surrogate characters, these characters are now silently
replaced with the Unicode replacement character U+FFFD. This is a change to
the index file format.*
*(Marvin Humphrey via Mike McCandless)*
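
To make the difference described in that entry concrete, here is a minimal sketch that uses only the JDK, not Lucene's own writer code. I am assuming that `DataOutputStream.writeUTF` is a fair stand-in for the "modified UTF-8" the entry refers to. The class and method names (`Utf8Difference`, `replaceUnpairedSurrogates`, and so on) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Difference {

    // Standard ("true") UTF-8 as produced by the JDK's UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's modified UTF-8 as written by DataOutputStream.writeUTF.
    // writeUTF emits a two-byte length prefix first, which is stripped here.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Illustrative only: replace unpaired UTF-16 surrogates with U+FFFD,
    // which is the behaviour the changelog entry describes.
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Well-formed pair: copy both halves unchanged.
                sb.append(c).append(s.charAt(++i));
            } else if (Character.isSurrogate(c)) {
                // Lone high or low surrogate.
                sb.append('\uFFFD');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, outside the BMP
        System.out.println(standardUtf8(clef).length); // 4
        System.out.println(modifiedUtf8(clef).length); // 6

        String nul = "\u0000";
        System.out.println(standardUtf8(nul).length);  // 1
        System.out.println(modifiedUtf8(nul).length);  // 2

        String broken = "a\uD834b"; // lone high surrogate
        System.out.println(standardUtf8(replaceUnpairedSurrogates(broken)).length); // 5
    }
}
```

The two encodings agree for most text, but they differ for U+0000 and for anything outside the Basic Multilingual Plane, so a reader that expects one form could misread bytes written in the other.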


Is there a reason this change isn't documented on the File Formats page of
any 2.4+ doc release?

From the 3.0.1 docs:

http://lucene.apache.org/java/3_0_1/fileformats.html
*"Compatibility notes are provided in this document, describing how file
formats have changed from prior versions.*

*In version 2.1, the file format was changed to allow lock-less commits (ie,
no more commit lock). The change is fully backwards compatible: you can open
a pre-2.1 index for searching or adding/deleting of docs. When the new
segments file is saved (committed), it will be written in the new file
format (meaning no specific "upgrade" process is needed). But note that once
a commit has occurred, pre-2.1 Lucene will not be able to read the index.*

*In version 2.3, the file format was changed to allow segments to share a
single set of doc store (vectors & stored fields) files. This allows for
faster indexing in certain cases. The change is fully backwards compatible
(in the same way as the lock-less commits change in 2.1)."*


But there is no mention of the Unicode change in version 2.4. Was it somehow
forwards compatible?

Can I read indexes created in 2.4 with a 2.3-compliant reader?
(Specifically, I'm interested in creating indexes with Lucene 3 and reading
them with CLucene on an iPhone.)

Thanks,

Nathanael
