On May 17, 2006, at 11:08 AM, Doug Cutting wrote:
> Marvin Humphrey wrote:
>> What I'd like to do is augment my existing patch by making it
>> possible to specify a particular encoding, both for Lucene and Luke.
> What ensures that all documents in fact use the same encoding?
In KinoSearch at this moment, zilch. Lucene would still need to read
everything into Java chars and then write it out using the specified
encoding. If we opt for output buffering rather than output counting
(the patch currently does counting, but that would have to change if
we're flexible about the encoding in the index), then
String.getBytes(encoding) would guarantee it.
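Here's a rough sketch of what I mean by the buffering approach, in
plain Java; the class and method names are illustrative only, not
anything in the actual patch:

import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch only. The point is that every string passes
// through the same encoder on its way into the index, so the stored
// bytes are consistent no matter where the chars came from, and the
// length prefix always matches the bytes that follow.
class EncodedStringWriter {
  private final String encoding;   // e.g. "UTF-8", "EUC-JP", "ISO-8859-1"

  EncodedStringWriter(String encoding) {
    this.encoding = encoding;
  }

  void writeString(OutputStream out, String s) throws IOException {
    byte[] bytes = s.getBytes(encoding);  // buffer the encoded form first...
    writeVInt(out, bytes.length);         // ...so the byte count is exact
    out.write(bytes);
  }

  // Minimal stand-in for Lucene's variable-length int encoding.
  private static void writeVInt(OutputStream out, int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }
}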
> The current approach of converting everything to Unicode and then
> writing UTF-8 to indexes makes indexes portable and simplifies the
> construction of search user interfaces, since only indexing code
> needs to know about other character sets and encodings.
Sure. OTOH, it's not so good for CJK users. I also opted against it
in KinoSearch because A) compatibility with the current Java Lucene
file format wasn't going to happen anyway, and B) not all Perlers use
or require valid UTF-8. I've considered adding a UTF8Enforcer
Analyzer subclass, but it hasn't been an issue. Right now, if your
source docs are mucked up, they'll be mucked up when you retrieve
them after searching. If you want to fix that, you preprocess.
Ensuring consistent encoding is the application developer's
responsibility.
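As a sketch of the kind of preprocessing I have in mind (plain Java,
nothing KinoSearch- or patch-specific, and the class name is
hypothetical), one could run the raw bytes through a lenient decoder
before they ever reach the analyzer:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Decode the source bytes as UTF-8, replacing malformed sequences with
// U+FFFD, so whatever reaches the index is at least valid Unicode.
class Utf8Enforcer {
  static String enforce(byte[] raw) {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    try {
      return decoder.decode(ByteBuffer.wrap(raw)).toString();
    } catch (CharacterCodingException e) {
      throw new RuntimeException(e);  // unreachable with REPLACE
    }
  }
}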
> If a collection has invalidly encoded text, how does it help to
> detect that later rather than sooner?
I *think* that whether it was invalidly encoded or not wouldn't
impact searching -- it doesn't in KinoSearch. It should only affect
display. Detecting invalidly encoded text later doesn't help
anything in and of itself; lifting the requirement that everything be
converted to Unicode early on opens up some options.
>> Searches will continue to work regardless because the patched
>> TermBuffer compares raw bytes. (A comparison based on
>> Term.compareTo() would likely fail because raw bytes translated
>> to UTF-8 may not produce the same results.)
> UTF-8 has the property that bytewise lexicographic order is the
> same as Unicode character order.
Yes. I'm suggesting that an unpatched TermBuffer would have problems
with an index of mine that contains corrupt character data, because
for such data the sort order by bytestring may not be the same as the
sort order by Unicode code point. However, the patched TermBuffer uses
compareBytes() rather than compareChars(), so TermInfosReader should
work fine.
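For well-formed text the two orderings do coincide, which is easy to
check with a little standalone program like the one below (not part of
the patch; the helper methods are mine):

public class Utf8OrderCheck {
  public static void main(String[] args) throws Exception {
    String a = "r\uFFFD";          // 'r' then U+FFFD, a BMP character
    String b = "r\uD800\uDC00";    // 'r' then U+10000, a surrogate pair

    // Both comparisons agree: a sorts before b.
    System.out.println("by code point: " + sign(compareCodePoints(a, b)));
    System.out.println("by UTF-8 byte: "
        + sign(compareBytes(a.getBytes("UTF-8"), b.getBytes("UTF-8"))));

    // String.compareTo(), which compares UTF-16 code units, disagrees
    // here because the high surrogate 0xD800 sorts below 0xFFFD.
    System.out.println("by UTF-16 unit: " + sign(a.compareTo(b)));
  }

  static int compareCodePoints(String s1, String s2) {
    int i = 0, j = 0;
    while (i < s1.length() && j < s2.length()) {
      int c1 = s1.codePointAt(i), c2 = s2.codePointAt(j);
      if (c1 != c2) return c1 - c2;
      i += Character.charCount(c1);
      j += Character.charCount(c2);
    }
    return (s1.length() - i) - (s2.length() - j);
  }

  static int compareBytes(byte[] b1, byte[] b2) {
    int end = Math.min(b1.length, b2.length);
    for (int k = 0; k < end; k++) {
      int x = b1[k] & 0xFF, y = b2[k] & 0xFF;
      if (x != y) return x - y;
    }
    return b1.length - b2.length;
  }

  static String sign(int n) { return n < 0 ? "<" : n > 0 ? ">" : "="; }
}

The guarantee only holds when the bytes really are UTF-8, though; for
corrupt byte sequences there's no such correspondence, which is
exactly why compareBytes() sidesteps decoding altogether.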
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
   public final int compareTo(TermBuffer other) {
     if (field == other.field)                // fields are interned
-      return compareChars(text, textLength, other.text, other.textLength);
+      return compareBytes(bytes, bytesLength, other.bytes, other.bytesLength);
     else
       return field.compareTo(other.field);
   }

-  private static final int compareChars(char[] v1, int len1,
-                                        char[] v2, int len2) {
+  private static final int compareBytes(byte[] bytes1, int len1,
+                                        byte[] bytes2, int len2) {
     int end = Math.min(len1, len2);
     for (int k = 0; k < end; k++) {
-      char c1 = v1[k];
-      char c2 = v2[k];
-      if (c1 != c2) {
-        return c1 - c2;
+      int b1 = (bytes1[k] & 0xFF);
+      int b2 = (bytes2[k] & 0xFF);
+      if (b1 != b2) {
+        return b1 - b2;
       }
     }
     return len1 - len2;
   }
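One small note on the masking above, since it's easy to miss: Java
bytes are signed, so comparing them directly would sort 0x80-0xFF
ahead of 0x00-0x7F. The & 0xFF promotes each byte to its unsigned int
value, which is what makes the comparison match bytewise lexicographic
order. A two-line illustration:

byte b = (byte) 0xC3;           // first byte of UTF-8 "é"
System.out.println(b);          // prints -61 (signed value)
System.out.println(b & 0xFF);   // prints 195 (unsigned value used here)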