On May 17, 2006, at 11:08 AM, Doug Cutting wrote:
> Marvin Humphrey wrote:
>> What I'd like to do is augment my existing patch by making it
>> possible to specify a particular encoding, both for Lucene and Luke.
> What ensures that all documents in fact use the same encoding?
In KinoSearch at this moment, zilch. Lucene would still need to read
everything into Java chars and then write it out using the specified
encoding. If we opt for output buffering rather than output counting
(the patch currently does counting, but that would have to change if
we're flexible about the encoding in the index), then
String.getBytes(encoding) would guarantee it.
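Here's a rough sketch of what I mean by the buffering approach, in
plain Java; the class and method names are illustrative only, not
anything in the actual patch:

import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch only. The point is that every string passes
// through the same encoder on its way into the index, so the stored
// bytes are consistent no matter where the chars came from, and the
// length prefix always matches the bytes that follow.
class EncodedStringWriter {
  private final String encoding;   // e.g. "UTF-8", "EUC-JP", "ISO-8859-1"

  EncodedStringWriter(String encoding) {
    this.encoding = encoding;
  }

  void writeString(OutputStream out, String s) throws IOException {
    byte[] bytes = s.getBytes(encoding);  // buffer the encoded form first...
    writeVInt(out, bytes.length);         // ...so the byte count is exact
    out.write(bytes);
  }

  // Minimal stand-in for Lucene's variable-length int encoding.
  private static void writeVInt(OutputStream out, int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }
}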
> The current approach of converting everything to Unicode and then
> writing UTF-8 to indexes makes indexes portable and simplifies the
> construction of search user interfaces, since only indexing code
> needs to know about other character sets and encodings.
Sure. OTOH, it's not so good for CJK users. I also opted against it
in KinoSearch because A) compatibility with the current Java Lucene
file format wasn't going to happen anyway, and B) not all Perlers use
or require valid UTF-8. I've considered adding a UTF8Enforcer
Analyzer subclass, but it hasn't been an issue. Right now, if your
source docs are mucked up, they'll be mucked up when you retrieve
them after searching. If you want to fix that, you preprocess.
Ensuring consistent encoding is the application developer's
responsibility.
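As a sketch of the kind of preprocessing I have in mind (plain Java,
nothing KinoSearch- or patch-specific, and the class name is
hypothetical), one could run the raw bytes through a lenient decoder
before they ever reach the analyzer:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Decode the source bytes as UTF-8, replacing malformed sequences with
// U+FFFD, so whatever reaches the index is at least valid Unicode.
class Utf8Enforcer {
  static String enforce(byte[] raw) {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    try {
      return decoder.decode(ByteBuffer.wrap(raw)).toString();
    } catch (CharacterCodingException e) {
      throw new RuntimeException(e);  // unreachable with REPLACE
    }
  }
}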
> If a collection has invalidly encoded text, how does it help to
> detect that later rather than sooner?
I *think* that whether it was invalidly encoded or not wouldn't
impact searching -- it doesn't in KinoSearch. It should only affect
display. Detecting invalidly encoded text later doesn't help
anything in and of itself; lifting the requirement that everything be
converted to Unicode early on opens up some options.
>> Searches will continue to work regardless because the patched
>> TermBuffer compares raw bytes. (A comparison based on
>> Term.compareTo() would likely fail because raw bytes translated
>> to UTF-8 may not produce the same results.)
> UTF-8 has the property that bytewise lexicographic order is the
> same as Unicode character order.
Yes. I'm suggesting that an unpatched TermBuffer would have problems
with an index of mine that contains corrupt character data, because
for such data the sort order by bytestring may not be the same as the
sort order by Unicode code point. However, the patched TermBuffer uses
compareBytes() rather than compareChars(), so TermInfosReader should
work fine.
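For well-formed text the two orderings do coincide, which is easy to
check with a little standalone program like the one below (not part of
the patch; the helper methods are mine):

public class Utf8OrderCheck {
  public static void main(String[] args) throws Exception {
    String a = "r\uFFFD";          // 'r' then U+FFFD, a BMP character
    String b = "r\uD800\uDC00";    // 'r' then U+10000, a surrogate pair

    // Both comparisons agree: a sorts before b.
    System.out.println("by code point: " + sign(compareCodePoints(a, b)));
    System.out.println("by UTF-8 byte: "
        + sign(compareBytes(a.getBytes("UTF-8"), b.getBytes("UTF-8"))));

    // String.compareTo(), which compares UTF-16 code units, disagrees
    // here because the high surrogate 0xD800 sorts below 0xFFFD.
    System.out.println("by UTF-16 unit: " + sign(a.compareTo(b)));
  }

  static int compareCodePoints(String s1, String s2) {
    int i = 0, j = 0;
    while (i < s1.length() && j < s2.length()) {
      int c1 = s1.codePointAt(i), c2 = s2.codePointAt(j);
      if (c1 != c2) return c1 - c2;
      i += Character.charCount(c1);
      j += Character.charCount(c2);
    }
    return (s1.length() - i) - (s2.length() - j);
  }

  static int compareBytes(byte[] b1, byte[] b2) {
    int end = Math.min(b1.length, b2.length);
    for (int k = 0; k < end; k++) {
      int x = b1[k] & 0xFF, y = b2[k] & 0xFF;
      if (x != y) return x - y;
    }
    return b1.length - b2.length;
  }

  static String sign(int n) { return n < 0 ? "<" : n > 0 ? ">" : "="; }
}

The guarantee only holds when the bytes really are UTF-8, though; for
corrupt byte sequences there's no such correspondence, which is
exactly why compareBytes() sidesteps decoding altogether.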
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
   public final int compareTo(TermBuffer other) {
     if (field == other.field)                // fields are interned
-      return compareChars(text, textLength, other.text, other.textLength);
+      return compareBytes(bytes, bytesLength, other.bytes, other.bytesLength);
     else
       return field.compareTo(other.field);
   }

-  private static final int compareChars(char[] v1, int len1,
-                                        char[] v2, int len2) {
+  private static final int compareBytes(byte[] bytes1, int len1,
+                                        byte[] bytes2, int len2) {
     int end = Math.min(len1, len2);
     for (int k = 0; k < end; k++) {
-      char c1 = v1[k];
-      char c2 = v2[k];
-      if (c1 != c2) {
-        return c1 - c2;
+      int b1 = (bytes1[k] & 0xFF);
+      int b2 = (bytes2[k] & 0xFF);
+      if (b1 != b2) {
+        return b1 - b2;
       }
     }
     return len1 - len2;
   }
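One small note on the masking above, since it's easy to miss: Java
bytes are signed, so comparing them directly would sort 0x80-0xFF
ahead of 0x00-0x7F. The & 0xFF promotes each byte to its unsigned int
value, which is what makes the comparison match bytewise lexicographic
order. A two-line illustration:

byte b = (byte) 0xC3;           // first byte of UTF-8 "é"
System.out.println(b);          // prints -61 (signed value)
System.out.println(b & 0xFF);   // prints 195 (unsigned value used here)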