I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8?
I understand the index format, but I'm not sure I understand the problem
that would be posed by the prefix length being a byte count.

-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 8/30/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> [EMAIL PROTECTED] wrote:
> > How will the difference impact String memory allocations? Looking at
> > the String code, I can't see where it would make an impact.
>
> I spoke a bit too soon. I should have looked at the code first. You're
> right, I don't think it would require more allocations.
>
> When considering this byte-count versus character-count issue please
> note that it also arises elsewhere. The PrefixLength in the Term
> Dictionary section of the file format document is currently defined as a
> number of characters, not bytes.
>
> http://lucene.apache.org/java/docs/fileformats.html#Term Dictionary
>
> Implementing this in terms of bytes may have performance implications,
> since, at first glance, the entire byte sequence would need to be
> converted from UTF-8 into the internal string representation for each
> term, rather than just the suffix. Does anyone see a way around that?
>
> As for how we got to this point: I wrote Lucene's UTF-8 reading and
> writing code in 1998, back when Unicode still had fewer than 2^16
> characters. It's surprising that it has lasted this long without anyone
> noticing!
>
> Doug
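
To make the byte-count versus character-count question concrete, here is a
minimal sketch of the two ways a prefix-compressed term could be rebuilt from
the previous term. This is not Lucene's actual code: the class and method
names are hypothetical, and it uses standard UTF-8 through
java.nio.charset.StandardCharsets rather than Lucene's own reading and
writing routines.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Lucene.
final class PrefixLengthSketch {

    // Shared prefix length counted in Java chars (UTF-16 code units).
    static int sharedChars(String prev, String cur) {
        int n = Math.min(prev.length(), cur.length());
        int i = 0;
        while (i < n && prev.charAt(i) == cur.charAt(i)) {
            i++;
        }
        return i;
    }

    // Shared prefix length counted in UTF-8 bytes.
    static int sharedBytes(byte[] prev, byte[] cur) {
        int n = Math.min(prev.length, cur.length);
        int i = 0;
        while (i < n && prev[i] == cur[i]) {
            i++;
        }
        return i;
    }

    // Char-count prefix: the already-decoded previous term supplies the
    // prefix, so only the suffix bytes are converted from UTF-8.
    static String decodeWithCharPrefix(String prev, int prefixChars, byte[] suffixUtf8) {
        return prev.substring(0, prefixChars)
                + new String(suffixUtf8, StandardCharsets.UTF_8);
    }

    // Byte-count prefix, naive version: splice the bytes together and
    // convert the whole sequence, because the byte prefix may end in the
    // middle of a multi-byte character.
    static String decodeWithBytePrefix(byte[] prevUtf8, int prefixBytes, byte[] suffixUtf8) {
        byte[] full = Arrays.copyOf(prevUtf8, prefixBytes + suffixUtf8.length);
        System.arraycopy(suffixUtf8, 0, full, prefixBytes, suffixUtf8.length);
        return new String(full, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String prev = "caf\u00e9";   // e-acute takes two bytes in UTF-8
        String cur = "caf\u00e9s";
        byte[] prevBytes = prev.getBytes(StandardCharsets.UTF_8);
        byte[] curBytes = cur.getBytes(StandardCharsets.UTF_8);

        int chars = sharedChars(prev, cur);
        int bytes = sharedBytes(prevBytes, curBytes);
        System.out.println("shared chars = " + chars + ", shared bytes = " + bytes);

        byte[] suffixAfterChars = cur.substring(chars).getBytes(StandardCharsets.UTF_8);
        byte[] suffixAfterBytes = Arrays.copyOfRange(curBytes, bytes, curBytes.length);
        System.out.println(decodeWithCharPrefix(prev, chars, suffixAfterChars));
        System.out.println(decodeWithBytePrefix(prevBytes, bytes, suffixAfterBytes));
    }
}
```

In this sketch both paths rebuild the same term, but the two prefix lengths
differ as soon as the shared prefix contains a non-ASCII character, and the
byte-count path as written converts every byte of the term rather than just
the suffix. Whether a reader could avoid that extra conversion, for example
by tracking where character boundaries fall in the previous term's bytes, is
the open question in this thread.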
