It looks like the lucene_utf8towcs method we are using from CLucene/config/utf8.cpp was provided by RedHat and was probably never intended to work on Windows:
http://www.google.com/codesearch/p?hl=en#7HljlF5wh14/trunk/clucene-core-0.9.21/src/CLucene/config/utf8.cpp&q=lucene_utf8towcs&d=2

We wouldn't need to use these conversion routines if we knew exactly what clucene wants in a TCHAR *. UTF-32? We have methods in SWORD to do conversions without requiring a static buffer. If I knew exactly what clucene expected in the Field c-tor, then I could convert the buffer with one of our methods and supply it to clucene.

If I understand things correctly, Win32 has historically defined wchar_t as 16 bits because its 'w' methods take UCS-2 (Windows 2000) or UTF-16 (later). After examining the lucene_utf8towcs (and consequently lucene_utf8towc) implementations, it looks like lucene_utf8towc can only ever return a single wchar_t for a UTF-8 encoded character. This means it cannot produce proper UTF-16 for Windows, which is never more than one wchar_t (unless I am missing something). In SWORD we never use wchar_t for this reason -- it is ambiguous.

When support was added to SWORD for clucene, clucene's methods took both wchar_t (lucene_utf8towcs) and TCHAR types. I am not sure of the difference, but I hope they eventually become the same thing on the same platform. Since clucene provides its own conversion methods from UTF-8 to, presumably, whatever clucene ultimately wants, we used them so we didn't have to know what encoding clucene ultimately expects.

If it were up to me, I would replace all wchar_t types in clucene with TCHAR, define TCHAR as int32_t (or equivalent) on all platforms, and remove all ambiguity. However, that is not up to me.

From my brief look at the code, I would guess that the current state of Unicode in clucene is this: it supports conversion of UTF-8 to a 32-bit Unicode character stream just fine on Linux (and other platforms that define wchar_t as 32 bits), but it simply will not work on Windows for values greater than 16 bits. My support for this conclusion is the implementation of this method:

size_t lucene_utf8towc(wchar_t *pwc, const char *p, size_t n)
{
    int i, mask = 0;
    int result;
    unsigned char c = (unsigned char) *p;
    int len = 0;

    UTF8_COMPUTE (c, mask, len);
    if (len == -1)
        return 0;
    UTF8_GET (result, p, i, mask, len);

    *pwc = result;
    return len;
}

Notice that it assigns to *pwc (wchar_t) the value of result (int).
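To make the limitation concrete (this is just a sketch of mine, not clucene or SWORD code): a code point above U+FFFF needs two 16-bit units -- a surrogate pair -- when encoded as UTF-16, so on a platform with a 16-bit wchar_t the converter would have to be able to emit two wchar_t's for a single input character, which an interface that fills in one *pwc per character simply cannot do:

/* Illustration only -- not clucene code.  Writing one Unicode code point
 * (a UTF-32 value) into a UTF-16 stream, as a 16-bit wchar_t on Windows
 * would require.  Code points above U+FFFF need TWO output units (a
 * surrogate pair), which lucene_utf8towc's single *pwc cannot carry. */
#include <stdint.h>
#include <stddef.h>

size_t ucs4_to_utf16(uint32_t cp, uint16_t *out /* room for 2 units */)
{
    if (cp < 0x10000) {          /* BMP character: one 16-bit unit */
        out[0] = (uint16_t) cp;
        return 1;
    }
    cp -= 0x10000;               /* supplementary plane: surrogate pair */
    out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}

Ordinary BMP text always takes the one-unit branch, which is probably why this hasn't visibly broken anything on Windows so far.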
Not sure what we should do about this. We can use our methods to convert UTF-8 to UTF-32 (a.k.a. UCS-4) and send that to clucene, which should work fine for clucene on systems that define wchar_t as 32 bits, but will fail miserably on Windows. Maybe we can get the clucene folks' opinion on this? Maybe I've completely misunderstood the situation; otherwise, maybe we can offer to clean this up for them.

Troy

Matthew Talbert wrote:
> OK, I am still not understanding why there is an issue, or what the
> real cause of the issue is. However, this line I think will work:
>
> const unsigned int MAX_CONV_SIZE = 6536 * sizeof(wchar_t) * sizeof(wchar_t);
>
> If somebody can come up with an actual explanation for why there is a
> problem, and a non-hackish solution, that would be great.
>
> Just for the record, wchar_t is 16 bits on win32 and 32 bits on *nix.
> So, if I'm thinking correctly (and I won't guarantee that right now),
> this should give the equivalent of 1024 * 1024;
>
> Matthew
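One footnote on the quoted numbers, since the confusion seems to start with what a "size" even means here: the same count of wchar_t elements is a different number of bytes on each platform. A trivial check (the constant below is only illustrative, not a proposed value):

/* Purely illustrative -- just showing the platform difference in wchar_t size. */
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const unsigned int nchars = 65536;   /* illustrative element count only */
    printf("sizeof(wchar_t)        = %u bytes\n", (unsigned) sizeof(wchar_t));
    printf("buffer of %u wchar_t = %u bytes\n",
           nchars, (unsigned) (nchars * sizeof(wchar_t)));
    return 0;
}

On win32 that prints 2 and 131072; on Linux it prints 4 and 262144. So whatever value we pick, we need to be clear about whether MAX_CONV_SIZE counts wchar_t elements or bytes.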