It looks like the lucene_utf8towcs method we are using from CLucene/config/utf8.cpp was provided by RedHat and was probably never intended to work on Windows:
http://www.google.com/codesearch/p?hl=en#7HljlF5wh14/trunk/clucene-core-0.9.21/src/CLucene/config/utf8.cpp&q=lucene_utf8towcs&d=2

We wouldn't need to use these conversion routines if we knew exactly what clucene wants in a TCHAR *. UTF-32? We have methods in SWORD to do conversions without requiring a static buffer. If I knew exactly what clucene expected in the Field c-tor, then I could convert the buffer with one of our methods and supply it to clucene.

If I understand things correctly, Win32 has historically defined wchar_t as 16 bits because its 'w' methods take UCS-2 (Windows 2000) or UTF-16 (later). After examining the lucene_utf8towcs (and consequently lucene_utf8towc) implementations, it looks like lucene_utf8towc can only ever return a single wchar_t for a UTF-8 encoded character. This means it cannot produce proper UTF-16 for Windows, which is never more than one wchar_t (unless I am missing something). In SWORD we never use wchar_t for this reason -- it is ambiguous.

When support was added to SWORD for clucene, clucene's methods took both wchar_t (lucene_utf8towcs) and TCHAR types. I am not sure of the difference, but I hope they eventually become the same thing on the same platform. Since clucene provides its own conversion methods from UTF-8 to, presumably, whatever clucene ultimately wants, we used them so we didn't have to know what encoding clucene ultimately expects.

If it were up to me, I would replace all wchar_t types in clucene with TCHAR, define TCHAR as int32_t (or equivalent) on all platforms, and remove all ambiguity. However, that is not up to me.

From my brief look at the code, I would guess that the current state of Unicode in clucene is this: it supports conversion of UTF-8 to a 32-bit Unicode character stream just fine on Linux (and other platforms that define wchar_t as 32 bits), but it simply will not work on Windows for values greater than 16 bits. My support for this conclusion is the implementation of this method:

size_t lucene_utf8towc(wchar_t *pwc, const char *p, size_t n)
{
    int i, mask = 0;
    int result;
    unsigned char c = (unsigned char) *p;
    int len = 0;

    UTF8_COMPUTE (c, mask, len);
    if (len == -1)
        return 0;
    UTF8_GET (result, p, i, mask, len);

    *pwc = result;
    return len;
}

Notice that it assigns to *pwc (wchar_t) the value of result (int).
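To make the limitation concrete (this is just a sketch of mine, not clucene or SWORD code): a code point above U+FFFF needs two 16-bit units -- a surrogate pair -- when encoded as UTF-16, so on a platform with a 16-bit wchar_t the converter would have to be able to emit two wchar_t's for a single input character, which an interface that fills in one *pwc per character simply cannot do:

/* Illustration only -- not clucene code.  Writing one Unicode code point
 * (a UTF-32 value) into a UTF-16 stream, as a 16-bit wchar_t on Windows
 * would require.  Code points above U+FFFF need TWO output units (a
 * surrogate pair), which lucene_utf8towc's single *pwc cannot carry. */
#include <stdint.h>
#include <stddef.h>

size_t ucs4_to_utf16(uint32_t cp, uint16_t *out /* room for 2 units */)
{
    if (cp < 0x10000) {          /* BMP character: one 16-bit unit */
        out[0] = (uint16_t) cp;
        return 1;
    }
    cp -= 0x10000;               /* supplementary plane: surrogate pair */
    out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}

Ordinary BMP text always takes the one-unit branch, which is probably why this hasn't visibly broken anything on Windows so far.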
Not sure what we should do about this. We can use our methods to convert UTF-8 to UTF-32 (a.k.a. UCS-4) and send that to clucene, which should work fine for clucene on systems that define wchar_t as 32 bits, but will fail miserably on Windows. Maybe we can get the clucene folks' opinion on this? Maybe I've completely misunderstood the situation; otherwise, maybe we can offer to clean this up for them.

Troy

Matthew Talbert wrote:
> OK, I am still not understanding why there is an issue, or what the
> real cause of the issue is. However, this line I think will work:
>
> const unsigned int MAX_CONV_SIZE = 6536 * sizeof(wchar_t) * sizeof(wchar_t);
>
> If somebody can come up with an actual explanation for why there is a
> problem, and a non-hackish solution, that would be great.
>
> Just for the record, wchar_t is 16 bits on win32 and 32 bits on *nix.
> So, if I'm thinking correctly (and I won't guarantee that right now),
> this should give the equivalent of 1024 * 1024;
>
> Matthew
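One footnote on the quoted numbers, since the confusion seems to start with what a "size" even means here: the same count of wchar_t elements is a different number of bytes on each platform. A trivial check (the constant below is only illustrative, not a proposed value):

/* Purely illustrative -- just showing the platform difference in wchar_t size. */
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const unsigned int nchars = 65536;   /* illustrative element count only */
    printf("sizeof(wchar_t)        = %u bytes\n", (unsigned) sizeof(wchar_t));
    printf("buffer of %u wchar_t = %u bytes\n",
           nchars, (unsigned) (nchars * sizeof(wchar_t)));
    return 0;
}

On win32 that prints 2 and 131072; on Linux it prints 4 and 262144. So whatever value we pick, we need to be clear about whether MAX_CONV_SIZE counts wchar_t elements or bytes.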