Re: [Wt-interest] How to set encoding

Wim Dumon Wed, 24 Feb 2010 02:49:56 -0800

Markus,

Your interpretation of solution 2 is definitely not how it is intended
to be used.
On the other hand, your mail inspired me to investigate some of Wt's
string conversion methods, and a patch is on its way to the git
repository.


The one and only important point wrt internationalization and Wt is:
once your data is correctly stored in (or converted to) a WString, Wt
knows how it is encoded and what every character means. The string
will then be rendered correctly in the browser. (Internally, a WString
stores its data as UTF-8, but that is an implementation detail and
could be anything.) But how do you get your characters correctly into
a WString? There are a number of ways.

- WString::WString(std::wstring) or WString::WString(wchar_t *)
For starters: wchar_t and std::wstring are not portable across
compilers. A wchar_t is 2 bytes on Windows and typically 4 bytes on
Unix. 2 bytes are not enough to cover all characters, so on Windows
you may still need multiple wchar_t units to encode a single character
on exceptional occasions.
WString converts a std::wstring or a wchar_t * immediately into a UTF-8
string, using boost's conversion functions.
You can use this WString constructor if you have a properly
constructed std::wstring (e.g. from a database library, from reading a
file, ...) and want to display it, but also with the L"blah" notation,
if you're sure that the compiler assumes the right character encoding
for your source file. A correct wstring becomes a correct WString. It
is not always possible to convert a WString to a std::wstring without
losses, as wchar_t is compiler-dependent and may be as small as 8 bits.
Use WString::widen() to convert a WString to a std::wstring.
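
To illustrate the conversion described above, here is a minimal sketch
of turning a std::wstring into UTF-8 by hand. It is not Wt's
implementation (which relies on boost's conversion functions); the
helper names appendUtf8 and wideToUtf8 are made up for this example. It
also shows why a 2-byte wchar_t sometimes needs two units (a surrogate
pair) for one character.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of Wt: append one code point as UTF-8.
static void appendUtf8(std::string& out, std::uint32_t cp)
{
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper, not part of Wt: convert a wide string to UTF-8.
std::string wideToUtf8(const std::wstring& in)
{
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
    // With a 16-bit wchar_t, characters beyond U+FFFF arrive as a
    // surrogate pair that has to be combined into one code point.
    if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF
        && i + 1 < in.size()) {
      std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        ++i;
      }
    }
    appendUtf8(out, cp);
  }
  return out;
}
```

Writing the literal as L"Frankf\u00fcrt" (a universal character name)
keeps the example independent of the source file's encoding.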

- WString::WString(std::string, encoding) and WString::WString(char *, encoding)
'encoding' is either UTF-8 or LocalEncoding.
(a) UTF-8
Tells WString that the char * uses the UTF-8 (multibyte) encoding. The
string is stored without modification. Perfectly safe to use.
(b) LocalEncoding
This is the most error-prone way to convert internationalized strings
to a WString. It is best to use these methods only when your char * or
string does not contain internationalized symbols. WString will
convert the string parameter to UTF-8, after widening it. Current
versions of Wt (i.e. 3.1.1) use an unspecified (and buggy) method. We
should probably use the C++ global locale to do the conversion (I'll
submit a patch for this soon). The global locale can be set by
standard C++ methods, e.g. by calling
std::locale::global(std::locale("")), which reads the locale from the
environment. I'm not sure if this is a good idea, as you'll understand
that it is not desirable to have WString("string with weird
characters") interpreted differently depending on the environment in
which it's executed... But if you don't set it, what is the default
locale for every compiler/OS combination? I leave it as an
exercise to the reader.
On Linux, this usually works because UTF-8 encoding is generally used.
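
A small sketch of setting the global locale from the environment, as
mentioned above. The function name useEnvironmentLocale is invented for
this example; the fallback to the classic "C" locale is an assumption
about what an application might want when the environment names a
locale that is not installed (std::locale("") throws
std::runtime_error in that case).

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: install the environment's locale as the C++
// global locale and return the name of the locale now in effect.
std::string useEnvironmentLocale()
{
  try {
    // "" means: take the locale from the environment (LANG, LC_ALL, ...).
    std::locale::global(std::locale(""));
  } catch (const std::runtime_error&) {
    // The environment names an unknown locale: fall back to "C".
    std::locale::global(std::locale::classic());
  }
  // A default-constructed std::locale is a copy of the global locale.
  return std::locale().name();
}
```

The returned name differs from machine to machine, which is exactly
the environment dependence the paragraph above warns about.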

- WString::fromUTF8(std::string) and WString::fromUTF8(char *)
Convenient shorthand for the previous constructor with the UTF8 encoding
parameter. Perfectly safe to use if your string is UTF-8 formatted.
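
Since fromUTF8() is only safe when the bytes really are UTF-8, a rough
structural check before calling it can catch data in another encoding.
The helper looksLikeUtf8 below is hypothetical and not part of Wt; it
only verifies lead and continuation bytes and does not reject every
overlong or surrogate encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is structurally plausible UTF-8.
bool looksLikeUtf8(const std::string& s)
{
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t extra;  // number of continuation bytes expected
    if (c < 0x80)                     extra = 0;
    else if (c >= 0xC2 && c <= 0xDF)  extra = 1;
    else if (c >= 0xE0 && c <= 0xEF)  extra = 2;
    else if (c >= 0xF0 && c <= 0xF4)  extra = 3;
    else return false;                // not a valid lead byte

    if (i + extra >= s.size())
      return false;                   // sequence is truncated

    for (std::size_t k = 1; k <= extra; ++k) {
      unsigned char cc = static_cast<unsigned char>(s[i + k]);
      if ((cc & 0xC0) != 0x80)
        return false;                 // not a continuation byte
    }
    i += extra + 1;
  }
  return true;
}
```

A string that was saved as ISO 8859-1, such as "Frankf\xFCrt", fails
this check because 0xFC is not a valid UTF-8 lead byte.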

- WString::tr(const char *key) and WString::tr(std::string key)
tr() is intended to facilitate development of internationalized
applications (= 1 app in many languages, configurable at runtime, per
session). It looks up the key in a map/database, and replaces it with
the translation of that key, according to the locale that was set by
WApplication::setLocale(). Note that WApplication::setLocale() has
nothing to do with string encodings! It specifies to what language the
tr(key) should be translated.
The default method to handle tr() is to look it up in a 'message
bundle'. This is an XML file, which contains mappings of the keys to
their translations. As an XML file specifies its encoding, there is no
discussion about the meaning of a character (Wt supports UTF-8 and
UTF-16 as XML encodings). If the key is not found, Wt will render the
key with two question marks in front of and behind it. For example:
??button.ok?? (the .xml file should map button.ok to Jäwhöl or
whatever).
There is no reason to use non-ASCII characters in your keys; if you
do, you're back in the
what-encoding-for-my-C-file-does-my-compiler-assume game, which you
want to avoid for portable C source files.
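
The lookup behaviour described above can be sketched with a plain
std::map. This is only an illustration of the idea, not how Wt
implements it: the Bundle type and the resolve function are invented
for this example, and a real bundle is an XML file rather than an
in-memory map.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a message bundle: pure-ASCII keys map to
// UTF-8 encoded translations.
typedef std::map<std::string, std::string> Bundle;

// Returns the translation for key, or ??key?? when the key is unknown.
std::string resolve(const Bundle& bundle, const std::string& key)
{
  Bundle::const_iterator it = bundle.find(key);
  if (it == bundle.end())
    return "??" + key + "??";
  return it->second;
}
```

The keys stay pure ASCII; only the values, which live outside the C++
source file, carry non-ASCII characters.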

So what are my recommendations wrt string literals with non-ascii
characters in C++ files...
1. Don't do it. Use Wt's tr() mechanism to avoid the encoding mess
completely. Use a pure ASCII key within the tr() (note that I wrote
tr("Frankfurt") without umlaut on the u, that was not a typo). Use
WApplication::messageResourceBundle() and store all encoded strings in
a properly formatted external XML file. XML files do specify their
encoding, a C file does not.
2. If you can't resist, use L"íntèrnätïñonal Ç++ string". But then you
have to ensure that the source file encoding assumed by your C
compiler (could be UTF-8, ISO 8859-1, ...) corresponds to the actual
encoding of the C file.


To come back to your other questions:
- What if you are reading strings from a file, a db, ...
You must absolutely know the encoding of the string that you retrieve
from the file, db, ... I strongly recommend saving your files encoded
in UTF-8, configuring your database to return UTF-8, ... If that is
not possible, use boost, iconv, or std C++ methods to convert the
returned string to UTF-8, and WString::fromUTF8() when you use it in
Wt. (Alternatively, convert it to std::wstring but remember the
implications on portability).
- If you're constructing WStrings from literal international strings,
WString(char *, LocalEncoding) is no good for you (unless you
correctly configured the C++ global locale). Try to avoid it, but if
you can't resist, use WString(L"bläh")
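
As an example of converting retrieved data to UTF-8 before handing it
to WString::fromUTF8(), here is a minimal sketch for the simplest
case, assuming the source data is known to be ISO 8859-1. The function
name latin1ToUtf8 is made up for this example; for other encodings a
library such as iconv is the more realistic choice.

```cpp
#include <string>

// Hypothetical helper: convert ISO 8859-1 bytes to UTF-8. Every
// ISO 8859-1 byte value equals its Unicode code point, so bytes of
// 0x80 and above simply become a two-byte UTF-8 sequence.
std::string latin1ToUtf8(const std::string& in)
{
  std::string out;
  out.reserve(in.size());
  for (std::string::const_iterator i = in.begin(); i != in.end(); ++i) {
    unsigned char c = static_cast<unsigned char>(*i);
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else {
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}
```

The result can then be passed to WString::fromUTF8().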

Regarding your last example:
> std::wstring s = L"Frankf?rt";
> WString x = WString(s);// now the 'utf8_' member of WString seems to be
> translated
> std::string s1 = x.narrow(); // now the conversion gets lost...
.narrow() is implemented like this:
std::string narrow(const std::wstring& s)
{
  return std::string(s.begin(), s.end());
}
That is a bug, and I fixed it. Widen was buggy in a similar way
(especially on Windows, where char is signed); I fixed that too.
Locales in C++ always scared me a bit, but I never expected that the
end result would look so simple...
for (std::wstring::const_iterator i = s.begin(); i != s.end(); ++i)
    retval += std::use_facet<std::ctype<wchar_t> >(loc).narrow(*i, '?');
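
For comparison, here are the two variants above as complete functions.
The function names narrowTruncating and narrowWithLocale are invented
for this sketch and are not Wt's actual signatures.

```cpp
#include <locale>
#include <string>

// The old behaviour: each wchar_t is simply truncated to a char.
std::string narrowTruncating(const std::wstring& s)
{
  return std::string(s.begin(), s.end());
}

// The facet-based loop: characters that cannot be represented in the
// given locale are replaced by '?'.
std::string narrowWithLocale(const std::wstring& s, const std::locale& loc)
{
  const std::ctype<wchar_t>& facet =
    std::use_facet<std::ctype<wchar_t> >(loc);
  std::string retval;
  for (std::wstring::const_iterator i = s.begin(); i != s.end(); ++i)
    retval += facet.narrow(*i, '?');
  return retval;
}
```

With std::locale::classic(), a character such as U+4E2D has no
single-byte representation and comes out as '?', whereas the
truncating version silently produces an unrelated byte.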

While fixing the bugs, I extended WString's interface:
- Added WString(char*/string, std::locale)
- Added locale parameter to Wt::narrow() and Wt::widen()
So there's a new method to create a string from a char *, where you
can specify your favorite std::locale that is to be used to
interpret your chars. If you know what you're doing, this will also
result in perfectly constructed WStrings.
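
A sketch of what widening with an explicit locale can look like in
standard C++. The function name widenWithLocale is hypothetical and
this is not Wt's actual code; since std::ctype<wchar_t>::widen works
one char at a time, the approach only suits single-byte encodings.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: widen each char according to the given locale.
std::wstring widenWithLocale(const std::string& s, const std::locale& loc)
{
  const std::ctype<wchar_t>& facet =
    std::use_facet<std::ctype<wchar_t> >(loc);
  std::wstring retval;
  for (std::string::const_iterator i = s.begin(); i != s.end(); ++i)
    retval += facet.widen(*i);
  return retval;
}
```

Passing std::locale::classic() handles plain ASCII; a named locale for
a single-byte encoding could be passed instead, provided it is
installed on the system.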

I hope this clarifies a bit.

Best regards,
Wim.

2010/2/23 Markus Quatember <[email protected]>:
> Hi Wt-Community!
>
> I had the same problem as Jiongliang and Wim sent me the two Quick
> fixes.
>
> First of all: The fixes work fine ;-)
> But I didn't make friends with them so far...
>
> I will try to explain my problems:
>
>> Quick fix 1: try WString(L"Frankf?rt") (and make sure your compiler
>> speaks your .cpp file's locale)
> Ok for constructing WString from literals but what if you are reading
> the Strings from a variable (DB, File,...)
> If I am constructing WString from literals I anyway must change every
> call from WString("...") to WString(L"...") :(
>
>> Quick fix 2: try WString(tr("Frankf?rt")) and put Frankf?rt in a
>> message bundle
> For me this is the fix that's better than the first because it is more
> general.
> Nevertheless I have to change all calls from WString(x) to
> WString(tr(x))...
> I derived from Wt::WLocalizedStrings, did following MS-Win specific and
> called WApplication::setLocalizedStrings(...):
>
> class ConvertStrings : public Wt::WLocalizedStrings
> {
> protected:
>  virtual bool resolveKey( const std::string& key, std::string& result ) override
>  {
>    if( key.empty() )
>      return true;
>
>    std::vector< wchar_t > w;
>    w.resize( key.size() * sizeof( wchar_t ) * 2 );
>    MultiByteToWideChar( CP_ACP, 0, key.c_str(), -1, &w[ 0 ], w.size() );
>
>    vector< char > s;
>    s.resize( w.size() );
>    WideCharToMultiByte( CP_UTF8, 0, &w[ 0 ], -1, &s[ 0 ], s.size(), 0, 0 );
>
>    result = &s[ 0 ];
>    return true;
>  }
> };
>
> But now the problems start, because i cannot recognize if the 'key' is
> already translated or not!
> So following will get me in troubles:
>
> std::string s1 = "Frankf?rt";
> Wt::WString w1(tr( s1 )); // OK
> std::string s2 = w1.narrow();
> Wt::WString w2(tr( s2 )); // Bad, because s2 is already translated
>
> So my question is:
> Are WString and its conversion operations in the constructors ok, or am
> I missing something essential?
> Think about following:
>
> std::wstring s = L"Frankf?rt";
> WString x = WString(s);// now the 'utf8_' member of WString seems to be
> translated
> std::string s1 = x.narrow(); // now the conversion gets lost...
>
> best regards
> Max
>
>
> ------------------------------------------------------------------------------
> Download Intel® Parallel Studio Eval
> Try the new software tools for yourself. Speed compiling, find bugs
> proactively, and fine-tune applications for parallel performance.
> See why Intel Parallel Studio got high marks during beta.
> http://p.sf.net/sfu/intel-sw-dev
> _______________________________________________
> witty-interest mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/witty-interest
>
