Abdelrazak Younes wrote:
> Georg Baum wrote:
>> That would be conceptually wrong. If you convert a given UCS4 character
>> into an eightbit encoding, you never know whether the result will be only
>> one character, not even in fixed-width encodings. For example, the
>> single-byte fixed-width encoding iso_8859-7 has two modifier letters,
>> REVERSED COMMA and APOSTROPHE. Therefore a single UCS4 character can
>> result in two iso_8859-7 characters.
>
> If that is true, then we have a problem in Encoding::init(), because we
> only test the first 256 characters for fixed-width encodings.
Right. I overlooked that case.
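For illustration, a probe along the following lines lists the code points
that need more than one byte in a given eightbit encoding. This is only a
sketch: it assumes a glibc-style iconv() prototype and the encoding names
"UCS-4BE" and "ISO-8859-7", and which code points (if any) actually expand
depends on the iconv implementation.

// Sketch only: report how many bytes a single UCS4 code point occupies
// after conversion to an eightbit encoding.
#include <iconv.h>
#include <cerrno>
#include <cstdio>

// Returns the number of output bytes, or -1 if the code point cannot be
// converted at all.
static int bytes_for_ucs4(iconv_t cd, unsigned int cp)
{
    // Feed the code point as 4 big-endian bytes (matching "UCS-4BE").
    char in[4] = {
        static_cast<char>((cp >> 24) & 0xff),
        static_cast<char>((cp >> 16) & 0xff),
        static_cast<char>((cp >> 8) & 0xff),
        static_cast<char>(cp & 0xff)
    };
    char out[16];
    char * inbuf = in;
    size_t inbytes = sizeof(in);
    char * outbuf = out;
    size_t outbytes = sizeof(out);

    if (iconv(cd, &inbuf, &inbytes, &outbuf, &outbytes) == (size_t)(-1)) {
        // Not representable (or other error): reset the conversion state.
        iconv(cd, 0, 0, 0, 0);
        return -1;
    }
    return static_cast<int>(sizeof(out) - outbytes);
}

int main()
{
    iconv_t cd = iconv_open("ISO-8859-7", "UCS-4BE");
    if (cd == (iconv_t)(-1)) {
        std::perror("iconv_open");
        return 1;
    }
    // Greek Extended contains precomposed letters that an iconv
    // implementation might render as base letter + modifier letter,
    // i.e. as more than one eightbit character.
    for (unsigned int cp = 0x1f00; cp <= 0x1fff; ++cp) {
        int const n = bytes_for_ucs4(cd, cp);
        if (n > 1)
            std::printf("U+%04X -> %d bytes\n", cp, n);
    }
    iconv_close(cd);
    return 0;
}

The same kind of probe over the whole UCS4 range (instead of only the first
256 code points) would also show whether the fixed-width assumption really
holds for a given encoding.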
>> I believe that I once read about an encoding that needs more than 4 bytes
>> for one code point, but am not 100% sure. Since it does not cost anything
>> to support such a beast it should be supported IMHO.
>
> OK.
Note that the test in my patch might be incorrect, so supporting such a
beast does not quite come for free.
if (bytes >= 0)
    out.resize(bytes);
else if (errno == E2BIG)
    // Use the unoptimized version.
    // This only happens for exotic encodings.
    out = ucs4_to_eightbit(&ucs4, 1, encoding);
else
    out.clear();
should be better.
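To make the E2BIG case a bit more concrete, here is a self-contained sketch
of the same pattern. This is only a sketch, not the actual patch: the buffer
sizes, the state reset, the big-buffer retry and the name convert_one are
assumptions; in the patch the fallback would be the unoptimized
ucs4_to_eightbit(), and the exact iconv() prototype differs between
implementations.

// Sketch only: convert one UCS4 code point with a small fast-path buffer
// and fall back to a larger buffer when iconv reports E2BIG.
#include <iconv.h>
#include <cerrno>
#include <vector>

// cd must have been opened with iconv_open(<eightbit encoding>, "UCS-4BE").
static std::vector<char> convert_one(iconv_t cd, unsigned int cp)
{
    char in[4] = {
        static_cast<char>((cp >> 24) & 0xff),
        static_cast<char>((cp >> 16) & 0xff),
        static_cast<char>((cp >> 8) & 0xff),
        static_cast<char>(cp & 0xff)
    };

    // Fast path: assume the result fits into a few bytes.
    char small[4];
    char * inbuf = in;
    size_t inbytes = sizeof(in);
    char * outbuf = small;
    size_t outbytes = sizeof(small);

    std::vector<char> out;
    errno = 0;
    if (iconv(cd, &inbuf, &inbytes, &outbuf, &outbytes) != (size_t)(-1)) {
        out.assign(small, small + (sizeof(small) - outbytes));
    } else if (errno == E2BIG) {
        // Output buffer too small: reset the conversion state and retry
        // with a generously sized buffer. Should only happen for exotic
        // encodings.
        iconv(cd, 0, 0, 0, 0);
        char big[64];
        inbuf = in;
        inbytes = sizeof(in);
        outbuf = big;
        outbytes = sizeof(big);
        if (iconv(cd, &inbuf, &inbytes, &outbuf, &outbytes) != (size_t)(-1))
            out.assign(big, big + (sizeof(big) - outbytes));
    }
    // On any other error the result stays empty (shift-state flushing for
    // stateful encodings is omitted here).
    return out;
}

The caller would keep one iconv_t per target encoding (e.g. from
iconv_open("ISO-8859-7", "UCS-4BE")) and reuse it between calls.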
> So, I will think a bit more about this and try to find a correct
> solution for 1.5.0. Right now, the simplest solution I can think of is
> to generate the correspondence table between ucs4 and the different
> encodings using iconv and distribute that.
I also thought of that. I don't really like this solution, because not all
iconv implementations behave alike (this was discussed in bugzilla, but I
forgot the number), so a table that is valid for one implementation is not
necessarily valid for another.
Another possibility that avoids this problem is to define the maximum UCS4
code point (and maybe the minimum, too) for each encoding in lib/encodings.
I guess that this would speed up the table generation a lot, since the
exotic code points would not need to be tested.
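For illustration, the pruned generation loop could look roughly like this.
It is only a sketch: min_cp/max_cp and the probe callback are hypothetical,
and nothing here reflects the actual lib/encodings syntax.

// Sketch only: build the ucs4 -> eightbit correspondence table, probing
// just the code points inside the range declared for the encoding.
#include <map>
#include <vector>

typedef std::map<unsigned int, std::vector<char> > CharTable;

// probe() converts one code point and returns the resulting bytes, or an
// empty vector if the code point is not representable in the encoding.
CharTable build_table(unsigned int min_cp, unsigned int max_cp,
        std::vector<char> (*probe)(unsigned int cp))
{
    CharTable table;
    for (unsigned int cp = min_cp; cp <= max_cp; ++cp) {
        std::vector<char> const bytes = probe(cp);
        if (!bytes.empty())
            table[cp] = bytes;
    }
    return table;
}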
> Or generate them on first use
> in Encoding::init().
??? That is exactly what already happens!
Georg