On 26 June 2010 18:05, Owen Shepherd <owen.sheph...@e43.eu> wrote:
> On 26 June 2010 13:47, Michal Suchanek <hramr...@centrum.cz> wrote:
>> On 25 June 2010 21:37, Owen Shepherd <owen.sheph...@e43.eu> wrote:
>>> On 25 June 2010 19:36, Michal Suchanek <hramr...@centrum.cz> wrote:
>>>> On 25 June 2010 20:18, Owen Shepherd <owen.sheph...@e43.eu> wrote:
>>>>> One of the reasons that I'm a fan of SCSU is that, with even a
>>>>> relatively simple encoder, it produces output which is comparable in
>>>>> efficiency to that of most legacy encodings.
>>>>
>>>> SCSU is a horrendous encoding because it uses shifts. When the shift
>>>> is lost the text has completely different meaning. In UTF-8 if you
>>>> remove part of the text only that part is affected (if you cut
>>>> mid-character you create a bad character at worst but it can be
>>>> clearly detected).
>>>
>>> And how often do you lose a couple of bytes in the middle of a file?
>>> More precisely, how often do you lose them and not have a checksum
>>> fail (or some other error) notifying you of this?
>>
>> If the file is a web page then quite often, and it does not have a checksum.
>
> In that case I'd have to question the quality of your networking
> equipment and software. Losing a couple of bytes in the middle of a
> web page is something that should not be possible under TCP (Unless,
> perhaps, one is under attack from a malicious 3rd party, in which case
> a bit of data loss is the least of your worries).
Indeed, in the case of web pages the loss is at the end; parts missing in
the middle are the result of inserting different streams, so SCSU would
not suffer more breakage there than other encodings. Still, there is no
apparent benefit in using it.

>
> And HTML is also a file format with the equivalent of shifts; it just
> calls them tags.

However, most HTML parsers are perfectly capable of parsing incomplete
HTML, because the tags don't change the meaning of the text except when
the text is part of a tag attribute.

>>>
>>> SCSU is of course a poor choice for an in-memory format (Use UTF-16)
>>> or interacting with the console (For backwards compatibility you're
>>> probably going to have to use UTF-8). But for a storage format,
>>> particularly one embedded within a database? It's pretty much perfect.
>>
>> Anybody who suggests to use UTF-16 for anything has no idea about
>> useful encodings in my book. UTF-16 has no advantage whatsoever, only
>> disadvantages.
>
> Would you care to enumerate your points then?
>

- UTF-8 is endianness-independent and null-free; UTF-16 is neither.

- In transport, losing a byte (or a packet with an unknown, possibly odd
  number of bytes) corrupts at most one character of UTF-8, but it may
  misalign the whole rest of a UTF-16 stream.

- UTF-32 is dword-aligned, you can index into it as an array, and every
  position is a code point. UTF-16 has surrogate pairs, so you have to
  decode the whole string to get at the code points.

- I know of no language for which UTF-16 is storage-efficient. For
  languages written in the Latin script, UTF-8 or legacy encodings are
  about twice as efficient. For Cyrillic, legacy encodings are much more
  efficient; I don't know how UTF-16 compares to UTF-8 there. For CJK,
  UTF-16 is about 2/3 the size of UTF-8, but more efficient alternative
  encodings exist and are in widespread use.

(Quick sketches of the misalignment, surrogate and size points are at
the end of this mail.)

If you know of any advantage of UTF-16 then please enlighten me.

Thanks

Michal
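PS: to make the byte-loss point concrete, here is a small Python 3
sketch. The sample string and the position of the dropped byte are
arbitrary; it is only meant to show how each encoding degrades.

text = "příliš žluťoučký kůň"          # any non-ASCII sample will do

for name in ("utf-8", "utf-16-le"):
    data = bytearray(text.encode(name))
    del data[7]                        # simulate losing one byte in transit
    print(name, "->", data.decode(name, errors="replace"))

# Typical outcome: the UTF-8 string comes back with a single U+FFFD
# replacement character and everything after it intact, because each
# UTF-8 byte indicates whether it starts a character, so the stream
# resynchronizes. The UTF-16-LE string is misaligned from the lost
# byte onward, so the rest decodes as unrelated code units.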
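The surrogate-pair point fits in a couple of lines (U+1D11E, the
musical G clef, is just a convenient example of a code point outside
the BMP):

clef = "\U0001D11E"
print(len(clef.encode("utf-32-le")) // 4)   # 1 code unit in UTF-32
print(len(clef.encode("utf-16-le")) // 2)   # 2 code units in UTF-16
# A fixed offset into a UTF-16 buffer can therefore land in the middle
# of a surrogate pair rather than on a code point boundary.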
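And a rough sketch for the size figures. The sample sentences are mine
and purely illustrative; real documents mix in markup, spaces and
digits, which shifts the ratios.

samples = {
    "Latin":    "The quick brown fox jumps over the lazy dog.",
    "Cyrillic": "Съешь же ещё этих мягких французских булок.",
    "CJK":      "我能吞下玻璃而不伤身体。",
}

for script, sample in samples.items():
    u8  = len(sample.encode("utf-8"))
    u16 = len(sample.encode("utf-16-le"))   # -le so no BOM is counted
    print(f"{script:9} utf-8: {u8:3}  utf-16: {u16:3}  utf-16/utf-8: {u16/u8:.2f}")

# Expected pattern: roughly 2x for plain Latin text, about even for
# Cyrillic (letters are 2 bytes either way, spaces and punctuation are
# 1 vs 2), and roughly 2/3 for CJK-heavy text.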