Re: [fossil-users] Mix of UTF-8 and CP1251 (Russian cyrillic) in project

Owen Shepherd Sat, 26 Jun 2010 09:05:13 -0700

On 26 June 2010 13:47, Michal Suchanek <hramr...@centrum.cz> wrote:
> On 25 June 2010 21:37, Owen Shepherd <owen.sheph...@e43.eu> wrote:
>> On 25 June 2010 19:36, Michal Suchanek <hramr...@centrum.cz> wrote:
>>> On 25 June 2010 20:18, Owen Shepherd <owen.sheph...@e43.eu> wrote:
>>>> One of the reasons that I'm a fan of SCSU is that, with even a
>>>> relatively simple encoder, it produces output which is comparable in
>>>> efficiency to that of most legacy encodings.
>>>
>>> SCSU is a horrendous encoding because it uses shifts. When the shift
>>> is lost the text has completely different meaning. In UTF-8 if you
>>> remove part of the text only that part is affected (if you cut
>>> mid-character you create a bad character at worst but it can be
>>> clearly detected).
>>
>> And how often do you lose a couple of bytes in the middle of a file?
>> More precisely, how often do you lose them and not have a checksum
>> fail (or some other error) notifying you of this?
>
> If the file is a web page then quite often, and it does not have a checksum.


In that case I'd have to question the quality of your networking
equipment and software. Losing a couple of bytes in the middle of a
web page is something that should not be possible under TCP (unless,
perhaps, one is under attack from a malicious third party, in which
case a bit of data loss is the least of your worries).

And HTML is also a file format with the equivalent of shifts; it just
calls them tags.

> If the encoding is intended solely for storage then anything that is
> easy to work with would do and SCSU does not seem to particularly
> shine in that area, not compared to more well-known and widespread
> encodings for which tools are more readily available.

When embedded inside some other file format (such as a Fossil
repository), this is a non-issue.

>>
>> It's a particularly egregious complaint in the context of Fossil -
>> where all records are hashed anyway! Additionally, if the same kind of
>> error were to occur to the SQLite file that the repository is
>> contained within, it would probably be trashed irretrievably.
>>
>> Years of experience with binary and other modal file formats (XML and
>> HTML to name two very common) show that this is a complete non-issue.
>
> It is not an issue if the partial data still makes sense which is not
> the case with SCSU shifts which completely change the meaning of the
> rest of the data.
>

And yet we are discussing Fossil here, where the loss of a few bytes
will destroy the repository or abort the sync operation anyway.
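
To illustrate the point about hashing, here is a minimal Python sketch.
It assumes a SHA-1 content hash recorded alongside each record; the
artifact text and the verify() helper are hypothetical and are not
taken from Fossil's source. Dropping two bytes from the middle of the
content makes the check fail, so the damage is noticed before any
decoder ever sees a lost shift.

```python
import hashlib

# Hypothetical artifact content (not real Fossil data).
artifact = "Пример текста в репозитории\n".encode("utf-8")

# The ID recorded when the artifact was stored (assumed to be SHA-1).
recorded_id = hashlib.sha1(artifact).hexdigest()

# Simulate losing two bytes in the middle of the content.
mid = len(artifact) // 2
damaged = artifact[:mid] + artifact[mid + 2:]


def verify(content: bytes, expected_id: str) -> bool:
    """Return True if the content still matches its recorded hash."""
    return hashlib.sha1(content).hexdigest() == expected_id


print("intact :", verify(artifact, recorded_id))
print("damaged:", verify(damaged, recorded_id))
```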

>>
>> SCSU is of course a poor choice for an in-memory format (Use UTF-16)
>> or interacting with the console (For backwards compatibility you're
>> probably going to have to use UTF-8). But for a storage format,
>> particularly one embedded within a database? It's pretty much perfect.
>
> Anybody who suggests to use UTF-16 for anything has no idea about
> useful encodings in my book. UTF-16 has no advantage whatsoever, only
> disadvantages.

Would you care to enumerate your points then?

> SCSU is not that useful for storage compression since fossil already
> uses zlib and it has no other advantages I am aware of.

Deflate compression is only applied to commits. Deflate has
significant overhead, and is inapplicable to smaller pieces of text
(such as commit strings) which can nonetheless contribute
significantly to size. On the other hand, SCSU performs better than
UTF-8 for the vast majority of real-world texts, as has already been
enumerated.
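
To put rough numbers on this, here is a small Python sketch. The
commit comment is a made-up example. Python's standard library has no
SCSU codec, so CP1251 is used as a stand-in: for text that stays within
one Cyrillic window, SCSU should come out at roughly one byte per
character plus a small fixed cost, which is close to the legacy
encoding. Treat the figures as an approximation under that assumption,
not as a measurement of SCSU itself.

```python
import zlib

# Hypothetical short commit comment in Russian (not from this thread).
message = "Исправлена ошибка кодировки"

utf8 = message.encode("utf-8")      # two bytes per Cyrillic letter
cp1251 = message.encode("cp1251")   # one byte per character (stand-in for SCSU)

# zlib wraps the deflate stream in a 2-byte header and a 4-byte Adler-32
# checksum, which matters a lot for strings this short.
deflated = zlib.compress(utf8, 9)

sizes = {
    "utf-8": len(utf8),
    "cp1251": len(cp1251),
    "zlib(utf-8)": len(deflated),
}
for name, size in sizes.items():
    print(f"{name:12} {size} bytes")
```

On a string this short the compressed UTF-8 is still larger than the
single-byte form, because deflate has too little input to find useful
repetition and carries its own framing.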
_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
