RE: More about SCSU (was: Re: A UTF-8 based News Service)
> > SCSU doesn't look very nice to me. The idea is OK but it's just
> > too complicated. Various proposals of encoding differences or XORs
> > between consecutive characters are IMHO technically better: much
> > simpler to implement and work as well.
>
> These differential schemes seem to be the way IDN (internationalized
> domain names) are headed. They are intended for the limited scope of
> domain names that have already passed through nameprep, which performs
> normalization and further limits the range of allowable characters.
> I'm not sure how well the ACEs would perform with arbitrary Unicode
> text. I suppose only testing would answer that question.

Also don't forget they're likely to add some code point reordering. Do we want that too in an alternate scheme? Then is it really that much simpler than SCSU? (Probably; tables for code point reordering are not complex to build. But they may take some effort to optimize, so my guess is that the implementation effort may be roughly the same.)

YA
Re: A UTF-8 based News Service
From: Kevin Bracey <[EMAIL PROTECTED]>

> Much as I love SCSU, and much as my web browser supports it, it's not the
> sort of thing to start encouraging on the wire when there are already
> existing standards to deal with this.

Why not? It can be further compressed by currently existing mechanisms, and if we can exchange dozens of charsets for three (UTF-8, UTF-16, SCSU), I'd say that's a good win, since all three of those have their place.

-- 
David Starner - [EMAIL PROTECTED]
Re: A UTF-8 based News Service
From: Keld Jørn Simonsen <[EMAIL PROTECTED]>

> UTF-16 is not just 2 bytes, it is sometimes 2 and sometimes 4 bytes.
> IETF is recommending UTF-8 as the prime charset in all Internet protocols.

Blah. For his purposes, UTF-16 is 2 bytes. It seems very unlikely that his newspaper will have significant quantities of non-BMP text in the near future. Yes, UTF-8 is the prime charset in all Internet protocols; that doesn't mean they can't use others, it just means that the default should be UTF-8, and in some situations, like the dict protocol, where you don't want to mess with charset negotiation, you can hardwire it to UTF-8. On the web and in email, you have the choice of charsets, and if a non-UTF-8 charset will make your life and the life of your readers easier, go for it.

-- 
David Starner - [EMAIL PROTECTED]
Re: More about SCSU (was: Re: A UTF-8 based News Service)
From: <[EMAIL PROTECTED]>

> None as far as I know, which sort of destroys the whole plan. It would sure
> be nice if MSIE and Navigator started "quietly" supporting SCSU, in the same
> way that they "quietly" (to the average user) began supporting UTF-8.

If you want the code in Navigator, write it up for Mozilla and properly submit it, at which point Mozilla will "quietly" start supporting SCSU, and Navigator will follow suit in a couple of releases.

-- 
David Starner - [EMAIL PROTECTED]
Re: More about SCSU (was: Re: A UTF-8 based News Service)
> Unfortunately, you don't hear much about SCSU, and in particular the Unicode
> Consortium doesn't really seem to promote it much (although they may be
> trying to avoid the "too many UTF's" syndrome).

Probably that's one point. But also, SCSU is a little more complicated to use, and needs to be pretty well negotiated between sender and receiver. It's much less suitable for general-purpose interchange.

Rick
RE: A UTF-8 based News Service
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>
> Raw UTF-8     4,382,592
> Zipped UTF-8  2,264,152  (52% of raw UTF-8)
> Raw SCSU      1,179,688  (27% of raw UTF-8)
> Zipped SCSU     104,316  (9% of raw SCSU, < 5% of zipped UTF-8)

The data set is truly pathological. Since it is in code point order, there are patterns in it which are probably being exploited. Why not download some of the articles from a certain UTF-8 based news website and run them through the tests?

Side note on compression: Specialized compression methods tend to have a bit of maintenance overhead. Generalized compression methods tend, in practice, to be better, because they squeeze extra compression out of data that would otherwise not be worth compressing (need more disk space? don't get a specialized compression routine for your biggest file - just compress the whole drive!).

The general solution for the Internet is IPPC (Internet Protocol Payload Compression), which compresses all IP packets. I am not sure what state of development it is in, but if it is implementable now, I would highly recommend testing its performance on Unicode web data. I expect the results to be comparable to specialized techniques, and you'd get a transparent (IPPC is implemented such that hosts that don't support it are unaffected by it), i.e. low-maintenance, solution.

/|/|ike
Re: A UTF-8 based News Service
In a message dated 2001-07-13 7:00:26 Pacific Daylight Time, [EMAIL PROTECTED] writes:

> Sounds promising! How well does SCSU gzip?

If gzip works anything like PKZIP, the answer is: very well indeed. This is because (using the explanation I have heard before) SCSU retargets Unicode text to an 8-bit model, meaning that for small alphabetic scripts (or medium-sized syllabic scripts like Ethiopic), most characters are represented in one byte, so the information appears 8 bits at a time. Many general-purpose compression algorithms are optimized for that kind of data.

Recently I created a test file of all Unicode characters in code point order (excluding the surrogates, but including all the other non-characters). I will admit up front that this is a pathological test case and real-world data probably won't behave anywhere near the same. Having said that, here are the file sizes of this data expressed in UTF-8 and SCSU, raw and zipped (using PKZIP 4.0):

Raw UTF-8     4,382,592
Zipped UTF-8  2,264,152  (52% of raw UTF-8)
Raw SCSU      1,179,688  (27% of raw UTF-8)
Zipped SCSU     104,316  (9% of raw SCSU, < 5% of zipped UTF-8)

So PKZIP compressed this particular (non-real-world) UTF-8 data by only 48%, but compressed the equivalent SCSU data by a whopping 91%. That's because SCSU puts the data in an 8-bit model, which brings out the best in PKZIP. Gzip may work the same.

Note that real-world data would probably be much more useful in making this comparison than my sequential-order data, which certainly favors SCSU, as it minimizes window switches and creates repetitive patterns. Also note that SCSU compressors differ, and the same data encoded with a different compressor might yield more or fewer than 1,179,688 bytes. I used my own compressor.

-Doug Ewell
 Fullerton, California
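[Editor's note: Doug's "8-bit model" point is easy to sanity-check with zlib, the algorithm underlying gzip. The sketch below is a hedged illustration, not real SCSU: the "one-byte" stream simply collapses each Ethiopic character to its offset within the U+1200 half-block, which mimics SCSU's single-byte mode for text confined to one window. The sample text is invented, not the news-site data from the thread.]

```python
import zlib

# Hypothetical sample: a run of Ethiopic syllables repeated the way
# real prose repeats words. Each syllable is 3 bytes in UTF-8.
text = "ሀለሐመሠረሰሸቀበ " * 500

utf8 = text.encode("utf-8")

# Fake one-byte-per-character model: offset into the U+1200..U+127F
# half-block, mapped to 0x80..0xFF; spaces kept literal. This imitates
# SCSU single-byte mode only, with no window switching.
one_byte = bytes(
    (ord(c) - 0x1200 + 0x80) if c != " " else 0x20 for c in text
)

print(len(utf8), len(zlib.compress(utf8, 9)))
print(len(one_byte), len(zlib.compress(one_byte, 9)))
```

The raw one-byte stream is a third the size of the UTF-8 stream before compression even starts, which is the head start SCSU gives a general-purpose compressor.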
Re: More about SCSU (was: Re: A UTF-8 based News Service)
In a message dated 2001-07-13 4:07:35 Pacific Daylight Time, [EMAIL PROTECTED] writes:

> SCSU doesn't look very nice to me. The idea is OK but it's just
> too complicated. Various proposals of encoding differences or XORs
> between consecutive characters are IMHO technically better: much
> simpler to implement and work as well.

These differential schemes seem to be the way IDN (internationalized domain names) are headed. They are intended for the limited scope of domain names that have already passed through nameprep, which performs normalization and further limits the range of allowable characters. I'm not sure how well the ACEs would perform with arbitrary Unicode text. I suppose only testing would answer that question.

-Doug Ewell
 Fullerton, California
Re: A UTF-8 based News Service
[EMAIL PROTECTED] wrote:
>
> As a test, I downloaded the first article on the page:
>
> http://unicode.ethiozena.net/Gazettas/Kibrit/Archives/1993/Hamle/05/Kibrit.051193.sera.html
>
> The article, dated 1993-05-11, has the formidable title:

Yesterday in the Ethiopian calendar :)

> «p-t negaso gidada wedeTalyan kobelelu teblo yeteseraCew zegeba f`Sum Heset
> new» yeTalyan Embasi

Titles (in markup) remain transliterated, since a number of browsers that support UTF-8 viewing in the page display area do not support it in the "title" area of the browser's application window. Transliterated Ethiopic actually fares better than UTF-8, since consonants can be a single byte, syllables 2 bytes, and diphthongs 3. On average a document might "compress" with transliteration down to 53%. Not so easy on the eyes, though, but useful as a last resort.

> > Encoded in UTF-8, the file was 1891 bytes long. Converted into SCSU, it
> > dropped to 1121 bytes, which is 40% shorter than the UTF-8 version, better
> > than UTF-16, and probably better than any existing legacy encoding for
> > Ethiopic. SCSU is a Good Thing.

Sounds promising! How well does SCSU gzip?

/Daniel
Re: More about SCSU (was: Re: A UTF-8 based News Service)
Fri, 13 Jul 2001 03:01:10 EDT, [EMAIL PROTECTED] <[EMAIL PROTECTED]> writes:

> Unfortunately, you don't hear much about SCSU, and in particular
> the Unicode Consortium doesn't really seem to promote it much
> (although they may be trying to avoid the "too many UTF's" syndrome).

SCSU doesn't look very nice to me. The idea is OK but it's just too complicated. Various proposals of encoding differences or XORs between consecutive characters are IMHO technically better: much simpler to implement and work as well.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/     ^^              SYGNATURA ZASTĘPCZA [substitute signature]
  QRCZAK
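[Editor's note: for readers unfamiliar with the idea Marcin sketches, a difference scheme transmits the gap between consecutive code points, so runs within one script become small numbers that pack into single bytes. The toy below uses a zig-zag/varint pairing; it is an illustration of the general technique, not any specific proposal from the thread.]

```python
def encode_deltas(text):
    """Toy differential coder: each code point is sent as the zig-zag,
    base-128 varint encoding of its difference from the previous one."""
    out, prev = bytearray(), 0
    for ch in text:
        d = ord(ch) - prev
        prev = ord(ch)
        z = (d << 1) if d >= 0 else (-d << 1) - 1  # zig-zag: sign into bit 0
        while z >= 0x80:                           # varint: 7 bits per byte,
            out.append((z & 0x7F) | 0x80)          # high bit = "more follows"
            z >>= 7
        out.append(z)
    return bytes(out)

def decode_deltas(data):
    chars, prev, z, shift = [], 0, 0, 0
    for b in data:
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:                           # last byte of this varint
            d = (z >> 1) if not z & 1 else -((z + 1) >> 1)
            prev += d
            chars.append(chr(prev))
            z, shift = 0, 0
    return "".join(chars)

sample = "ሀሁሂሃሄህሆ and some ASCII"
encoded = encode_deltas(sample)
print(len(sample.encode("utf-8")), len(encoded))
```

Within the Ethiopic run, each delta is 1, so each character costs one byte; the cost is paid only at script transitions, which is also roughly where SCSU pays its window-switch overhead.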
Re: A UTF-8 based News Service
On Fri, Jul 13, 2001 at 02:14:25AM +0100, David Starner wrote:
> > As someone involved in the service I often wish there was some
> > form of "compressed" Unicode encoding. The 3-byte penalty that
> > Ethiopic bears under UTF-8 turns into higher bandwidth that web
> > hosting services meter and charge for by the megabyte. For a
> > popular site this soon makes UTF-8 a costly option to support.
> >
> > A system analogous to iso-8859-x, whereby Ethiopic and other scripts
> > in the 3-byte range could be shifted back into the 2-byte range,
> > might help (generally only English and Ethiopic are desired together).
> >
> > Fortunately there is mod_gzip for Apache. I would appreciate any
> > information about other options.
>
> What about UTF-16? Encode all characters as 2 bytes, and your problem is
> solved, and UTF-16 should be supported by all recent Unicode-supporting web
> browsers.

UTF-16 is not just 2 bytes; it is sometimes 2 and sometimes 4 bytes.

The IETF is recommending UTF-8 as the prime charset in all Internet protocols.

Kind regards
Keld
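[Editor's note: Keld's size point is easy to verify. BMP characters, including all of Ethiopic, take 2 bytes in UTF-16, while anything above U+FFFF takes a 4-byte surrogate pair; the characters below are arbitrary examples.]

```python
# Byte counts per character in UTF-16 (little-endian, no BOM) vs UTF-8.
ethiopic = "ሀ"     # U+1200, in the BMP
gothic = "𐌰"       # U+10330, outside the BMP (Gothic letter ahsa)

print(len(ethiopic.encode("utf-8")), len(ethiopic.encode("utf-16-le")))  # 3 2
print(len(gothic.encode("utf-8")), len(gothic.encode("utf-16-le")))      # 4 4
```

So for Ethiopic, UTF-16 really does win a byte per character over UTF-8, and non-BMP characters cost 4 bytes in either encoding.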
Re: A UTF-8 based News Service
In message <[EMAIL PROTECTED]> [EMAIL PROTECTED] wrote:
>
> Encoded in UTF-8, the file was 1891 bytes long. Converted into SCSU, it
> dropped to 1121 bytes, which is 40% shorter than the UTF-8 version, better
> than UTF-16, and probably better than any existing legacy encoding for
> Ethiopic. SCSU is a Good Thing.
>

Much as I love SCSU, and much as my web browser supports it, it's not the sort of thing to start encouraging on the wire when there are already existing standards to deal with this. Using HTTP transfer encoding to deflate the data being transferred will work well on most browsers, and is implemented by all good web servers.

A brief test shows deflate can compress it down to 1027 bytes (although I had the original size as 2201 bytes).

-- 
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc                     Tel: +44 (0) 1223 518566
645 Newmarket Road                            Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom            WWW: http://www.pace.co.uk/
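[Editor's note: the kind of test Kevin describes can be approximated with zlib, whose format underlies HTTP's deflate-based content codings. The page content below is a hypothetical stand-in, not the actual article, so the byte counts will differ from his.]

```python
import zlib

# Hypothetical stand-in for a downloaded UTF-8 Ethiopic page.
page = (
    "<html><title>Kibrit</title><body>" + "ሀለሐመ " * 100 + "</body></html>"
).encode("utf-8")

deflated = zlib.compress(page, 9)
print(len(page), len(deflated))

# The transfer is lossless: the client inflates back to the exact bytes.
assert zlib.decompress(deflated) == page
```

Because the compression happens at the HTTP layer, the document itself can stay in plain UTF-8, which is the interoperability point Kevin is making.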
Re: More about SCSU (was: Re: A UTF-8 based News Service)
In a message dated 2001-07-12 22:55:09 Pacific Daylight Time, [EMAIL PROTECTED] writes:

>> SCSU is also registered as an IANA charset, although you are
>> unlikely to find raw SCSU text on the Internet, due to its use
>> of control characters (bytes below 0x20).
>
> And what browser supports SCSU, and what is that browser's reach in terms of
> population? Because that's usually what matters to people that publish on
> the Internet.

None as far as I know, which sort of destroys the whole plan. It would sure be nice if MSIE and Navigator started "quietly" supporting SCSU, in the same way that they "quietly" (to the average user) began supporting UTF-8.

Unfortunately, you don't hear much about SCSU, and in particular the Unicode Consortium doesn't really seem to promote it much (although they may be trying to avoid the "too many UTF's" syndrome).

-Doug Ewell
 Fullerton, California
Re: A UTF-8 based News Service
In a message dated 2001-07-12 8:27:20 Pacific Daylight Time, [EMAIL PROTECTED] writes:

> The Ethiopian News Headlines has relocated to a new server at
> http://www.ethiozena.net/ and is making it easier than ever to
> read news headlines in Unicode. A companion Unicode-only server
> is launched at http://unicode.ethiozena.net/ which serves
> articles in UTF-8 encoding only.

As a test, I downloaded the first article on the page:

http://unicode.ethiozena.net/Gazettas/Kibrit/Archives/1993/Hamle/05/Kibrit.051193.sera.html

The article, dated 1993-05-11, has the formidable title:

«p-t negaso gidada wedeTalyan kobelelu teblo yeteseraCew zegeba f`Sum Heset new» yeTalyan Embasi

Encoded in UTF-8, the file was 1891 bytes long. Converted into SCSU, it dropped to 1121 bytes, which is 40% shorter than the UTF-8 version, better than UTF-16, and probably better than any existing legacy encoding for Ethiopic. SCSU is a Good Thing.

-Doug Ewell
 Fullerton, California
RE: More about SCSU (was: Re: A UTF-8 based News Service)
> SCSU is also registered as an IANA charset, although you are
> unlikely to find raw SCSU text on the Internet, due to its use
> of control characters (bytes below 0x20).

And what browser supports SCSU, and what is that browser's reach in terms of population? Because that's usually what matters to people that publish on the Internet.

YA
Re: A UTF-8 based News Service
> As someone involved in the service I often wish there was some
> form of "compressed" Unicode encoding. The 3-byte penalty that
> Ethiopic bears under UTF-8 turns into higher bandwidth that web
> hosting services meter and charge for by the megabyte. For a
> popular site this soon makes UTF-8 a costly option to support.
>
> A system analogous to iso-8859-x, whereby Ethiopic and other scripts
> in the 3-byte range could be shifted back into the 2-byte range,
> might help (generally only English and Ethiopic are desired together).
>
> Fortunately there is mod_gzip for Apache. I would appreciate any
> information about other options.

What about UTF-16? Encode all characters as 2 bytes, and your problem is solved, and UTF-16 should be supported by all recent Unicode-supporting web browsers.

-- 
David Starner - [EMAIL PROTECTED]
More about SCSU (was: Re: A UTF-8 based News Service)
I should have also mentioned that SCSU is fully supported by the programming toolkit ICU (International Components for Unicode), found at:

http://oss.software.ibm.com/icu/

An Open Source project, ICU is available for free and comes with voluminous documentation.

SCSU is also registered as an IANA charset, although you are unlikely to find raw SCSU text on the Internet, due to its use of control characters (bytes below 0x20).

Hope this helps.

-Doug Ewell
 Fullerton, California
Re: A UTF-8 based News Service
In a message dated 2001-07-12 8:27:20 Pacific Daylight Time, [EMAIL PROTECTED] writes:

> As someone involved in the service I often wish there was some
> form of "compressed" Unicode encoding. The 3-byte penalty that
> Ethiopic bears under UTF-8 turns into higher bandwidth that web
> hosting services meter and charge for by the megabyte. For a
> popular site this soon makes UTF-8 a costly option to support.
>
> A system analogous to iso-8859-x, whereby Ethiopic and other scripts
> in the 3-byte range could be shifted back into the 2-byte range,
> might help (generally only English and Ethiopic are desired together).

Today is your lucky day. Check out Unicode Technical Standard #6, "A Standard Compression Scheme for Unicode":

http://www.unicode.org/unicode/reports/tr6/

SCSU uses 128-character windows to compress small alphabetic scripts to almost 1 byte per character. Since Ethiopic occupies three 128-character half-blocks, SCSU must use three windows and switch between them, but the overhead is still much lower than UTF-8's. In the worst case (each character belongs to a different half-block than the one before), you will still use only 2 bytes per character.

SCSU is fully supported by SC UniPad, a Unicode text editor that is currently available for free. For more information, visit:

http://www.unipad.org/

-Doug Ewell
 Fullerton, California
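[Editor's note: the window mechanics can be sketched concretely. Per UTS #6, tag byte 0x18 (SD0) defines dynamic window 0 using the index byte that follows, and index 0x24 places the window at offset 0x24 × 0x80 = U+1200; after that, each character in the U+1200-U+127F half-block is one byte in 0x80-0xFF, while ASCII passes through as itself. The toy one-way encoder below handles only that single window, nothing like a conforming SCSU implementation.]

```python
def toy_scsu_ethiopic(text):
    """Toy SCSU sketch: ASCII plus the U+1200..U+127F half-block only,
    via dynamic window 0. Real SCSU has 8 windows, Unicode mode, and
    many more tags (see UTS #6)."""
    SD0, WINDOW_INDEX = 0x18, 0x24        # 0x24 * 0x80 == 0x1200
    out = bytearray([SD0, WINDOW_INDEX])  # define + activate window 0
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x80:             # printable ASCII passes through
            out.append(cp)
        elif 0x1200 <= cp <= 0x127F:      # in the active window: one byte
            out.append(cp - 0x1200 + 0x80)
        else:
            raise ValueError("outside this toy encoder's repertoire")
    return bytes(out)

encoded = toy_scsu_ethiopic("ሀሁሂ abc")
print(len("ሀሁሂ abc".encode("utf-8")), len(encoded))  # 13 vs 9
```

The 2-byte window definition is paid once, after which the Ethiopic text runs at one byte per character, which is where the savings Doug describes come from.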