RE: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-21 Thread Yves Arrouye

   SCSU doesn't look very nice to me. The idea is OK, but it's just
   too complicated. Various proposals for encoding differences or XORs
   between consecutive characters are IMHO technically better: much
   simpler to implement, and they work just as well.
 
 These differential schemes seem to be the way IDN (internationalized
 domain names) are headed.  They are intended for the limited scope of
 domain names that have already passed through nameprep, which performs
 normalization and further limits the range of allowable characters.
 I'm not sure how well the ACEs would perform with arbitrary Unicode
 text.  I suppose only testing would answer that question.

Also, don't forget they're likely to add some code point reordering. Do we
want that in an alternate scheme too? Then is it really that much simpler
than SCSU? (Probably; tables for code point reordering are not complex to
build. But they may take some effort to optimize, so my guess is that the
implementation effort would be roughly the same.)

YA




RE: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread Yves Arrouye

 SCSU is also registered as an IANA charset, although you are unlikely
 to find raw SCSU text on the Internet, due to its use of control
 characters (bytes below 0x20).

And what browser supports SCSU, and what is that browser's reach in terms of
population? Because that's usually what matters to people who publish on
the Internet.

YA
 




Re: A UTF-8 based News Service

2001-07-13 Thread DougEwell2

In a message dated 2001-07-12 8:27:20 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  The Ethiopian News Headlines has relocated to a new server at
  http://www.ethiozena.net/ and is making it easier than ever to
  read news headlines in Unicode.  A companion Unicode only server
  is launched at http://unicode.ethiozena.net/ which serves
  articles in UTF-8 encoding only.

As a test, I downloaded the first article on the page:

http://unicode.ethiozena.net/Gazettas/Kibrit/Archives/1993/Hamle/05/Kibrit.051193.sera.html

The article, dated 1993-05-11, has the formidable title:

«p-t negaso gidada wedeTalyan kobelelu teblo yeteseraCew zegeba f`Sum Heset 
new» yeTalyan Embasi

Encoded in UTF-8, the file was 1891 bytes long.  Converted into SCSU, it 
dropped to 1121 bytes, which is 40% shorter than the UTF-8 version, better 
than UTF-16, and probably better than any existing legacy encoding for 
Ethiopic.  SCSU is a Good Thing.

-Doug Ewell
 Fullerton, California




Re: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread DougEwell2

In a message dated 2001-07-12 22:55:09 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

 SCSU is also registered as an IANA charset, although you are unlikely
 to find raw SCSU text on the Internet, due to its use of control
 characters (bytes below 0x20).

  And what browser supports SCSU, and what is that browser's reach in
  terms of population? Because that's usually what matters to people
  who publish on the Internet.

None as far as I know, which sort of destroys the whole plan.  It would sure 
be nice if MSIE and Navigator started quietly supporting SCSU, in the same 
way that they quietly (to the average user) began supporting UTF-8.

Unfortunately, you don't hear much about SCSU, and in particular the Unicode 
Consortium doesn't really seem to promote it much (although they may be 
trying to avoid the "too many UTF's" syndrome).

-Doug Ewell
 Fullerton, California




Re: A UTF-8 based News Service

2001-07-13 Thread Kevin Bracey

In message [EMAIL PROTECTED]
  [EMAIL PROTECTED] wrote:

 
 Encoded in UTF-8, the file was 1891 bytes long.  Converted into SCSU, it 
 dropped to 1121 bytes, which is 40% shorter than the UTF-8 version, better 
 than UTF-16, and probably better than any existing legacy encoding for 
 Ethiopic.  SCSU is a Good Thing.
 

Much as I love SCSU, and much as my web browser supports it, it's not the
sort of thing to start encouraging on the wire when there are already
existing standards to deal with this.

Using HTTP transfer encoding to deflate the data being transferred will work
well on most browsers, and is implemented by all good webservers. A brief
test shows deflate can compress it down to 1027 bytes (although I had the
original size as 2201 bytes).
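
Roughly, the negotiation looks like this (standard HTTP/1.1 headers; the
request line is just a placeholder):

    GET /Gazettas/Kibrit/... HTTP/1.1
    Host: unicode.ethiozena.net
    Accept-Encoding: gzip, deflate

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    Content-Encoding: gzip

The browser advertises what it can decompress, and the server compresses
the body only when asked, so older browsers keep getting plain UTF-8.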

-- 
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc             Tel: +44 (0) 1223 518566
645 Newmarket Road                    Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom    WWW: http://www.pace.co.uk/




Re: A UTF-8 based News Service

2001-07-13 Thread Keld Jørn Simonsen

On Fri, Jul 13, 2001 at 02:14:25AM +0100, David Starner wrote:
  As someone involved in the service I often wish there was some
  form of compressed Unicode encoding.  The 3-byte penalty that
  Ethiopic bears under UTF-8 turns into higher bandwidth that web
  hosting services meter and charge for by the megabyte.  For a
  popular site this soon makes UTF-8 a costly option to support.
 
  A system analogous to iso-8859-x, whereby Ethiopic and other scripts
  in the 3-byte range could be shifted back into the 2-byte range,
  might help (generally only English and Ethiopic are desired together).
 
  Fortunately there is mod_gzip for Apache.  I would appreciate any
  information about other options.
 
 What about UTF-16? Encode all characters as 2 bytes, and your problem is
 solved, and UTF-16 should be supported by all recent Unicode-supporting web
 browsers.

UTF-16 is not always 2 bytes; it is sometimes 2 and sometimes 4 bytes.
The IETF recommends UTF-8 as the primary charset in all Internet protocols.

Kind regards
Keld




Re: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread Marcin 'Qrczak' Kowalczyk

Fri, 13 Jul 2001 03:01:10 EDT, [EMAIL PROTECTED] [EMAIL PROTECTED] writes:

 Unfortunately, you don't hear much about SCSU, and in particular
 the Unicode Consortium doesn't really seem to promote it much
 (although they may be trying to avoid the "too many UTF's" syndrome).

SCSU doesn't look very nice to me. The idea is OK, but it's just
too complicated. Various proposals for encoding differences or XORs
between consecutive characters are IMHO technically better: much
simpler to implement, and they work just as well.
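
To make the idea concrete, here is a minimal sketch of one such
differential scheme (my own illustration, not any of the actual
proposals): each code point becomes a zigzag-coded delta from the
previous one, written out as a 7-bits-per-byte varint.

    def encode(text):
        out = bytearray()
        prev = 0
        for ch in text:
            delta = ord(ch) - prev
            prev = ord(ch)
            # zigzag: fold the sign into the low bit, so small negative
            # deltas stay small
            z = 2 * delta if delta >= 0 else -2 * delta - 1
            while z >= 0x80:          # varint: high bit means "more bytes"
                out.append(0x80 | (z & 0x7F))
                z >>= 7
            out.append(z)
        return bytes(out)

Text that stays inside one script produces small deltas and hence mostly
one byte per character; every jump between scripts costs an extra byte
or two, much as SCSU pays for a window switch.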

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  PLACEHOLDER SIGNATURE
QRCZAK





Re: A UTF-8 based News Service

2001-07-13 Thread Daniel Yacob

[EMAIL PROTECTED] wrote:
 
 As a test, I downloaded the first article on the page:
 
 http://unicode.ethiozena.net/Gazettas/Kibrit/Archives/1993/Hamle/05/Kibrit.051193.sera.html
 
 The article, dated 1993-05-11, has the formidable title:
 

Yesterday in the Ethiopian calendar :) (insert favorite Y2K joke here)



 «p-t negaso gidada wedeTalyan kobelelu teblo yeteseraCew zegeba f`Sum Heset
 new» yeTalyan Embasi
 

Titles (in title markups) remain transliterated, since a number of
browsers that support UTF-8 viewing in the page display area do not in
the title area of the browser's application window.  Transliterated
Ethiopic actually fares better than UTF-8, since consonants can be a
single byte, syllables 2 bytes, and diphthongs 3.  On average a document
might compress with transliteration down to 53%.  Not so easy on the
eyes, but useful as a last resort.



 Encoded in UTF-8, the file was 1891 bytes long.  Converted into SCSU, it
 dropped to 1121 bytes, which is 40% shorter than the UTF-8 version, better
 than UTF-16, and probably better than any existing legacy encoding for
 Ethiopic.  SCSU is a Good Thing.


Sounds promising!  How well does SCSU gzip?

/Daniel




Re: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread DougEwell2

In a message dated 2001-07-13 4:07:35 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  SCSU doesn't look very nice to me. The idea is OK, but it's just
  too complicated. Various proposals for encoding differences or XORs
  between consecutive characters are IMHO technically better: much
  simpler to implement, and they work just as well.

These differential schemes seem to be the way IDN (internationalized domain 
names) are headed.  They are intended for the limited scope of domain names 
that have already passed through nameprep, which performs normalization and 
further limits the range of allowable characters.  I'm not sure how well the 
ACEs would perform with arbitrary Unicode text.  I suppose only testing would 
answer that question.

-Doug Ewell
 Fullerton, California




Re: A UTF-8 based News Service

2001-07-13 Thread DougEwell2

In a message dated 2001-07-13 7:00:26 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  Sounds promising!  How well does SCSU gzip?

If gzip works anything like PKZIP, the answer is, very well indeed.  This is 
because (using the explanation I have heard before) SCSU retargets Unicode 
text to an 8-bit model, meaning that for small alphabetic scripts (or 
medium-sized syllabic scripts like Ethiopic), most characters are represented 
in one byte, so the information appears 8 bits at a time.  Many 
general-purpose compression algorithms are optimized for that kind of data.
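
For anyone who wants to repeat the measurement, here is a sketch using
Python's zlib module, which implements the same DEFLATE algorithm as gzip
and modern PKZIP.  The file name is a placeholder, and scsu_encode is a
stand-in for whatever SCSU encoder you have (ICU provides one; Python
itself ships none):

    import zlib

    def report(label, data):
        packed = zlib.compress(data, 9)
        print(f"{label}: {len(data):>9,} raw, {len(packed):>9,} deflated")

    text = open("sample.html", encoding="utf-8").read()  # any test file
    report("UTF-8", text.encode("utf-8"))
    report("SCSU ", scsu_encode(text))  # hypothetical SCSU encoder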

Recently I created a test file of all Unicode characters in code point order 
(excluding the surrogate code points, but including the noncharacters).  I 
will admit up front that this is a pathological test case and real-world data 
probably won't behave anywhere near the same.  Having said that, here are the 
file sizes of this data expressed in UTF-8 and SCSU, raw and zipped (using 
PKZIP 4.0):

Raw UTF-8      4,382,592
Zipped UTF-8   2,264,152  (52% of raw UTF-8)
Raw SCSU       1,179,688  (27% of raw UTF-8)
Zipped SCSU      104,316  ( 9% of raw SCSU,  5% of zipped UTF-8)

So PKZIP compressed this particular (non-real-world) UTF-8 data by only 48%, 
but compressed the equivalent SCSU data by a whopping 91%.  That's because 
SCSU puts the data in an 8-bit model, which brings out the best in PKZIP.  
Gzip may work the same.

Note that real-world data would probably be much more useful in making this 
comparison than my sequential-order data, which certainly favors SCSU as it 
minimizes window switches and creates repetitive patterns.  Also note that 
SCSU compressors differ, and the same data encoded with a different 
compressor might yield more or fewer than 1,179,688 bytes.  I used my own 
compressor.

-Doug Ewell
 Fullerton, California




RE: A UTF-8 based News Service

2001-07-13 Thread Ayers, Mike


 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 

 Raw UTF-8      4,382,592
 Zipped UTF-8   2,264,152  (52% of raw UTF-8)
 Raw SCSU       1,179,688  (27% of raw UTF-8)
 Zipped SCSU      104,316  ( 9% of raw SCSU,  5% of zipped UTF-8)

        The data set is truly pathological.  Since it is in code point
order, there are patterns in it which are probably being exploited.
Why not download some of the articles from a certain UTF-8 based news
website and run them through the tests?

        Side note on compression:  Specialized compression methods tend to
have a bit of maintenance overhead.  Generalized compression methods tend,
in practice, to be better, because they squeeze extra compression out of
data that would otherwise not be worth compressing (need more disk space?
don't get a specialized compression routine for your biggest file - just
compress the whole drive!).  The general solution for the Internet is
IPComp (the IP Payload Compression Protocol), which compresses all IP
packets.  I am not sure what state of development it is in, but if it is
implementable now, I would highly recommend testing its performance on
Unicode web data.  I expect the results to be comparable to specialized
techniques, and you'd get a transparent, i.e. low-maintenance, solution
(IPComp is implemented such that hosts that don't support it are
unaffected by it).


/|/|ike




Re: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread Rick McGowan

 Unfortunately, you don't hear much about SCSU, and in particular the Unicode 
 Consortium doesn't really seem to promote it much (although they may be 
 trying to avoid the "too many UTF's" syndrome).

Probably that's one point.  But also, SCSU is a little more complicated
to use, and needs to be pretty well negotiated between sender and
receiver.  It's much less suitable for general-purpose interchange.

Rick





Re: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread David Starner

From: [EMAIL PROTECTED]
 None as far as I know, which sort of destroys the whole plan.  It would
 sure be nice if MSIE and Navigator started quietly supporting SCSU, in
 the same way that they quietly (to the average user) began supporting
 UTF-8.

If you want the code in Navigator, write it up for Mozilla and properly
submit it, at which point Mozilla will quietly start supporting SCSU, and
Navigator will follow suit in a couple releases.

--
David Starner - [EMAIL PROTECTED]





Re: A UTF-8 based News Service

2001-07-13 Thread David Starner

From: Keld Jørn Simonsen [EMAIL PROTECTED]
 UTF-16 is not always 2 bytes; it is sometimes 2 and sometimes 4 bytes.
 The IETF recommends UTF-8 as the primary charset in all Internet protocols.

Blah. For his purposes, UTF-16 is 2 bytes; the odds that his newspaper
will have significant quantities of non-BMP text in the near future seem
very low. Yes, UTF-8 is the prime charset in all Internet protocols; that
doesn't mean they can't use others. It just means that the default should
be UTF-8, and that in some situations, like the dict protocol, where you
don't want to mess with charset negotiation, you can hardwire it to
UTF-8. In web and email you have a choice of charsets, and if a non-UTF-8
charset will make your life and the life of your readers easier, go for it.
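
For the web case, declaring the choice is one header line (a standard
HTTP/1.1 response header; the charset value is just an example):

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-16

The meta-tag equivalent works too, but only for charsets the parser can
already read before it reaches the tag, so for UTF-16 the HTTP header is
the reliable place:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">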

--
David Starner - [EMAIL PROTECTED]





Re: A UTF-8 based News Service

2001-07-12 Thread DougEwell2

In a message dated 2001-07-12 8:27:20 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  As someone involved in the service I often wish there was some
  form of compressed Unicode encoding.  The 3-byte penalty that
  Ethiopic bears under UTF-8 turns into higher bandwidth that web
  hosting services meter and charge for by the megabyte.  For a
  popular site this soon makes UTF-8 a costly option to support.

  A system analogous to iso-8859-x, whereby Ethiopic and other scripts
  in the 3-byte range could be shifted back into the 2-byte range,
  might help (generally only English and Ethiopic are desired together).

Today is your lucky day.  Check out Unicode Technical Standard #6, "A 
Standard Compression Scheme for Unicode":

http://www.unicode.org/unicode/reports/tr6/

SCSU uses 128-character windows to compress small alphabetic scripts to 
almost 1 byte per character.  Since Ethiopic occupies three 128-character 
half-blocks, SCSU must use three windows and switch between them, but the 
overhead is still much lower than UTF-8's.  In the worst case (each 
character belongs to a different half-block than the one before), you will 
still use only 2 bytes per character.
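
To make the window mechanics concrete, here is a stripped-down encoder
sketch in Python.  The tag values (SDn, SCn) and the window-offset byte
are as in UTS #6, but this handles only ASCII pass-through plus window
define/switch; it is an illustration of the idea, not a conforming
implementation:

    SD0 = 0x18  # SDn = 0x18 + n: define dynamic window n, make it active
    SC0 = 0x10  # SCn = 0x10 + n: switch to already-defined window n

    def toy_scsu(text):
        out = bytearray()
        windows = {}   # half-block offset -> window number (0..7)
        active = None
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:          # ASCII passes through unchanged
                out.append(cp)     # (real SCSU reserves some C0 bytes as tags)
                continue
            offset = cp & ~0x7F    # start of this 128-character half-block
            if offset != active:
                if offset in windows:
                    out.append(SC0 + windows[offset])     # 1-byte switch
                else:
                    n = len(windows)  # naive: assumes <= 8 half-blocks used
                    windows[offset] = n
                    out += bytes([SD0 + n, offset >> 7])  # 2-byte define
                active = offset
            out.append(0x80 + (cp - offset))  # 1 byte in the active window
        return bytes(out)

On text like the Ethiopic article above, each of the three half-blocks
costs two bytes the first time it appears and one byte per switch
thereafter; every character in the active window is a single byte, which
is exactly where the 2-bytes-per-character worst case comes from.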

SCSU is fully supported by SC UniPad, a Unicode text editor that is currently 
available for free.  For more information, visit:

http://www.unipad.org/

-Doug Ewell
 Fullerton, California




More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-12 Thread DougEwell2

I should have also mentioned that SCSU is fully supported by the programming 
toolkit ICU (International Components for Unicode), found at:

http://oss.software.ibm.com/icu/

An Open Source project, ICU is available for free and comes with voluminous 
documentation.

SCSU is also registered as an IANA charset, although you are unlikely to find 
raw SCSU text on the Internet, due to its use of control characters (bytes 
below 0x20).

Hope this helps.

-Doug Ewell
 Fullerton, California




Re: A UTF-8 based News Service

2001-07-12 Thread David Starner

 As someone involved in the service I often wish there was some
 form of compressed Unicode encoding.  The 3-byte penalty that
 Ethiopic bears under UTF-8 turns into higher bandwidth that web
 hosting services meter and charge for by the megabyte.  For a
 popular site this soon makes UTF-8 a costly option to support.

 A system analogous to iso-8859-x, whereby Ethiopic and other scripts
 in the 3-byte range could be shifted back into the 2-byte range,
 might help (generally only English and Ethiopic are desired together).

 Fortunately there is mod_gzip for Apache.  I would appreciate any
 information about other options.

What about UTF-16? Encode all characters as 2 bytes, and your problem is
solved, and UTF-16 should be supported by all recent Unicode-supporting web
browsers.
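
If you want to see what each choice costs for a given page before
committing, the raw sizes are a couple of lines of Python (the file name
is a placeholder for any saved article):

    text = open("sample.html", encoding="utf-8").read()
    for codec in ("utf-8", "utf-16-be"):
        print(codec, len(text.encode(codec)), "bytes")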

--
David Starner - [EMAIL PROTECTED]