RE: More about SCSU (was: Re: A UTF-8 based News Service)
> > SCSU doesn't look very nice to me. The idea is OK but it's just
> > too complicated. Various proposals of encoding differences or XORs
> > between consecutive characters are IMHO technically better: much
> > simpler to implement and work as well.
>
> These differential schemes seem to be the way IDN (internationalized
> domain names) are headed. They are intended for the limited scope of
> domain names that have already passed through nameprep, which performs
> normalization and further limits the range of allowable characters.
> I'm not sure how well the ACEs would perform with arbitrary Unicode
> text. I suppose only testing would answer that question.

Also don't forget they're likely to add some code point reordering. Do we want that too in an alternate scheme? Then is it really that much simpler than SCSU? (Probably; tables for code point reordering are not complex to build. But they may take some effort to optimize, so my guess is that the implementation effort may be roughly the same.)

YA
Re: A UTF-8 based News Service
From: Kevin Bracey <[EMAIL PROTECTED]>

> Much as I love SCSU, and much as my web browser supports it, it's not the
> sort of thing to start encouraging on the wire when there are already
> existing standards to deal with this.

Why not? It can be further compressed by currently existing mechanisms, and if we can exchange dozens of charsets for three (UTF-8, UTF-16, SCSU), I'd say that's a good win, since all three of those have their place.

-- 
David Starner - [EMAIL PROTECTED]
Re: A UTF-8 based News Service
From: Keld Jørn Simonsen <[EMAIL PROTECTED]>

> UTF-16 is not just 2 bytes, it is sometimes 2 and sometimes 4 bytes.
> IETF is recommending UTF-8 as the prime charset in all Internet protocols.

Blah. For his purposes, UTF-16 is 2 bytes. It seems very unlikely that his newspaper will have significant quantities of non-BMP text in the near future. Yes, UTF-8 is the prime charset in all Internet protocols; that doesn't mean they can't use others, it just means that the default should be UTF-8, and in some situations, like the dict protocol, where you don't want to mess with charset negotiation, you can hardwire it to UTF-8. On the web and in email, you have the choice of charsets, and if a non-UTF-8 charset will make your life and the life of your readers easier, go for it.

-- 
David Starner - [EMAIL PROTECTED]
Re: More about SCSU (was: Re: A UTF-8 based News Service)
From: <[EMAIL PROTECTED]>

> None as far as I know, which sort of destroys the whole plan. It would sure
> be nice if MSIE and Navigator started "quietly" supporting SCSU, in the same
> way that they "quietly" (to the average user) began supporting UTF-8.

If you want the code in Navigator, write it up for Mozilla and properly submit it, at which point Mozilla will "quietly" start supporting SCSU, and Navigator will follow suit in a couple of releases.

-- 
David Starner - [EMAIL PROTECTED]
Re: More about SCSU (was: Re: A UTF-8 based News Service)
> Unfortunately, you don't hear much about SCSU, and in particular the Unicode
> Consortium doesn't really seem to promote it much (although they may be
> trying to avoid the "too many UTF's" syndrome).

Probably that's one point. But also, SCSU is a little more complicated to use, and needs to be pretty well negotiated between sender and receiver. It's much less suitable for general-purpose interchange.

Rick
RE: A UTF-8 based News Service
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>
> Raw UTF-8     4,382,592
> Zipped UTF-8  2,264,152  (52% of raw UTF-8)
> Raw SCSU      1,179,688  (27% of raw UTF-8)
> Zipped SCSU     104,316  (9% of raw SCSU, < 5% of zipped UTF-8)

The data set is truly pathological. Since it is in code point order, there are patterns in it which are probably being exploited. Why not download some of the articles from a certain UTF-8 based news website and run them through the tests?

Side note on compression: Specialized compression methods tend to have a bit of maintenance overhead. Generalized compression methods tend, in practice, to be better, because they squeeze extra compression out of data that would otherwise not be worth compressing (need more disk space? don't get a specialized compression routine for your biggest file - just compress the whole drive!).

The general solution for the Internet is IPPC (Internet Protocol Payload Compression), which compresses all IP packets. I am not sure what state of development it is in, but if it is implementable now, I would highly recommend testing its performance on Unicode web data. I expect the results to be comparable to specialized techniques, and you'd get a transparent (IPPC is implemented such that hosts that don't support it are unaffected by it), i.e. low-maintenance, solution.

/|/|ike
Re: A UTF-8 based News Service
In a message dated 2001-07-13 7:00:26 Pacific Daylight Time, [EMAIL PROTECTED] writes:

> Sounds promising! How well does SCSU gzip?

If gzip works anything like PKZIP, the answer is: very well indeed. This is because (using the explanation I have heard before) SCSU retargets Unicode text to an 8-bit model, meaning that for small alphabetic scripts (or medium-sized syllabic scripts like Ethiopic), most characters are represented in one byte, so the information appears 8 bits at a time. Many general-purpose compression algorithms are optimized for that kind of data.

Recently I created a test file of all Unicode characters in code point order (excluding the surrogates, but including all the other non-characters). I will admit up front that this is a pathological test case and real-world data probably won't behave anywhere near the same. Having said that, here are the file sizes of this data expressed in UTF-8 and SCSU, raw and zipped (using PKZIP 4.0):

Raw UTF-8     4,382,592
Zipped UTF-8  2,264,152  (52% of raw UTF-8)
Raw SCSU      1,179,688  (27% of raw UTF-8)
Zipped SCSU     104,316  (9% of raw SCSU, < 5% of zipped UTF-8)

So PKZIP compressed this particular (non-real-world) UTF-8 data by only 48%, but compressed the equivalent SCSU data by a whopping 91%. That's because SCSU puts the data in an 8-bit model, which brings out the best in PKZIP. Gzip may work the same.

Note that real-world data would probably be much more useful in making this comparison than my sequential-order data, which certainly favors SCSU, as it minimizes window switches and creates repetitive patterns. Also note that SCSU compressors differ, and the same data encoded with a different compressor might yield more or fewer than 1,179,688 bytes. I used my own compressor.

-Doug Ewell
 Fullerton, California
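[Editor's note: Doug's "8-bit model" point is easy to sanity-check with zlib, the algorithm underlying gzip. The sketch below is a hedged illustration, not real SCSU: the "one-byte" stream simply collapses each Ethiopic character to its offset within the U+1200 half-block, which mimics SCSU's single-byte mode for text confined to one window. The sample text is invented, not the news-site data from the thread.]

```python
import zlib

# Hypothetical sample: a run of Ethiopic syllables repeated the way
# real prose repeats words. Each syllable is 3 bytes in UTF-8.
text = "ሀለሐመሠረሰሸቀበ " * 500

utf8 = text.encode("utf-8")

# Fake one-byte-per-character model: offset into the U+1200..U+127F
# half-block, mapped to 0x80..0xFF; spaces kept literal. This imitates
# SCSU single-byte mode only, with no window switching.
one_byte = bytes(
    (ord(c) - 0x1200 + 0x80) if c != " " else 0x20 for c in text
)

print(len(utf8), len(zlib.compress(utf8, 9)))
print(len(one_byte), len(zlib.compress(one_byte, 9)))
```

The raw one-byte stream is a third the size of the UTF-8 stream before compression even starts, which is the head start SCSU gives a general-purpose compressor.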
Re: More about SCSU (was: Re: A UTF-8 based News Service)
In a message dated 2001-07-13 4:07:35 Pacific Daylight Time, [EMAIL PROTECTED] writes:

> SCSU doesn't look very nice to me. The idea is OK but it's just
> too complicated. Various proposals of encoding differences or XORs
> between consecutive characters are IMHO technically better: much
> simpler to implement and work as well.

These differential schemes seem to be the way IDN (internationalized domain names) are headed. They are intended for the limited scope of domain names that have already passed through nameprep, which performs normalization and further limits the range of allowable characters. I'm not sure how well the ACEs would perform with arbitrary Unicode text. I suppose only testing would answer that question.

-Doug Ewell
 Fullerton, California
Re: A UTF-8 based News Service
[EMAIL PROTECTED] wrote:
>
> As a test, I downloaded the first article on the page:
>
> http://unicode.ethiozena.net/Gazettas/Kibrit/Archives/1993/Hamle/05/Kibrit.051193.sera.html
>
> The article, dated 1993-05-11, has the formidable title:

Yesterday in the Ethiopian calendar :)

> «p-t negaso gidada wedeTalyan kobelelu teblo yeteseraCew zegeba f`Sum Heset
> new» yeTalyan Embasi

Titles (in markup) remain transliterated, since a number of browsers that support UTF-8 viewing in the page display area do not support it in the "title" area of the browser's application window. Transliterated Ethiopic actually fares better than UTF-8, since consonants can be a single byte, syllables 2 bytes, and diphthongs 3. On average a document might "compress" with transliteration down to 53%. Not so easy on the eyes, though, but useful as a last resort.

> > Encoded in UTF-8, the file was 1891 bytes long. Converted into SCSU, it
> > dropped to 1121 bytes, which is 40% shorter than the UTF-8 version, better
> > than UTF-16, and probably better than any existing legacy encoding for
> > Ethiopic. SCSU is a Good Thing.

Sounds promising! How well does SCSU gzip?

/Daniel
Re: More about SCSU (was: Re: A UTF-8 based News Service)
Fri, 13 Jul 2001 03:01:10 EDT, [EMAIL PROTECTED] <[EMAIL PROTECTED]> writes:

> Unfortunately, you don't hear much about SCSU, and in particular
> the Unicode Consortium doesn't really seem to promote it much
> (although they may be trying to avoid the "too many UTF's" syndrome).

SCSU doesn't look very nice to me. The idea is OK but it's just too complicated. Various proposals of encoding differences or XORs between consecutive characters are IMHO technically better: much simpler to implement and work as well.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/     ^^              SYGNATURA ZASTĘPCZA [substitute signature]
  QRCZAK
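[Editor's note: for readers unfamiliar with the idea Marcin sketches, a difference scheme transmits the gap between consecutive code points, so runs within one script become small numbers that pack into single bytes. The toy below uses a zig-zag/varint pairing; it is an illustration of the general technique, not any specific proposal from the thread.]

```python
def encode_deltas(text):
    """Toy differential coder: each code point is sent as the zig-zag,
    base-128 varint encoding of its difference from the previous one."""
    out, prev = bytearray(), 0
    for ch in text:
        d = ord(ch) - prev
        prev = ord(ch)
        z = (d << 1) if d >= 0 else (-d << 1) - 1  # zig-zag: sign into bit 0
        while z >= 0x80:                           # varint: 7 bits per byte,
            out.append((z & 0x7F) | 0x80)          # high bit = "more follows"
            z >>= 7
        out.append(z)
    return bytes(out)

def decode_deltas(data):
    chars, prev, z, shift = [], 0, 0, 0
    for b in data:
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:                           # last byte of this varint
            d = (z >> 1) if not z & 1 else -((z + 1) >> 1)
            prev += d
            chars.append(chr(prev))
            z, shift = 0, 0
    return "".join(chars)

sample = "ሀሁሂሃሄህሆ and some ASCII"
encoded = encode_deltas(sample)
print(len(sample.encode("utf-8")), len(encoded))
```

Within the Ethiopic run, each delta is 1, so each character costs one byte; the cost is paid only at script transitions, which is also roughly where SCSU pays its window-switch overhead.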
Re: A UTF-8 based News Service
On Fri, Jul 13, 2001 at 02:14:25AM +0100, David Starner wrote:
> > As someone involved in the service I often wish there was some
> > form of "compressed" Unicode encoding. The 3-byte penalty that
> > Ethiopic bears under UTF-8 turns into higher bandwidth that web
> > hosting services meter and charge for by the megabyte. For a
> > popular site this soon makes UTF-8 a costly option to support.
> >
> > A system analogous to iso-8859-x, whereby Ethiopic and other scripts
> > in the 3-byte range could be shifted back into the 2-byte range,
> > might help (generally only English and Ethiopic are desired together).
> >
> > Fortunately there is mod_gzip for Apache. I would appreciate any
> > information about other options.
>
> What about UTF-16? Encode all characters as 2 bytes, and your problem is
> solved, and UTF-16 should be supported by all recent Unicode-supporting web
> browsers.

UTF-16 is not just 2 bytes; it is sometimes 2 and sometimes 4 bytes.

The IETF is recommending UTF-8 as the prime charset in all Internet protocols.

Kind regards
Keld
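[Editor's note: Keld's size point is easy to verify. BMP characters, including all of Ethiopic, take 2 bytes in UTF-16, while anything above U+FFFF takes a 4-byte surrogate pair; the characters below are arbitrary examples.]

```python
# Byte counts per character in UTF-16 (little-endian, no BOM) vs UTF-8.
ethiopic = "ሀ"     # U+1200, in the BMP
gothic = "𐌰"       # U+10330, outside the BMP (Gothic letter ahsa)

print(len(ethiopic.encode("utf-8")), len(ethiopic.encode("utf-16-le")))  # 3 2
print(len(gothic.encode("utf-8")), len(gothic.encode("utf-16-le")))      # 4 4
```

So for Ethiopic, UTF-16 really does win a byte per character over UTF-8, and non-BMP characters cost 4 bytes in either encoding.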
Re: A UTF-8 based News Service
In message <[EMAIL PROTECTED]> [EMAIL PROTECTED] wrote:
>
> Encoded in UTF-8, the file was 1891 bytes long. Converted into SCSU, it
> dropped to 1121 bytes, which is 40% shorter than the UTF-8 version, better
> than UTF-16, and probably better than any existing legacy encoding for
> Ethiopic. SCSU is a Good Thing.
>

Much as I love SCSU, and much as my web browser supports it, it's not the sort of thing to start encouraging on the wire when there are already existing standards to deal with this. Using HTTP transfer encoding to deflate the data being transferred will work well on most browsers, and is implemented by all good web servers.

A brief test shows deflate can compress it down to 1027 bytes (although I had the original size as 2201 bytes).

-- 
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc                     Tel: +44 (0) 1223 518566
645 Newmarket Road                            Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom            WWW: http://www.pace.co.uk/
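[Editor's note: the kind of test Kevin describes can be approximated with zlib, whose format underlies HTTP's deflate-based content codings. The page content below is a hypothetical stand-in, not the actual article, so the byte counts will differ from his.]

```python
import zlib

# Hypothetical stand-in for a downloaded UTF-8 Ethiopic page.
page = (
    "<html><title>Kibrit</title><body>" + "ሀለሐመ " * 100 + "</body></html>"
).encode("utf-8")

deflated = zlib.compress(page, 9)
print(len(page), len(deflated))

# The transfer is lossless: the client inflates back to the exact bytes.
assert zlib.decompress(deflated) == page
```

Because the compression happens at the HTTP layer, the document itself can stay in plain UTF-8, which is the interoperability point Kevin is making.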
Re: More about SCSU (was: Re: A UTF-8 based News Service)
In a message dated 2001-07-12 22:55:09 Pacific Daylight Time, [EMAIL PROTECTED] writes:

>> SCSU is also registered as an IANA charset, although you are
>> unlikely to find raw SCSU text on the Internet, due to its use
>> of control characters (bytes below 0x20).
>
> And what browser supports SCSU, and what is that browser's reach in terms of
> population? Because that's usually what matters to people that publish on
> the Internet.

None as far as I know, which sort of destroys the whole plan. It would sure be nice if MSIE and Navigator started "quietly" supporting SCSU, in the same way that they "quietly" (to the average user) began supporting UTF-8.

Unfortunately, you don't hear much about SCSU, and in particular the Unicode Consortium doesn't really seem to promote it much (although they may be trying to avoid the "too many UTF's" syndrome).

-Doug Ewell
 Fullerton, California
Re: A UTF-8 based News Service
In a message dated 2001-07-12 8:27:20 Pacific Daylight Time, [EMAIL PROTECTED] writes:

> The Ethiopian News Headlines has relocated to a new server at
> http://www.ethiozena.net/ and is making it easier than ever to
> read news headlines in Unicode. A companion Unicode-only server
> is launched at http://unicode.ethiozena.net/ which serves
> articles in UTF-8 encoding only.

As a test, I downloaded the first article on the page:

http://unicode.ethiozena.net/Gazettas/Kibrit/Archives/1993/Hamle/05/Kibrit.051193.sera.html

The article, dated 1993-05-11, has the formidable title:

«p-t negaso gidada wedeTalyan kobelelu teblo yeteseraCew zegeba f`Sum Heset new» yeTalyan Embasi

Encoded in UTF-8, the file was 1891 bytes long. Converted into SCSU, it dropped to 1121 bytes, which is 40% shorter than the UTF-8 version, better than UTF-16, and probably better than any existing legacy encoding for Ethiopic. SCSU is a Good Thing.

-Doug Ewell
 Fullerton, California
RE: More about SCSU (was: Re: A UTF-8 based News Service)
> SCSU is also registered as an IANA charset, although you are
> unlikely to find raw SCSU text on the Internet, due to its use
> of control characters (bytes below 0x20).

And what browser supports SCSU, and what is that browser's reach in terms of population? Because that's usually what matters to people that publish on the Internet.

YA
Re: A UTF-8 based News Service
> As someone involved in the service I often wish there was some
> form of "compressed" Unicode encoding. The 3-byte penalty that
> Ethiopic bears under UTF-8 turns into higher bandwidth that web
> hosting services meter and charge for by the megabyte. For a
> popular site this soon makes UTF-8 a costly option to support.
>
> A system analogous to iso-8859-x, whereby Ethiopic and other scripts
> in the 3-byte range could be shifted back into the 2-byte range,
> might help (generally only English and Ethiopic are desired together).
>
> Fortunately there is mod_gzip for Apache. I would appreciate any
> information about other options.

What about UTF-16? Encode all characters as 2 bytes, and your problem is solved, and UTF-16 should be supported by all recent Unicode-supporting web browsers.

-- 
David Starner - [EMAIL PROTECTED]
More about SCSU (was: Re: A UTF-8 based News Service)
I should have also mentioned that SCSU is fully supported by the programming toolkit ICU (International Components for Unicode), found at:

http://oss.software.ibm.com/icu/

An Open Source project, ICU is available for free and comes with voluminous documentation.

SCSU is also registered as an IANA charset, although you are unlikely to find raw SCSU text on the Internet, due to its use of control characters (bytes below 0x20).

Hope this helps.

-Doug Ewell
 Fullerton, California
Re: A UTF-8 based News Service
In a message dated 2001-07-12 8:27:20 Pacific Daylight Time, [EMAIL PROTECTED] writes:

> As someone involved in the service I often wish there was some
> form of "compressed" Unicode encoding. The 3-byte penalty that
> Ethiopic bears under UTF-8 turns into higher bandwidth that web
> hosting services meter and charge for by the megabyte. For a
> popular site this soon makes UTF-8 a costly option to support.
>
> A system analogous to iso-8859-x, whereby Ethiopic and other scripts
> in the 3-byte range could be shifted back into the 2-byte range,
> might help (generally only English and Ethiopic are desired together).

Today is your lucky day. Check out Unicode Technical Standard #6, "A Standard Compression Scheme for Unicode":

http://www.unicode.org/unicode/reports/tr6/

SCSU uses 128-character windows to compress small alphabetic scripts to almost 1 byte per character. Since Ethiopic occupies three 128-character half-blocks, SCSU must use three windows and switch between them, but the overhead is still much lower than UTF-8's. In the worst case (each character belongs to a different half-block than the one before), you will still use only 2 bytes per character.

SCSU is fully supported by SC UniPad, a Unicode text editor that is currently available for free. For more information, visit:

http://www.unipad.org/

-Doug Ewell
 Fullerton, California
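[Editor's note: the window mechanics can be sketched concretely. Per UTS #6, tag byte 0x18 (SD0) defines dynamic window 0 using the index byte that follows, and index 0x24 places the window at offset 0x24 × 0x80 = U+1200; after that, each character in the U+1200-U+127F half-block is one byte in 0x80-0xFF, while ASCII passes through as itself. The toy one-way encoder below handles only that single window, nothing like a conforming SCSU implementation.]

```python
def toy_scsu_ethiopic(text):
    """Toy SCSU sketch: ASCII plus the U+1200..U+127F half-block only,
    via dynamic window 0. Real SCSU has 8 windows, Unicode mode, and
    many more tags (see UTS #6)."""
    SD0, WINDOW_INDEX = 0x18, 0x24        # 0x24 * 0x80 == 0x1200
    out = bytearray([SD0, WINDOW_INDEX])  # define + activate window 0
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x80:             # printable ASCII passes through
            out.append(cp)
        elif 0x1200 <= cp <= 0x127F:      # in the active window: one byte
            out.append(cp - 0x1200 + 0x80)
        else:
            raise ValueError("outside this toy encoder's repertoire")
    return bytes(out)

encoded = toy_scsu_ethiopic("ሀሁሂ abc")
print(len("ሀሁሂ abc".encode("utf-8")), len(encoded))  # 13 vs 9
```

The 2-byte window definition is paid once, after which the Ethiopic text runs at one byte per character, which is where the savings Doug describes come from.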