Re: unicode in emacs 21

2001-11-04 Thread Dave Love

> "EZ" == Eli Zaretskii <[EMAIL PROTECTED]> writes:

 EZ> Unless you refer to the CNS plane and Japanese Han characters,
 EZ> which were deliberately left ununified (in addition to the
 EZ> Unicode codepoints for those characters), I think you are
 EZ> mistaken.

I.e., he's right.

Someone needs to give a cogent argument why it's a problem in practice
to have multiple representations if you can canonicalize as required,
especially why this should be any different for Western scripts than
for CJK.  Note that I have some practical experience of this in Emacs.
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: unicode in emacs 21

2001-11-04 Thread Dave Love

> "EZ" == Eli Zaretskii <[EMAIL PROTECTED]> writes:

 EZ> The current plan for Unicode was discussed at length 3 years ago, and
 EZ> the result was what I described.  I don't think it's wise for us to
 EZ> reopen that discussion again

Well I, at least, don't understand why it's necessary, at least for
technical reasons.  I have a fair amount of experience as a user and
implementor.




Re: unicode in emacs 21

2001-11-04 Thread Dave Love

> "MK" == Markus Kuhn <[EMAIL PROTECTED]> writes:

 MK> If you can edit the UTF-8 test file

 MK>   http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

That's what I mean by test cases.  I can't remember which ones fail,
but I suspect it's non-BMP ones.  There are a couple of ways to fix
it, but I don't think it's important.




Re: unicode in emacs 21

2001-11-04 Thread Dave Love

> "JK" == Jimmy Kaplowitz <[EMAIL PROTECTED]> writes:

 JK> It's the only editor I've used (including Yudit) that could
 JK> display the sequence U+0283 U+034D correctly.

[With what font?]

Note that character composition (combination) is a user-level feature
in Emacs, so if rules are implemented which you don't like, you can
change them.

 JK> Well, Emacs does have more features (including some that are less
 JK> essential, such as doctor mode :), but vim has quite enough for
 JK> most purposes.

I assumed the point was specifically about the display, tty v. X.




Re: unicode in emacs 21

2001-11-04 Thread Florian Weimer

"Eli Zaretskii" <[EMAIL PROTECTED]> writes:

>> The GNU Emacs/Unicode proposal I've seen seems to have this property,
>> too.  (At least the proposal is ambiguous, and one interpretation is
>> that you can encode a single character in multiple ways.)
>
> Unless you refer to the CNS plane and Japanese Han characters, which
> were deliberately left ununified (in addition to the Unicode
> codepoints for those characters), I think you are mistaken.

I hope so. ;-)

> Could you please point out where in the proposal do you see that a
> character can be encoded in multiple ways?

I think now that the surrogate stuff has been explained, the encoding
to UCS-E (Unicode-compatible Character Set for Emacs) is indeed
unambiguous.

However, UTF-E (the buffer encoding) opens possibilities for different
encodings of the same UCS-E code point, but this can be resolved, I
think.
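
[Editorial sketch, not part of the original message.  The thread does
not spell out UTF-E, so this uses plain UTF-8 overlong forms as an
analogy for "different encodings of the same code point".  The helper
names lenient_decode_2byte and strict_rejects are hypothetical.]

```python
def lenient_decode_2byte(seq: bytes) -> int:
    """Decode a 2-byte UTF-8-style sequence WITHOUT an overlong check."""
    lead, cont = seq
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a 2-byte sequence")
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def strict_rejects(seq: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# The lenient decoder maps the overlong form C0 AF to the same code
# point as the one-byte form 2F: two byte encodings, one character.
aliased = lenient_decode_2byte(b"\xc0\xaf")
canonical = ord(b"/".decode("utf-8"))
```

Resolving the ambiguity amounts to what the strict decoder does here:
accept only the shortest form and treat everything else as malformed.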




Re: Malformed utf-8 sequences (was: unicode in emacs 21)

2001-10-31 Thread Robert de Bath

On 27 Oct 2001, H. Peter Anvin wrote:

> I don't think it's that hard, actually.  The best way to think of a
> malformed UTF-8 sequence is as a unique character with no semantic
> meaning.  Of course, what constitutes the "sequence" is somewhat
> arbitrary, but ideally such an encoding should have the following
> properties:
>
> a) It cannot be used to encode valid characters.
> b) It's unambiguous (only one possible encoding for any one possible
>    sequence.)

Ermm, Peter, I have a problem with those rules.

Without (a) it's just a 'rawbytes' encoding for all the bytes in the
input stream that are part of an invalid sequence. The only, minor,
issue being that you could edit a sequence of rawbytes to make them
into a valid sequence ... is that really an 'issue'!

If you enforce (a) then you need a 'rawbyte that must be kept with
the previous byte'.  In 28 bits (assuming utf-8) you can do a lead byte
plus up to 3 continuations.

Rawbyte:     8 bits
cont count:  2 bits
cont1:       6 bits
cont2:       6 bits
cont3:       6 bits

In addition you must forbid the entry of rawbytes that are continuation
bytes _and_ the removal of characters that may bring two rawbyte sequences
together, e.g. a leadbyte and its continuation that have been separated
by a non-utf8 folding program.

Personally I think (a) is impossible (unreasonable) for an editor,
even emacs.
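
[Editorial sketch, not part of the original message.  It packs the
28-bit layout listed above into one integer.  The field order (rawbyte
in the top bits, cont3 in the bottom bits) and the names pack_raw and
unpack_raw are assumptions; the message only gives the field widths.]

```python
def pack_raw(lead: int, conts: bytes = b"") -> int:
    """Pack a raw byte plus up to 3 continuation bytes into 28 bits."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("raw byte out of range")
    if len(conts) > 3:
        raise ValueError("at most 3 continuation bytes fit")
    word = (lead << 20) | (len(conts) << 18)  # 8 bits + 2-bit count
    for i, c in enumerate(conts):
        if c & 0xC0 != 0x80:
            raise ValueError("not a continuation byte (10xxxxxx)")
        # Only the 6 payload bits are stored; the 10 prefix is implied.
        word |= (c & 0x3F) << (12 - 6 * i)
    return word


def unpack_raw(word: int):
    """Inverse of pack_raw: return (raw byte, continuation bytes)."""
    lead = (word >> 20) & 0xFF
    count = (word >> 18) & 0x3
    conts = bytes(0x80 | ((word >> (12 - 6 * i)) & 0x3F) for i in range(count))
    return lead, conts
```

The widths add up to 8 + 2 + 3 * 6 = 28 bits, so every packed value
stays below 2**28.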

-- 
Rob.  (Robert de Bath )
   




Re: unicode in emacs 21

2001-10-30 Thread Eli Zaretskii

> From: Florian Weimer <[EMAIL PROTECTED]>
> Date: Tue, 30 Oct 2001 08:09:20 +0100
> 
> Richard Stallman <[EMAIL PROTECTED]> writes:
> 
> > Supporting Unicode superficially while retaining the current internal
> > representation raises a number of problems, one of them being that the
> > internal representation has several alternatives for the same character
> > which correspond to the same code point in Unicode.
> 
> The GNU Emacs/Unicode proposal I've seen seems to have this property,
> too.  (At least the proposal is ambiguous, and one interpretation is
> that you can encode a single character in multiple ways.)

Unless you refer to the CNS plane and Japanese Han characters, which
were deliberately left ununified (in addition to the Unicode
codepoints for those characters), I think you are mistaken.  Could you
please point out where in the proposal do you see that a character can
be encoded in multiple ways?



Re: unicode in emacs 21

2001-10-30 Thread Florian Weimer

Richard Stallman <[EMAIL PROTECTED]> writes:

> Supporting Unicode superficially while retaining the current internal
> representation raises a number of problems, one of them being that the
> internal representation has several alternatives for the same character
> which correspond to the same code point in Unicode.

The GNU Emacs/Unicode proposal I've seen seems to have this property,
too.  (At least the proposal is ambiguous, and one interpretation is
that you can encode a single character in multiple ways.)



Re: unicode in emacs 21

2001-10-29 Thread Markus Kuhn

Dave Love wrote on 2001-10-29 17:49 UTC:
> I don't think it's very important that reading and writing malformed
> sequences by utf-8.el isn't always idempotent.  Presumably the three
> or four relevant test cases could be addressed in the CCL, but I think
> there are better things to spend the time on.

If you can edit the UTF-8 test file

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

(which is rich in all forms of malformed UTF-8 sequences) without causing
surprising changes at places where you didn't insert/delete any
characters during editing, then things should be fine.
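
[Editorial sketch, not part of the original message.  It shows the
load/save round-trip property being asked for, using Python's
"surrogateescape" error handler as a stand-in for an editor's internal
representation.  This is not how Emacs or utf-8.el works; the sample
byte strings are merely illustrative.]

```python
def roundtrip(data: bytes) -> bytes:
    """Decode to an internal string and encode back, keeping bad bytes."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


samples = [
    b"plain ASCII",
    "gr\u00fc\u00df".encode("utf-8"),  # well-formed non-ASCII
    b"\x80",                           # isolated continuation byte
    b"\xfe\xff",                       # bytes that never occur in UTF-8
    b"\xe0\x80\x80",                   # overlong 3-byte sequence
    b"\xe3\x8c\x61",                   # truncated sequence, then 'a'
    b"\xed\xa0\x80",                   # encoded surrogate
]

# True when every sample survives the round trip byte for byte.
unchanged = all(roundtrip(s) == s for s in samples)
```

A conversion that replaced bad bytes with U+FFFD instead would fail
this check, which is the kind of surprising change the test file is
meant to expose.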

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: 




Re: unicode in emacs 21

2001-10-29 Thread Dave Love

> "MK" == Markus Kuhn <[EMAIL PROTECTED]> writes:

 MK> Using UTF-8 as the internal Emacs encoding is one way of achieving
 MK> continued guaranteed binary transparency, 

I.e., maintain a malformed internal representation??

 MK> coming up with a tricky encoding for malformed UTF-8 sequences is
 MK> another one.

We can maintain arbitrary byte sequences now.  It's not terribly
tricky, just not too robust through the use of the eight-bit-x
charsets.

I don't think it's very important that reading and writing malformed
sequences by utf-8.el isn't always idempotent.  Presumably the three
or four relevant test cases could be addressed in the CCL, but I think
there are better things to spend the time on.



Re: unicode in emacs 21

2001-10-29 Thread Dave Love

> "MK" == Markus Kuhn <[EMAIL PROTECTED]> writes:

 MK> CJK Greek/Cyrillic characters are traditionally displayed as
 MK> double-width, whereas ISO 8859/ISO 10646 Greek & Cyrillic
 MK> characters are traditionally displayed single-width.

Yes, but...

 MK> But surely all the European encodings such as ISO 8859, KOI,
 MK> etc. should be urgently unified with Unicode.

The implementation you may recall hearing about earlier in the year is
now available (posted to gnu.emacs.sources).



Re: unicode in emacs 21

2001-10-29 Thread Richard Stallman

That view is unfair to the people who have done lots of work, himi in
particular.  `Working on Unicode support' in my book isn't restricted
to implementing an apparently-unnecessary, disruptive, incompatible
change to the internal encoding, even if it's what one wants ideally.

I think that supporting Unicode at the internal level is the best way
to support it fully, and that's what we have decided to do.  As a
result of that decision, we are sometimes reluctant to put time into
studying, installing and maintaining other approaches which would be
obsolete once we do it the right way.

Supporting Unicode superficially while retaining the current internal
representation raises a number of problems, one of them being that the
internal representation has several alternatives for the same character
which correspond to the same code point in Unicode.



Re: unicode in emacs 21

2001-10-29 Thread Eli Zaretskii


On 28 Oct 2001, Dave Love wrote:

>  EZ> What can I say except ``volunteers are welcome...'' etc.?  I can't 
>  EZ> believe no one wants Unicode badly enough to work on its support in 
>  EZ> Emacs, but what do I do with facts which fly in my face?
> 
> That view is unfair to the people who have done lots of work, himi in
> particular.

Granted, no unfairness was intended; I apologize if my wording suggested 
otherwise.



rwar (was RE: unicode in emacs 21)

2001-10-28 Thread Edward Cherlin

Please subside. If this is a real issue we can arrange a fair side-by-side
test.


Edward Cherlin
There are lies, damned lies, and benchmarks.

>-Original Message-
>From: [EMAIL PROTECTED]
>[mailto:[EMAIL PROTECTED]]On Behalf Of Jimmy Kaplowitz
>Sent: Sun, October 28, 2001 9:19 AM
>To: [EMAIL PROTECTED]
>Cc: Oliver Doepner; [EMAIL PROTECTED]
>Subject: Re: unicode in emacs 21
>
>
>On Sun, Oct 28, 2001 at 05:04:22PM +, Dave Love wrote:
>> >>>>> "OD" == Oliver Doepner <[EMAIL PROTECTED]> writes:
>>
>>  OD> There is vim 6.x now with full utf-8 support on the xterm.
>>
>> [Does `full utf-8 support' mean level 3?]
>
>Well, it handles double-width characters as well as up to two combining
>characters. It's the only editor I've used (including Yudit) that could
>display the sequence U+0283 U+034D correctly.
>
>> Emacs can do utf-8 i/o under ttys that support it, though you don't
>> _need_ such support -- either input or output -- to edit utf-8 text.
>>
>>  OD> It is much faster than emacs on x11 of course.
>>
>> I'm surprised that's much of an issue.  I assume Emacs under X is much
>> more capable.
>
>Well, Emacs does have more features (including some that are less
>essential, such as doctor mode :), but vim has quite enough for most
>purposes.
>
>>  OD> I was happy to see Emacs 21 announced. but the unicode support
>>  OD> does not seem to have moved forward very much
>>
>> It's moved from zero to the state where it's perfectly fine for
>> editing at least the Western technical text that interests me.  E.g.,
>> Kuhn's UTF-8-demo.utf works modulo the level 2 text, for which one can
>> add support straightforwardly at the Lisp level.  It also allowed
>> producing coding systems for all the 8-bit charsets for GNUish
>> locales, which perhaps matters more in the wide world than utf-8 per
>> se.  With some customization, I can also at least _display_
>> utf-8-encoded CJK text.  I can send and receive utf-8-encoded mail and
>> browse utf-8-encoded web sites (with the development W3 package).
>
>Vim can display the UTF-8-demo file perfectly, with no exceptions. Also,
>although I haven't tested this, I am told it can write as well as
>display utf-8 CJK text.
>
>- Jimmy Kaplowitz
>[EMAIL PROTECTED] / [EMAIL PROTECTED]
>




Re: unicode in emacs 21

2001-10-28 Thread Markus Kuhn

On 28 Oct 2001, Dave Love wrote:
>  EZ> So we have two versions of Cyrillic characters, two versions of
>  EZ> Greek characters, two versions of Hebrew characters, etc.:  one
>  EZ> version in the new Unicode set, the other version in the old Mule
>  EZ> set.
>
> There are more than two, at least for Greek and Cyrillic.  Those in
> the Far Eastern charsets could be unified too if anyone cared.

Full unification here would have the disadvantage that CJK Greek/Cyrillic
characters are traditionally displayed as double-width, whereas ISO
8859/ISO 10646 Greek & Cyrillic characters are traditionally displayed
single-width. Some CJK users might be quite happy about a lack of
unification here to preserve the display width of these characters. Same
for the block graphics characters, which xterm with ISO10646 fonts
displays single-width whereas kterm with JIS/etc. fonts displays in
double-width.

But surely all the European encodings such as ISO 8859, KOI, etc. should
be urgently unified with Unicode. The relevant standards have already been
(re)written to represent these encodings just as single-byte encodings of
ISO 10646 subsets.
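
[Editorial sketch, not part of the original message.  The width
question described above is visible in the Unicode East Asian Width
property, which Python exposes through the standard unicodedata
module.  The sample characters are an arbitrary choice.]

```python
import unicodedata


def width_class(ch: str) -> str:
    """Return the East Asian Width class of one character."""
    return unicodedata.east_asian_width(ch)


classes = {
    "latin a": width_class("a"),            # narrow
    "greek alpha": width_class("\u03b1"),   # ambiguous
    "cyrillic ya": width_class("\u042f"),   # ambiguous
    "box drawing": width_class("\u2500"),   # ambiguous
    "han zhong": width_class("\u4e2d"),     # wide
}
```

Class "A" (ambiguous) marks characters that are single-width in a
Western context and double-width in a legacy CJK context, so a renderer
has to pick the width from context rather than from the code point.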

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: 




Re: unicode in emacs 21

2001-10-28 Thread Jimmy Kaplowitz

On Sun, Oct 28, 2001 at 05:04:22PM +, Dave Love wrote:
> > "OD" == Oliver Doepner <[EMAIL PROTECTED]> writes:
> 
>  OD> There is vim 6.x now with full utf-8 support on the xterm.
> 
> [Does `full utf-8 support' mean level 3?]

Well, it handles double-width characters as well as up to two combining
characters. It's the only editor I've used (including Yudit) that could
display the sequence U+0283 U+034D correctly.
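
[Editorial sketch, not part of the original message.  It checks that
the second code point in the sequence above is a combining mark and
counts terminal cells under a deliberately simple assumption: one cell
per base character, none for combining marks, double-width ignored.
The name cell_count is hypothetical.]

```python
import unicodedata


def cell_count(text: str) -> int:
    """Count cells: combining marks share the cell of their base."""
    return sum(0 if unicodedata.combining(ch) else 1 for ch in text)


seq = "\u0283\u034d"  # the U+0283 U+034D sequence from the message
base_is_combining = unicodedata.combining(seq[0]) != 0
mark_is_combining = unicodedata.combining(seq[1]) != 0
cells = cell_count(seq)
```

Displaying the pair correctly means drawing both code points in that
single cell, which depends on the terminal and font as much as on the
editor.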

> Emacs can do utf-8 i/o under ttys that support it, though you don't
> _need_ such support -- either input or output -- to edit utf-8 text.
> 
>  OD> It is much faster than emacs on x11 of course.
> 
> I'm surprised that's much of an issue.  I assume Emacs under X is much
> more capable.

Well, Emacs does have more features (including some that are less
essential, such as doctor mode :), but vim has quite enough for most
purposes.

>  OD> I was happy to see Emacs 21 announced. but the unicode support
>  OD> does not seem to have moved forward very much
> 
> It's moved from zero to the state where it's perfectly fine for
> editing at least the Western technical text that interests me.  E.g.,
> Kuhn's UTF-8-demo.utf works modulo the level 2 text, for which one can
> add support straightforwardly at the Lisp level.  It also allowed
> producing coding systems for all the 8-bit charsets for GNUish
> locales, which perhaps matters more in the wide world than utf-8 per
> se.  With some customization, I can also at least _display_
> utf-8-encoded CJK text.  I can send and receive utf-8-encoded mail and
> browse utf-8-encoded web sites (with the development W3 package).

Vim can display the UTF-8-demo file perfectly, with no exceptions. Also,
although I haven't tested this, I am told it can write as well as
display utf-8 CJK text.

- Jimmy Kaplowitz
[EMAIL PROTECTED] / [EMAIL PROTECTED]



Re: unicode in emacs 21

2001-10-28 Thread Dave Love

> "EZ" == Eli Zaretskii <[EMAIL PROTECTED]> writes:

 EZ> The problem is that characters are still not unified in Emacs 21.

A package was contributed to do that for ISO 8859 characters.  It's
been posted to gnu.emacs.sources, so that shouldn't be an issue for
anyone who's bothered by it.

 EZ> So we have two versions of Cyrillic characters, two versions of
 EZ> Greek characters, two versions of Hebrew characters, etc.:  one
 EZ> version in the new Unicode set, the other version in the old Mule
 EZ> set.

There are more than two, at least for Greek and Cyrillic.  Those in
the Far Eastern charsets could be unified too if anyone cared.  This
issue clearly doesn't apply only to the Unicode charsets, and, as a
user, I don't think it's much of a problem in practice.
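
[Editorial sketch, not part of the original message.  It illustrates
"more than two" versions of one Cyrillic letter: the same character has
different byte encodings in several legacy charsets, and unification
means mapping them all to one code point.  The codecs are standard
Python ones, chosen only as examples; this says nothing about Mule's
own charset tables.]

```python
CYRILLIC_A = "\u0430"  # CYRILLIC SMALL LETTER A
CODECS = ["iso8859_5", "koi8_r", "euc_jp", "gb2312"]

# Byte representation of the same letter in each legacy charset.
encoded = {name: CYRILLIC_A.encode(name) for name in CODECS}

# Decoding each representation lands on one shared code point.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}
unified = set(decoded.values()) == {CYRILLIC_A}
```

The European charsets use one byte and the Far Eastern ones use two,
which lines up with the single-width versus double-width display
tradition raised elsewhere in the thread.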

 EZ> What can I say except ``volunteers are welcome...'' etc.?  I can't 
 EZ> believe no one wants Unicode badly enough to work on its support in 
 EZ> Emacs, but what do I do with facts which fly in my face?

That view is unfair to the people who have done lots of work, himi in
particular.  `Working on Unicode support' in my book isn't restricted
to implementing an apparently-unnecessary, disruptive, incompatible
change to the internal encoding, even if it's what one wants ideally.



Re: unicode in emacs 21

2001-10-28 Thread Dave Love

> "OD" == Oliver Doepner <[EMAIL PROTECTED]> writes:

 OD> There is vim 6.x now with full utf-8 support on the xterm.

[Does `full utf-8 support' mean level 3?]

Emacs can do utf-8 i/o under ttys that support it, though you don't
_need_ such support -- either input or output -- to edit utf-8 text.

 OD> It is much faster than emacs on x11 of course.

I'm surprised that's much of an issue.  I assume Emacs under X is much
more capable.

 OD> I was happy to see Emacs 21 announced. but the unicode support
 OD> does not seem to have moved forward very much

It's moved from zero to the state where it's perfectly fine for
editing at least the Western technical text that interests me.  E.g.,
Kuhn's UTF-8-demo.utf works modulo the level 2 text, for which one can
add support straightforwardly at the Lisp level.  It also allowed
producing coding systems for all the 8-bit charsets for GNUish
locales, which perhaps matters more in the wide world than utf-8 per
se.  With some customization, I can also at least _display_
utf-8-encoded CJK text.  I can send and receive utf-8-encoded mail and
browse utf-8-encoded web sites (with the development W3 package).

The Mule-UCS package provides more if necessary, specifically better
coverage of the BMP.

 OD> Is the internal representation still the special MULE format ??~

Yes.  So what?  [There has been much mis-representation of Mule, some
of it malicious.]  There is a yet-unimplemented scheme for coverage up
to U+10 within that encoding.  Even now, with Lisp-level changes
one could build an (incompatible) Emacs to cover the BMP, sacrificing
some of the standard charsets.

-- 
Bragging about Unicode support: ‘2d sinθ = nλ’ is plain text. ☺
<http://www.unicode.org/>



Re: unicode in emacs 21

2001-10-28 Thread Florian Weimer

Eli Zaretskii <[EMAIL PROTECTED]> writes:

>> Why can't you continue to use the MULE code and just change the
>> character sets to reflect certain aspects of Unicode?
> 
> The current plan for Unicode was discussed at length 3 years ago, and
> the result was what I described.

Is the discussion archived somewhere, or are there some design
documents which resulted from the discussion?

> I don't think it's wise for us to reopen that discussion again,
> unless you think the UTF-8-based representation is a terribly wrong
> design.

Of course, it's hard to come up with constructive criticism when you
don't know what's already there. ;-)

> So I don't see any reason for the unnamed Unicode people to get
> annoyed by a term they themselves coined.

Me neither, but I got flamed in the past. :-/

> Conceivably, changing the internal representation doesn't mean we need
> to rewrite all of the existing code, just the low-level parts of it
> that deal with code conversions (i.e. subroutines of encoding and
> decoding functions).

I still don't understand the need for such a change.  In theory, the
internal representation of characters should be invisible to the
higher levels.



Re: unicode in emacs 21

2001-10-28 Thread Eli Zaretskii


On 28 Oct 2001, Janusz S. Bień wrote:

> On Eli Zaretskii <[EMAIL PROTECTED]>  wrote:
> 
> [...]
> 
> > Lately, the emacs-unicode mailing list was revived, in the hope that it 
> > will boost the activity.  Sadly, the traffic on that list is nil.
> 
> Was the list properly announced? I've seen a mention of it, but no
> instruction how to subscribe. Where is the list hosted? It is not
> accessible from the emacs pages at http://savannah.gnu.org.

It's not a public list (and, given the traffic, I'm not convinced it's 
worth the hassle to make it a public one).  However, the people who 
subscribe to that list know they are subscribed (they've asked for that 
explicitly), so no announcement seems to be necessary.

I can subscribe you if you want.



Re: unicode in emacs 21

2001-10-28 Thread Janusz S. Bień

On Eli Zaretskii <[EMAIL PROTECTED]>  wrote:

[...]

> Lately, the emacs-unicode mailing list was revived, in the hope that it 
> will boost the activity.  Sadly, the traffic on that list is nil.

Was the list properly announced? I've seen a mention of it, but no
instruction how to subscribe. Where is the list hosted? It is not
accessible from the emacs pages at http://savannah.gnu.org.

Best regards

Janusz

-- 
 ,   
dr hab. Janusz S. Bien, prof. UW
Prof. Janusz S. Bien, Warsaw University
http://www.orient.uw.edu.pl/~jsbien/
-
Na tym koncie czytam i wysylam poczte i wiadomosci offline.
On this account I read/post mail/news offline.



Re: unicode in emacs 21

2001-10-28 Thread Florian Weimer

"H. Peter Anvin" <[EMAIL PROTECTED]> writes:

> Does that mean you're painting yourself into a corner, though,
> requiring manual work to integrate the increasingly Unicode-based
> infrastructure support that is becoming available?  Odds are pretty
> good that they are.

I don't think it is a good idea to use operating system Unicode
support.  This would mean that GNU Emacs behaves differently on
different operating systems, depending on the installed locale
descriptions, for example.

OTOH, the character encodings posted earlier to this list are as
incompatible with existing Unicode support as the current emacs-mule
internal encoding.  In effect, just one Emacs-specific internal
encoding is replaced by another.



Re: unicode in emacs 21

2001-10-28 Thread Colin Paul Adams

> "Eli" == Eli Zaretskii <[EMAIL PROTECTED]> writes:

Eli> Would you like to be subscribed to emacs-unicode?

I would, please.
-- 
Colin Paul Adams
Preston Lancashire



Re: unicode in emacs 21

2001-10-28 Thread Eli Zaretskii

[I suggest to have this discussion on emacs-unicode mailing list, so I
added it to the list of addressees.]

> From: Florian Weimer <[EMAIL PROTECTED]>
> Date: Sun, 28 Oct 2001 00:58:29 +0200
> 
> "Eli Zaretskii" <[EMAIL PROTECTED]> writes:
> 
> > Emacs cannot use a pure UTF-8 encoding, since some cultures don't want
> > unification, and it was decided that Emacs should not force
> > unification on those cultures.
> 
> Why can't you continue to use the MULE code and just change the
> character sets to reflect certain aspects of Unicode?

The current plan for Unicode was discussed at length 3 years ago, and
the result was what I described.  I don't think it's wise for us to
reopen that discussion again, unless you think the UTF-8-based
representation is a terribly wrong design.

> One such aspect
> is Latin "unification", for example.  (The Unicode people get very
> annoyed if you talk about "unification", "source separation rule" etc.
> in the context of non-Han scripts...)

IIRC, the term "unification" appears early in the Unicode standard,
not necessarily in conjunction with ``Han unification''.  It is cited
as one of the principles on the Unicode approach.  So I don't see any
reason for the unnamed Unicode people to get annoyed by a term they
themselves coined.

> In a second step, support for normalization, combining characters
> etc. would have to be added, but this could be based on the reliable
> foundation of the old MULE code.

Conceivably, changing the internal representation doesn't mean we need
to rewrite all of the existing code, just the low-level parts of it
that deal with code conversions (i.e. subroutines of encoding and
decoding functions).



Re: unicode in emacs 21

2001-10-28 Thread Eli Zaretskii

> From: David Starner <[EMAIL PROTECTED]>
> Date: Sat, 27 Oct 2001 14:34:04 -0500
> 
> On Thu, Oct 25, 2001 at 07:11:30PM +0200, Eli Zaretskii wrote:
> > What can I say except ``volunteers are welcome...'' etc.?  I can't 
> > believe no one wants Unicode badly enough to work on its support in 
> > Emacs, but what do I do with facts which fly in my face?
> 
> I've spent several years trawling the net for Unicode information. I've
> heard about the emacs-unicode list, but never seen archives or
> subscription information.

Would you like to be subscribed to emacs-unicode?

> Nor have I ever seen the plans to support
> Unicode; looking in the Emacs 21 etc directory of junk provides neither
> these plans nor any idea that they exist.

See etc/TODO.



Re: unicode in emacs 21

2001-10-27 Thread D. Dale Gulledge

"H. Peter Anvin" wrote:

> Does that mean you're painting yourself into a corner, though,
> requiring manual work to integrate the increasingly Unicode-based
> infrastructure support that is becoming available?  Odds are pretty
> good that they are.

Since I volunteered to help with this effort, I'd like to know what's
already out there.  I agree that duplicating functionality in the Emacs
code that is already available from supported free libraries would be a
bad idea unless there is a compelling reason.  Of course, Emacs is
buildable on most systems that have a working C compiler and a standard
implementation of libc.  Depending on anything else, unless it can be
imported into the Emacs source tree would be a questionable idea.

-- 
D. Dale Gulledge, Sr. Programmer,
[EMAIL PROTECTED]
C, C++, Perl, Unix (AIX, Linux), Oracle, Java,
Internationalization (i18n), Awk.



Re: unicode in emacs 21

2001-10-27 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author:    Florian Weimer <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
>
> "Eli Zaretskii" <[EMAIL PROTECTED]> writes:
> 
> > Emacs cannot use a pure UTF-8 encoding, since some cultures don't want
> > unification, and it was decided that Emacs should not force
> > unification on those cultures.
> 
> Why can't you continue to use the MULE code and just change the
> character sets to reflect certain aspects of Unicode?  One such aspect
> is Latin "unification", for example.  (The Unicode people get very
> annoyed if you talk about "unification", "source separation rule" etc.
> in the context of non-Han scripts...)
> 
> In a second step, support for normalization, combining characters
> etc. would have to be added, but this could be based on the reliable
> foundation of the old MULE code.
> 

Does that mean you're painting yourself into a corner, though,
requiring manual work to integrate the increasingly Unicode-based
infrastructure support that is becoming available?  Odds are pretty
good that they are.

-hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt<[EMAIL PROTECTED]>



Re: unicode in emacs 21

2001-10-27 Thread Florian Weimer

"Eli Zaretskii" <[EMAIL PROTECTED]> writes:

> Emacs cannot use a pure UTF-8 encoding, since some cultures don't want
> unification, and it was decided that Emacs should not force
> unification on those cultures.

Why can't you continue to use the MULE code and just change the
character sets to reflect certain aspects of Unicode?  One such aspect
is Latin "unification", for example.  (The Unicode people get very
annoyed if you talk about "unification", "source separation rule" etc.
in the context of non-Han scripts...)

In a second step, support for normalization, combining characters
etc. would have to be added, but this could be based on the reliable
foundation of the old MULE code.



Re: unicode in emacs 21

2001-10-27 Thread David Starner

On Thu, Oct 25, 2001 at 07:11:30PM +0200, Eli Zaretskii wrote:
> What can I say except ``volunteers are welcome...'' etc.?  I can't 
> believe no one wants Unicode badly enough to work on its support in 
> Emacs, but what do I do with facts which fly in my face?

I've spent several years trawling the net for Unicode information. I've
heard about the emacs-unicode list, but never seen archives or
subscription information. Nor have I ever seen the plans to support
Unicode; looking in the Emacs 21 etc directory of junk provides neither
these plans nor any idea that they exist. If you want people to jump
forward to implement these plans, then you need to let people know about
them.

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I saw a daemon stare into my face, and an angel touch my breast; each 
one softly calls my name . . . the daemon scares me less."
- "Disciple", Stuart Davis



Re: unicode in emacs 21

2001-10-27 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author:    Markus Kuhn <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> 
> Not entirely.
> 
> Internal representation does matter somewhat when it comes to the handling
> of malformed UTF-8 sequences. I think it is highly desirable that the
> UTF-8 -> emacs internal -> UTF-8 conversion roundtrip is made 100% binary
> transparent. Loading and saving a file that contains malformed UTF-8
> sequences should not change them, but character encoding conversions are
> prone to throw away information in the case of invalid source byte
> streams.
> 
> Using UTF-8 as the internal Emacs encoding is one way of achieving
> continued guaranteed binary transparency, coming up with a tricky encoding
> for malformed UTF-8 sequences is another one. I favour the former
> approach, which is also what other UTF-8 capable modern editors do today.
> 

I don't think it's that hard, actually.  The best way to think of a
malformed UTF-8 sequence is as a unique character with no semantic
meaning.  Of course, what constitutes the "sequence" is somewhat
arbitrary, but ideally such an encoding should have the following
properties:

a) It cannot be used to encode valid characters.
b) It's unambiguous (only one possible encoding for any one possible
   sequence.)

If that can be obtained, it also solves all the security issues
involved with malformed sequences, since the fundamental cause of the
security hazards is aliasing.

For example, in a 32-bit word, one could use negative numbers for
these sequences.  Valid sequences up to 31 bits are of course
represented by their respective valid, positive numbers.
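
A minimal Python sketch of this idea, not taken from any Emacs code: a
malformed sequence is mapped to a negative number built from its raw
bytes, so it can never collide with a valid (non-negative) character and
each sequence gets exactly one code.  The function names are hypothetical,
and because the sketch relies on Python's unbounded integers it does not
show how to pack the longer sequences into an actual 32-bit word.

```python
def malformed_to_code(seq):
    """Map one malformed byte sequence to a unique negative number."""
    if not seq or seq[0] < 0x80:
        raise ValueError("expected a sequence starting with a non-ASCII byte")
    # The first byte is >= 0x80, so the big-endian value also preserves
    # the length of the sequence (there are no leading zero bytes).
    return -int.from_bytes(seq, "big")


def code_to_bytes(code):
    """Recover the original bytes from a negative code."""
    if code >= 0:
        raise ValueError("non-negative codes are ordinary characters")
    magnitude = -code
    return magnitude.to_bytes((magnitude.bit_length() + 7) // 8, "big")


overlong = bytes.fromhex("E08080")
code = malformed_to_code(overlong)
print(code < 0, code_to_bytes(code) == overlong)
```

Since the mapping is one-to-one, writing the bytes back out needs no
guessing, which is what the two properties above ask for.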

One definition of "sequence" that is reasonably easy to implement is
"lead byte followed by the number of continuation bytes it is supposed
to have, or until terminated by another lead byte; an isolated
continuation byte; or an isolated FE or FF."  Under that definition
the malformed byte stream "E0 80 80 80 80 80 80 80 80 80" consists of
an overlong 3-byte sequence followed by 7 isolated continuation bytes.

The malformed sequences that need such an encoding are then:

a) Isolated continuation bytes;
b) FE or FF bytes;
c) Overlong sequences (e.g. E0 80 80);
d) Truncated sequences (e.g. the first two bytes in E3 8C 61);
e) Encoded surrogate characters (e.g. ED A0 80).

In the case of (e) it may be possible to represent the surrogates by
their respective encoding as if UTF-16 had never existed; however, if
so, one needs to take care that nothing else will try to interpret
them.
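
The definition of "sequence" and the classes listed above can be sketched
in Python roughly as follows.  This is only an illustration under stated
assumptions: the function names are made up, the original 31-bit form of
UTF-8 (lead bytes up to FD) is assumed, and values above U+10FFFF are
therefore reported as valid.

```python
def expected_length(lead):
    """Length announced by a lead byte; 0 for continuation bytes, FE, FF."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0


def split_sequences(data):
    """Split a byte string into sequences according to the definition."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        j = i + 1
        # Take continuation bytes until the announced length is reached
        # or a non-continuation byte terminates the sequence early.
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        out.append(data[i:j])
        i = j
    return out


# Smallest value that legitimately needs n bytes, indexed by n.
MIN_VALUE = [0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000]


def classify(seq):
    """Name the class a single sequence falls into."""
    lead = seq[0]
    if 0x80 <= lead <= 0xBF:
        return "isolated continuation byte"
    if lead >= 0xFE:
        return "FE or FF byte"
    n = expected_length(lead)
    if len(seq) < n:
        return "truncated"
    value = lead if n == 1 else lead & (0x7F >> n)
    for b in seq[1:]:
        value = (value << 6) | (b & 0x3F)
    if value < MIN_VALUE[n]:
        return "overlong"
    if 0xD800 <= value <= 0xDFFF:
        return "encoded surrogate"
    return "valid"


for seq in split_sequences(bytes.fromhex("E0 80 80 80 80 80 80 80 80 80")):
    print(seq.hex(), classify(seq))
```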

-hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt<[EMAIL PROTECTED]>



Re: unicode in emacs 21

2001-10-27 Thread Eli Zaretskii

> From: Markus Kuhn <[EMAIL PROTECTED]>
> Date: Sat, 27 Oct 2001 19:27:51 +0100 (BST)
> 
> On Thu, 25 Oct 2001, Eli Zaretskii wrote:
> > > Is the internal representation still the special MULE format ??~
> >
> > Yes.  But the internal representation is not the problem here; ideally,
> > users and Lisp programs shouldn't be worrying about how characters are
> > represented internally.  The problem is that characters are still not
> > unified in Emacs 21.
> 
> Not entirely.
> 
> Internal representation does matter somewhat when it comes to the handling
> of malformed UTF-8 sequences. I think it is highly desirable that the
> UTF-8 -> emacs internal -> UTF-8 conversion roundtrip is made 100% binary
> transparent.

I think this already works in Emacs 21.1, even though the internal
representation is nowhere near UTF-8.  If you see something else,
please report that as a bug.

> Using UTF-8 as the internal Emacs encoding is one way of achieving
> continued guaranteed binary transparency, coming up with a tricky encoding
> for malformed UTF-8 sequences is another one. I favour the former
> approach, which is also what other UTF-8 capable modern editors do today.

Emacs cannot use a pure UTF-8 encoding, since some cultures don't want
unification, and it was decided that Emacs should not force
unification on those cultures.  So the planned Unicode-based internal
representation resembles UTF-8 very closely, but is not identical to
it.



Re: unicode in emacs 21

2001-10-27 Thread Markus Kuhn

On Thu, 25 Oct 2001, Eli Zaretskii wrote:
> > Is the internal representation still the special MULE format ??~
>
> Yes.  But the internal representation is not the problem here; ideally,
> users and Lisp programs shouldn't be worrying about how characters are
> represented internally.  The problem is that characters are still not
> unified in Emacs 21.

Not entirely.

Internal representation does matter somewhat when it comes to the handling
of malformed UTF-8 sequences. I think it is highly desirable that the
UTF-8 -> emacs internal -> UTF-8 conversion roundtrip is made 100% binary
transparent. Loading and saving a file that contains malformed UTF-8
sequences should not change them, but character encoding conversions are
prone to throw away information in the case of invalid source byte
streams.

Using UTF-8 as the internal Emacs encoding is one way of achieving
continued guaranteed binary transparency, coming up with a tricky encoding
for malformed UTF-8 sequences is another one. I favour the former
approach, which is also what other UTF-8 capable modern editors do today.
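
For comparison, one concrete instance of the second approach (a special
encoding for malformed bytes) is Python's `surrogateescape` error handler,
which maps each undecodable byte to a lone surrogate code point and
restores it on output.  This is Python's mechanism, not anything Emacs
does, and the sample bytes are made up for illustration.

```python
def roundtrip(data):
    """Decode bytes as UTF-8 and encode them again without losing bad bytes."""
    # Each undecodable byte 0xNN becomes the lone surrogate U+DCNN ...
    text = data.decode("utf-8", errors="surrogateescape")
    # ... and is turned back into the original byte here.
    return text.encode("utf-8", errors="surrogateescape")


def lossy_roundtrip(data):
    """The usual conversion: bad bytes are replaced by U+FFFD and lost."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Valid text followed by an overlong sequence, an FF byte and an
# encoded surrogate.
sample = b"caf\xc3\xa9 " + bytes.fromhex("E08080 FF EDA080")
print(roundtrip(sample) == sample)
print(lossy_roundtrip(sample) == sample)
```

The first conversion is binary transparent; the second is the kind of
conversion that throws information away.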

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: 




Re: unicode in emacs 21

2001-10-26 Thread Andreas Schwab

Eli Zaretskii <[EMAIL PROTECTED]> writes:

|> On Thu, 25 Oct 2001, Oliver Doepner wrote:
|> 
|> > my question: what happened in this area in Emacs 21 ??
|> 
|> What happened is that Emacs now supports Unicode characters that 
|> basically span the BMP with the exception of CJK ideographic characters.
|> It also has some initial support for UTF-8.

Note that the Mule-UCS package works fine with Emacs21 and allows you to
send UTF8 mails with Gnus, for example.

Andreas.

-- 
Andreas Schwab  "And now for something
[EMAIL PROTECTED]  completely different."
SuSE Labs, SuSE GmbH, Schanzäckerstr. 10, D-90443 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5



Re: unicode in emacs 21

2001-10-25 Thread Eli Zaretskii


On Thu, 25 Oct 2001, Oliver Doepner wrote:

> my question: what happened in this area in Emacs 21 ??

What happened is that Emacs now supports Unicode characters that 
basically span the BMP with the exception of CJK ideographic characters.
It also has some initial support for UTF-8.

> Is the internal
> representation still the special MULE format ??~

Yes.  But the internal representation is not the problem here; ideally, 
users and Lisp programs shouldn't be worrying about how characters are 
represented internally.  The problem is that characters are still not 
unified in Emacs 21.  So we have two versions of Cyrillic characters, two 
versions of Greek characters, two versions of Hebrew characters, etc.: 
one version in the new Unicode set, the other version in the old Mule 
set.  And Emacs thinks these are different characters, so if you mix 
them without converting them, you are in trouble.
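
A rough Python illustration of the problem, under heavy assumptions: a
character is modelled as a hypothetical (charset, code) pair, which is
not how Emacs actually stores characters, and the charset labels and the
`unify` helper are invented for this example.  Python's own ISO 8859
codecs stand in for the conversion tables.

```python
# Hypothetical model: each character carries the charset it came from.
LEGACY_CODECS = {
    "cyrillic-iso8859-5": "iso8859_5",
    "greek-iso8859-7": "iso8859_7",
}


def unify(char):
    """Return the Unicode code point for a (charset, code) pair."""
    charset, code = char
    if charset == "unicode":
        return code
    # Decode the single legacy byte through Python's codec for that charset.
    return ord(bytes([code]).decode(LEGACY_CODECS[charset]))


old_a = ("cyrillic-iso8859-5", 0xD0)  # Cyrillic small a, legacy charset
new_a = ("unicode", 0x0430)           # the same letter as a Unicode character

print(old_a == new_a)                 # compared naively, they differ
print(unify(old_a) == unify(new_a))   # after conversion, they match
```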

> And are there any plans and/or activities to achieve these things ?

Oh, we have plenty of plans!  The problem is with volunteers who would 
step forward and actually produce some code that implements those plans.

It might come as a surprise to some that the decision to change the 
internal representation of characters to something that is based on 
Unicode and that unifies the characters--that decision was made several 
years ago (beginning of 1998, to be exact).  At that time, discussions 
were held which produced a detailed design of the new representation.  
What remains is for a few motivated individuals to sit down and code the 
darn thing.  Which is where we are today, more than 3 years later.

Lately, the emacs-unicode mailing list was revived, in the hope that it 
will boost the activity.  Sadly, the traffic on that list is nil.

What can I say except ``volunteers are welcome...'' etc.?  I can't 
believe no one wants Unicode badly enough to work on its support in 
Emacs, but what do I do with facts which fly in my face?



Re: unicode in emacs 21

2001-10-25 Thread D. Dale Gulledge

I haven't meant for anything I've written to indicate that Emacs is not
a useful editor for UTF-8 encoded text.  I have found it quite usable. 
I've had a couple of configuration headaches along the way specifically
because I am simultaneously maintaining files in both UTF-8 and Latin-3.

If the alphabets you use fall within the ranges of characters that Emacs
now handles, I can't see any strong argument not to use Emacs.  I
switched to the prereleases of Emacs 21 a few weeks ago specifically for
the Unicode support.  For me, there was really no option of choosing
anything else, even if I had wanted to.  I am doing some heavily
customized stuff supported by a pile of Emacs Lisp code tailored to my
data over the past 6 1/2 years.  Emacs Lisp has saved me hundreds of
hours.

In the end, I would like to see Emacs use Unicode internally.

Oliver Doepner wrote:

> I was happy to see Emacs 21 announced. but the unicode support does not
> seem to have moved forward very much - as i have heard and read from some
> people.
> 
> my question: what happened in this area in Emacs 21 ?? Is the internal
> representation still the special MULE format ??~
> And are there any plans and/or activities to achieve these things ?

-- 
D. Dale Gulledge, Sr. Programmer,
[EMAIL PROTECTED]
C, C++, Perl, Unix (AIX, Linux), Oracle, Java,
Internationalization (i18n), Awk.



unicode in emacs 21

2001-10-25 Thread Oliver Doepner

hello,

my favourite editor is gnu emacs. i use it for html, java programming,
latex and much more.
so far i have been using emacs 20 on GNU/Linux with Otfried Cheong's utf-8
mode and Markus Kuhn's UCS-fonts to work with utf8 files on the X11
version of gnu emacs.

i must say: I am not really satisfied with that. There is vim 6.x now with
full utf-8 support on the xterm. It is much faster than emacs on x11 of
course. But I don't like it yet because it's so different from good old 
Emacs ... :-(

I was happy to see Emacs 21 announced. but the unicode support does not
seem to have moved forward very much - as i have heard and read from some
people.

my question: what happened in this area in Emacs 21 ?? Is the internal
representation still the special MULE format ??~
And are there any plans and/or activities to achieve these things ?

since i am not an elisp hacker i am only asking at the moment.
any comments appreciated

oliver

--
http://www.coli.uni-sb.de/~oldo/
