Re: =?utf-8?B?w7HDpcOzIMOkw6PDqcOlw68gw6Q=?==?utf-8?B?w67DucOyw67DrQ==?=

Tzafrir Cohen Mon, 01 Sep 2003 21:19:46 +0000

On Mon, Sep 01, 2003 at 03:02:21PM -0400, Vadim Vygonets wrote:
> Quoth Tzafrir Cohen on Mon, Sep 01, 2003:
> > A small test (I hope you won't mind the Hebrew):
> 
> [snip -- can't do Hebrew ATM]
> 
> > It should have given the same output. Indeed the range between the Yud
> > and the Tav worked, so the regex worked on multibyte Hebrew chars.
> 
> No it didn't.  It replaced vav, which is not between yud and tav.
> I tried replacing the range yud-to-lamed, and it happily gave me
> the same output (i.e., it replaced shin as well).  Something is
> wrong here; and if you think for a second how sed works and how
> UTF-8 is encoded, you will immediately see what it is.


So it seems it was working at the byte level after all (and not
replacing the Vav). OTOH, even when I put in multiple Lameds, I got the
same output.
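
To make the byte-level reading concrete, here is a minimal sketch in
Python rather than sed. The test word (shin, lamed, vav, mem sofit) and
the helper names are my assumptions, since the original test was
snipped. It shows that once the pattern is taken as raw UTF-8 bytes,
the yud-to-tav and yud-to-lamed ranges collapse into the same set of
bytes:

```python
import re

# Hebrew letters as escapes, so the source file's own encoding does not matter.
YUD, LAMED, TAV = "\u05d9", "\u05dc", "\u05ea"
# Assumed test word: shin, lamed, vav, mem sofit (the original test was snipped).
WORD = "\u05e9\u05dc\u05d5\u05dd"


def sub_range_bytes(text, lo, hi):
    """Replace [lo-hi] with X the way a byte-oriented tool would see it."""
    # Yud-to-tav becomes the bytes [ d7 99 - d7 aa ]: the byte 0xd7, the
    # byte range 0x99-0xd7, and the byte 0xaa. It is not a range of letters.
    pattern = b"[" + lo.encode("utf-8") + b"-" + hi.encode("utf-8") + b"]"
    return re.sub(pattern, b"X", text.encode("utf-8"))


def sub_range_chars(text, lo, hi):
    """Replace [lo-hi] with X when the range is over whole characters."""
    return re.sub("[" + lo + "-" + hi + "]", "X", text)


# Byte mode: both ranges give the same (mangled) result.
print(sub_range_bytes(WORD, YUD, TAV))
print(sub_range_bytes(WORD, YUD, LAMED))
# Character mode: the two ranges differ, and vav is left alone in both.
print(ascii(sub_range_chars(WORD, YUD, TAV)))
print(ascii(sub_range_chars(WORD, YUD, LAMED)))
```

In byte mode the lead byte 0xd7 of every letter is inside the range, so
even vav loses half of itself and only its second byte survives.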

> 
> Try to do "| sed s/....../foo/" and see what happens -- you will
> get "fooM", where M is mem sofit.

I'm not exactly sure what should happen. On a Red Hat 9 machine I get
different results.

I figure the behaviour here is still "undefined".
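
For comparison, a small Python sketch (not sed itself) of the two
possible outcomes of the six-dot test. The four-letter input ending in
mem sofit is an assumption, as are the function names. If a dot matches
one byte, six dots eat three two-byte letters and leave the fourth; if a
dot matches one character, a four-letter word is too short to match at
all:

```python
import re

# Assumed input: shin, lamed, vav, mem sofit (four letters, eight UTF-8 bytes).
WORD = "\u05e9\u05dc\u05d5\u05dd"


def six_dots_bytes(text):
    """Apply s/....../foo/ where each dot matches a single byte."""
    # count=1 mirrors sed's s/// without the g flag.
    data = re.sub(b"......", b"foo", text.encode("utf-8"), count=1)
    return data.decode("utf-8")


def six_dots_chars(text):
    """Apply s/....../foo/ where each dot matches a whole character."""
    return re.sub("......", "foo", text, count=1)


print(ascii(six_dots_bytes(WORD)))  # "foo" followed by mem sofit
print(ascii(six_dots_chars(WORD)))  # unchanged: only four characters
```

Which of the two a given sed produces presumably depends on the build
and the locale it runs under.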

> 
> > And I had hell of a time editing this: I practically couldn't insert
> > text, because bash calculated internally Hebrew chars as taking two
> > places (assumed here char==byte).
> 
> I used mlterm to test it, and my zsh had problems as well.
> (mlterm 2.7.0, zsh 4.0.6, FreeBSD 4.8-STABLE)

tcsh and zsh on RH7.3 simply don't support multi-byte chars. They
display each UTF-8 char as two separate chars. The same goes for tcsh
on RH9. I couldn't check zsh there. Is this a matter of missing some
compile-time switches?

> 
> > But this is RedHat 7.3, and the version of bash doesn't support UTF-8
> > well enough. In RH9 it seems much better. 

I checked it, and it is indeed working well (allows editing). Consider
7.3 a sort of "pre-release" regarding Unicode support.

> 
> That's exactly what I'm talking about.  That thing supports this
> encoding, this thing doesn't, and what you have *in the end* is a
> system which, in some rare situations, can take Unicode text and
> deal with it, but mostly it can't.  The assumption of single-byte
> characters shines through, and if you're not careful it bites
> you.

When you have a file name on your system, what exactly does it mean?

> 
> > > Good to know, thanks.  Will mutt re-code text from anything to
> > > Unicode?
> > 
> > Yes. (Thus it is generally more "sensitive" than most GUI clients to bad
> > encoding, as overriding bad encoding tends to be a less than trivial
> > operation)
> 
> You lost me here.  What do you mean by overriding bad encoding,
> and what do other apps do?

Look at the title of this thread. (Though I know of no easy way to
override the subject and sender/recipients' names in standard GUIs.)
Imagine the same happening to the content.
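
As an illustration of what "overriding" means here, a rough Python
sketch that takes the raw Subject header of this thread, decodes it as
declared, and then undoes the mislabelling. The guess that the original
bytes were ISO-8859-8 (Windows-1255 would behave the same for plain
letters) and were wrongly treated as Latin-1 before being converted to
UTF-8 is my assumption; the function names are made up for the example:

```python
import base64
import re

# The raw Subject header of this thread, as it appears in the archive.
SUBJECT = ("=?utf-8?B?w7HDpcOzIMOkw6PDqcOlw68gw6Q=?="
           "=?utf-8?B?w67DucOyw67DrQ==?=")


def decode_b64_words(header):
    """Join the payloads of the base64 encoded-words and decode as UTF-8."""
    chunks = re.findall(r"=\?utf-8\?B\?([^?]*)\?=", header)
    return b"".join(base64.b64decode(c) for c in chunks).decode("utf-8")


def undo_mislabel(text, assumed_charset="iso-8859-8"):
    """Reverse a Latin-1 misreading of bytes that were really assumed_charset."""
    # Encoding back to Latin-1 recovers the original bytes one for one.
    return text.encode("latin-1").decode(assumed_charset)


as_declared = decode_b64_words(SUBJECT)
print(ascii(as_declared))                 # accented Latin letters, i.e. garbage
print(ascii(undo_mislabel(as_declared)))  # Hebrew code points (U+05D0..U+05EA)
```

A client that trusts the declared charset can only show the first form;
getting to the second requires overriding what the header claims.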

-- 
Tzafrir Cohen                       +---------------------------+
http://www.technion.ac.il/~tzafrir/ |vim is a mutt's best friend|
mailto:[EMAIL PROTECTED]       +---------------------------+

