Carl Lowenstein wrote:
> 2009/1/2 James G. Sack (jim) <[email protected]>:
>> Carl Lowenstein wrote:
>>> The particular bunch of text I was working with has a 3-byte
>>> representation for an apostrophe, E2 80 99. determined by using "od -t
>>> x1".  VIM displays this only as an apostrophe-like symbol, in one
>>> character cell.  Moving the cursor to that place and entering "ga"
>>> gives the result "hex 2019".  Which is not right at all.
>> Well, it is unicode code point 2019 ("RIGHT SINGLE QUOTATION MARK").
>> Check the "general punctuation" block in gucharmap (the "character map"
>> application in the gui menu).
>>
>>  UTF-8: 0xE2 0x80 0x99
>>  UTF-16: 0x2019
>>
>>  C octal escaped UTF-8: \342\200\231
>>  XML decimal entity: &#8217;
>>
> 
> Well, I'm confused.
> If it is UTF-16 0x2019 which is 2 bytes according to my calculation,
> how can it also be UTF-8 0xE2 0x80 0x99 which is 3 bytes.
> What it is in the byte stream should not be a function of the program
> used to interpret it.  I suppose VIM is trying to do the right thing,
> for some value of _right_.

I left out part of its name. The character has a Unicode identifier
(its "code point"), which is 2019, and also a standard name; the full
callout in gucharmap is
  U+2019 RIGHT SINGLE QUOTATION MARK

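The code point is just a number identifying the character; UTF-8 and
UTF-16 are two different ways of encoding that number as bytes. Your
file is UTF-8, hence the three bytes E2 80 99, while vim's "ga" reports
the code point rather than the bytes. The code point happens to be
written in hex and is, in fact, the same as the UTF-16 encoding -- but
only within the 64k-entry Basic Multilingual Plane (U+0000 to U+FFFF).
For code points above that (such as Linear B), UTF-16 requires a
two-unit encoding (a surrogate pair).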
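
To see the code point next to its two encodings, here is a small
Python 3 sketch. The helper name show_encodings is only something I
made up for illustration, and I picked U+10000 as an example of a
Linear B character outside the basic plane.

```python
import unicodedata

def show_encodings(ch):
    """Return code point, name, and UTF-8 / UTF-16 bytes for one character."""
    def hexbytes(raw):
        # Space-separated hex pairs, e.g. "e2 80 99"
        return " ".join("%02x" % b for b in raw)
    return {
        "code_point": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch),
        "utf8": hexbytes(ch.encode("utf-8")),
        # Big-endian, no byte-order mark, so the bytes read like the hex number
        "utf16": hexbytes(ch.encode("utf-16-be")),
    }

# U+2019 is inside the basic plane; U+10000 (Linear B) is above it.
for ch in ("\u2019", "\U00010000"):
    info = show_encodings(ch)
    print(info["code_point"], info["name"])
    print("  UTF-8 :", info["utf8"])
    print("  UTF-16:", info["utf16"])
```

For U+2019 this gives "e2 80 99" for UTF-8 and "20 19" for UTF-16; for
U+10000 the UTF-16 form becomes the four bytes "d8 00 dc 00", which no
longer look like the code point.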

> 
>>> I have not found a hex dump routine that produces as user-friendly a
>>> display as "od -cb" does for octal.  That is, parallel lines of
>>> character and numeric representations of each byte, with the same
>>> horizontal spacing so it is obvious what belongs together.
>> is there a way to get hex-pairs instead of octal?
> 
> $ od -tx1 -c luigi.txt
> 0000000 4c 75 69 67 69 e2 80 99 73 20 50 69 7a 7a 61 0a
>           L   u   i   g   i 342 200 231   s       P   i   z   z   a  \n
> 0000020
> 
> Yes but the hex-pairs occupy 3 character cells and the ASCII occupies
> 4 character cells so they don't line up on the screen.  Fixed-width
> font, of course.
> A simple awk or sed script could fix this, and I have probably done
> that in the past but didn't write it down anywhere.
> 
> 
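If an awk or sed script doesn't turn up, the alignment could also be
done in Python 3. This is only a rough sketch, not an existing tool:
the function name aligned_dump, the hex offsets, and the ".." marker
for non-printable bytes are all my own choices. Every byte gets a
two-character cell on both rows, so the columns line up.

```python
def aligned_dump(data, width=16):
    """Return a hex row and a character row per chunk, two cells per byte."""
    escapes = {0x00: "\\0", 0x09: "\\t", 0x0A: "\\n", 0x0D: "\\r"}

    def cell(b):
        if b in escapes:
            return escapes[b]          # two-character escape such as \n
        if 0x20 <= b < 0x7F:
            return " " + chr(b)        # printable ASCII, padded to two cells
        return ".."                    # anything else (e.g. UTF-8 bytes)

    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        lines.append("%07x %s" % (off, " ".join("%02x" % b for b in chunk)))
        lines.append("%s %s" % (" " * 7, " ".join(cell(b) for b in chunk)))
    return "\n".join(lines)

print(aligned_dump("Luigi\u2019s Pizza\n".encode("utf-8")))
```
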
>> Yeah, that's the unicode char in there, all right.
>> I attach a python program to list all the (non-ascii) unicode, which you
>> may find useful. The b=# value is the (zero-based)byte offset.
> 
> As usual, attachments get scrubbed by the mailing list.

Oops, forgot.
download it from http://pub.jgsack.net/kplug/
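
In case the download is inconvenient, the idea can be sketched in a few
lines of Python 3. This is not the program at that URL, just a minimal
illustration assuming the input really is UTF-8; the function name
list_non_ascii is hypothetical.

```python
import unicodedata

def list_non_ascii(data):
    """Return (byte_offset, code_point, name) for each non-ASCII character.

    data must be bytes holding valid UTF-8; offsets are zero-based.
    """
    found = []
    offset = 0  # byte offset of the current character in the input
    for ch in data.decode("utf-8"):
        if ord(ch) > 0x7F:
            name = unicodedata.name(ch, "<unnamed>")
            found.append((offset, "U+%04X" % ord(ch), name))
        # Advance by however many bytes this character occupies in UTF-8
        offset += len(ch.encode("utf-8"))
    return found

for b, cp, name in list_non_ascii("Luigi\u2019s Pizza\n".encode("utf-8")):
    print("b=%d %s %s" % (b, cp, name))
# prints: b=5 U+2019 RIGHT SINGLE QUOTATION MARK
```

To run it on a real file, read the file in binary mode (open(path,
"rb").read()) and pass the bytes in.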

> 
>> It probably isn't very useful unless the input is actually UTF-8
>> unicode. ;-)
>>
>>> Side note.  uuencode/uudecode seem to have disappeared from modern
>>> Linux systems.  Their replacement is called uuenview/uudeview, and is
>>> almost but not exactly compatible.  Notice the lack of the second
>>> space after "begin".  I don't know what this might do with a real
>>> uudecode.
>> I have uuencode/uudecode from a sharutils rpm package. It decoded
>> luigi.txt fine.
> 
> Somehow I forgot about sharutils when I was seeking software with
> YumExtender.  It doesn't come up in a search for uuencode.

Regards,
..jim

-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-newbie
