2009/1/2 James G. Sack (jim) <[email protected]>:
> Carl Lowenstein wrote:
>>
>> The particular bunch of text I was working with has a 3-byte
>> representation for an apostrophe, E2 80 99, determined by using "od -t
>> x1". VIM displays this only as an apostrophe-like symbol, in one
>> character cell. Moving the cursor to that place and entering "ga"
>> gives the result "hex 2019". Which is not right at all.
>
> Well, it is unicode code point 2019 ("RIGHT SINGLE QUOTATION MARK").
> Check the "general punctuation" block in gucharmap (the "character map"
> application in the gui menu).
>
> UTF-8: 0xE2 0x80 0x99
> UTF-16: 0x2019
>
> C octal escaped UTF-8: \342\200\231
> XML decimal entity: &#8217;
>
Well, I'm confused.
If it is UTF-16 0x2019, which is 2 bytes by my calculation,
how can it also be UTF-8 0xE2 0x80 0x99, which is 3 bytes?
What is in the byte stream should not be a function of the program
used to interpret it. I suppose VIM is trying to do the right thing,
for some value of _right_.
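
One way to reconcile the two views, sketched in Python 3 (3.8 or later,
for the separator argument to hex()): 2019 is the Unicode code point,
which is what "ga" reports, while E2 80 99 is how UTF-8 stores that code
point in the byte stream. UTF-16 would store the same code point as the
two bytes 20 19. The variable names are illustrative only.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the character under discussion.
ch = "\u2019"

code_point = ord(ch)             # the abstract Unicode number
utf8 = ch.encode("utf-8")        # how a UTF-8 file stores it: three bytes
utf16 = ch.encode("utf-16-be")   # big-endian UTF-16, no byte-order mark: two bytes

print(f"code point: U+{code_point:04X}")
print("UTF-8 :", utf8.hex(" "))
print("UTF-16:", utf16.hex(" "))
print("octal :", "".join(f"\\{b:03o}" for b in utf8))
print(f"XML   : &#{code_point};")
```

So the byte stream does not change; the file holds the three UTF-8 bytes
and the editor shows the code point they decode to. In Vim, "g8" on the
same character should show the bytes rather than the code point.
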
>> I have not found a hex dump routine that produces as user-friendly a
>> display as "od -cb" does for octal. That is, parallel lines of
>> character and numeric representations of each byte, with the same
>> horizontal spacing so it is obvious what belongs together.
>
> is there a way to get hex-pairs instead of octal?
$ od -tx1 -c luigi.txt
0000000 4c 75 69 67 69 e2 80 99 73 20 50 69 7a 7a 61 0a
          L   u   i   g   i 342 200 231   s       P   i   z   z   a  \n
0000020
Yes, but the hex pairs occupy 3 character cells each and the ASCII
occupies 4 character cells each, so they don't line up on the screen.
Fixed-width font, of course.
A simple awk or sed script could fix this, and I have probably done
that in the past but didn't write it down anywhere.
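
A rough Python 3 sketch of such a fix, rather than awk or sed: it prints
each byte as a hex pair in a 4-cell column, with the character row below
it in the same columns. The names dump_aligned and cell are made up for
this example, the offsets are printed in hex (od defaults to octal), and
non-ASCII bytes are shown as "." instead of od's octal triplets.

```python
# Named escapes for a few control bytes, similar in spirit to "od -c".
ESCAPES = {0x00: r"\0", 0x09: r"\t", 0x0A: r"\n", 0x0D: r"\r"}


def cell(byte):
    """Return the text shown in the character row for one byte."""
    if byte in ESCAPES:
        return ESCAPES[byte]
    if 0x20 <= byte < 0x7F:      # printable ASCII, including space
        return chr(byte)
    return "."                   # anything else, including UTF-8 lead/continuation bytes


def dump_aligned(data, width=16):
    """Return a list of lines: hex row, character row, ..., final offset."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Hex row: 7-digit hex offset, then each byte in a 4-cell column.
        lines.append(f"{offset:07x}" + "".join(f"  {b:02x}" for b in chunk))
        # Character row: blank where the offset was, same 4-cell columns.
        lines.append(" " * 7 + "".join(f"{cell(b):>4}" for b in chunk))
    lines.append(f"{len(data):07x}")   # total length, as od prints at the end
    return lines


if __name__ == "__main__":
    sample = "Luigi\u2019s Pizza\n".encode("utf-8")
    print("\n".join(dump_aligned(sample)))
```

To run it on a real file, the sample could be replaced with the result
of open(path, "rb").read().
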
> Yeah, that's the unicode char in there, all right.
> I attach a python program to list all the (non-ascii) unicode, which you
> may find useful. The b=# value is the (zero-based) byte offset.
As usual, attachments get scrubbed by the mailing list.
> It probably isn't very useful unless the input is actually UTF-8
> unicode. ;-)
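
Since the attachment did not survive, here is only a guess at what such a
lister might look like in Python 3. It is not the original program. The
function name list_non_ascii and the output layout are assumptions; only
the "b=" zero-based byte offset comes from the description above. As
noted, it assumes the input really is UTF-8 and raises an error otherwise.

```python
import unicodedata


def list_non_ascii(data):
    """Yield (byte_offset, char, name) for each non-ASCII character in UTF-8 bytes."""
    text = data.decode("utf-8")   # raises UnicodeDecodeError if data is not valid UTF-8
    offset = 0                    # zero-based byte offset of the current character
    for ch in text:
        if ord(ch) > 0x7F:
            yield offset, ch, unicodedata.name(ch, "<unnamed>")
        # Advance by the number of bytes this character occupies in UTF-8.
        offset += len(ch.encode("utf-8"))


if __name__ == "__main__":
    sample = "Luigi\u2019s Pizza\n".encode("utf-8")
    for b, ch, name in list_non_ascii(sample):
        print(f"b={b} U+{ord(ch):04X} {name}")
```

For the luigi.txt bytes shown in the od output, this should report one
character at b=5, U+2019 RIGHT SINGLE QUOTATION MARK.
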
>
>>
>> Side note. uuencode/uudecode seem to have disappeared from modern
>> Linux systems. Their replacement is called uuenview/uudeview, and is
>> almost but not exactly compatible. Notice the lack of the second
>> space after "begin". I don't know what this might do with a real
>> uudecode.
>
> I have uuencode/uudecode from a sharutils rpm package. It decoded
> luigi.txt fine.
Somehow I forgot about sharutils when I was seeking software with
YumExtender. It doesn't come up in a search for uuencode.
carl
--
carl lowenstein marine physical lab u.c. san diego
[email protected]
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-newbie