Re: print hex value of cp936 Chinese char?

2007-03-07 Thread Cyril Slobin

On 3/7/07, Joseph WU <[EMAIL PROTECTED]> wrote:


> Tony's right. I just tested Cyril's solution. Unfortunately, when
> encoding=utf-8 and fileencoding=cp936, the line "let char =
> matchstr(getline("."), ".", col(".") - 1)" can only get the *FIRST*
> byte of a 2-byte cp936-encoded Chinese character. (To my
> surprise, the hex value of that first byte, as converted by iconv, is
> correct according to the cp936 standard. I just realized that iconv can
> convert just part of one Chinese character. :)
>
> I am trying to see whether I can read the whole 2-byte Chinese char out or not.


Try this:

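" Character under the cursor, in the internal 'encoding'.
let char_int = matchstr(getline("."), ".", col(".") - 1)
" The same character converted to the file's encoding (1 or 2 bytes).
let char_ext = iconv(char_int, &encoding, &fileencoding)
if len(char_ext) == 1
 let code_ext = char2nr(char_ext[0])
 echo printf("0x%02X", code_ext)
else
 " Combine the two bytes, first byte as the high-order byte.
 let code_ext = char2nr(char_ext[0]) * 256 + char2nr(char_ext[1])
 echo printf("0x%04X", code_ext)
endif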

This should work if `encoding` is utf-8 and `fileencoding` is NOT
utf-8 (any 1-byte or 2-byte encoding should work). However, it should
NOT work if `encoding` is anything other than utf-8 or if
`fileencoding` is utf-8 itself. And check the byte order -- I'm not
sure whether big-endian or little-endian is correct here.
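
A minimal sketch that generalizes the snippet above to any number of
bytes, assuming Vim 7 or later. The function name FencHex is made up for
illustration and is not from the thread. It formats the converted bytes in
string order, lead byte first, which is how cp936/GBK codes are
conventionally written, so the big-endian reading is the usual one.

```vim
" Hypothetical helper (not from the thread): hex value of the character
" under the cursor, as it would be encoded in 'fileencoding'.
function! FencHex()
  " Whole character under the cursor, in the internal 'encoding'.
  let l:char = matchstr(getline('.'), '.', col('.') - 1)
  if empty(l:char)
    return ''
  endif
  " An empty 'fileencoding' means "same as 'encoding'".
  let l:fenc = empty(&fileencoding) ? &encoding : &fileencoding
  let l:bytes = iconv(l:char, &encoding, l:fenc)
  " Format every byte in string order, lead byte first.
  let l:hex = ''
  for l:i in range(strlen(l:bytes))
    let l:hex .= printf('%02X', char2nr(l:bytes[l:i]))
  endfor
  return '0x' . l:hex
endfunction
```

It could be tried with ":echo FencHex()" or placed in 'statusline' as
%{FencHex()}.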

--
Cyril Slobin <[EMAIL PROTECTED]> `When I use a word,' Humpty Dumpty said,
 `it means just what I choose it to mean'


Re: print hex value of cp936 Chinese char?

2007-03-07 Thread Zhaojun WU

Oops, josephwu is another account of mine. Gmail just picked the wrong
outgoing mail address. :(

Best,
Zhaojun



Re: print hex value of cp936 Chinese char?

2007-03-07 Thread Joseph WU

Hi, Cyril and Tony,

Thanks.

Tony's right. I just tested Cyril's solution. Unfortunately, when
encoding=utf-8 and fileencoding=cp936, the line "let char =
matchstr(getline("."), ".", col(".") - 1)" can only get the *FIRST*
byte of a 2-byte cp936-encoded Chinese character. (To my
surprise, the hex value of that first byte, as converted by iconv, is
correct according to the cp936 standard. I just realized that iconv can
convert just part of one Chinese character. :)

I am trying to see whether I can read the whole 2-byte Chinese char out or not.

Thanks,

Zhaojun

On 3/8/07, Cyril Slobin <[EMAIL PROTECTED]> wrote:
> On 3/7/07, A.J.Mechelynck <[EMAIL PROTECTED]> wrote:
>
> > IIUC, the above is for 'encoding' and 'fileencoding' being both 8-bit
> > encodings. cp936 is the Microsoft encoding for mainland China: some characters
> > (such as ASCII) are 8 bits, others are 16 bits; and UTF-8 (the 'encoding'
> > Zhaojun uses) can use anywhere between 1 and 4 bytes to represent the
> > "assigned" codepoints. For instance, the highest Unicode codepoint currently
> > regarded as "valid" by the Unicode Consortium, U+10FFFD, is represented in
> > UTF-8 as F4 8F BF BD.
>
> No! The very goal of this code is that `encoding` is utf-8 while
> `fileencoding` is different. But you are right noting that I have not
> tested this `fileencoding` other than 8-bit one -- except the marginal
> case when it is utf-8 too. But again -- `encoding` *is* multibyte and
> it works. Why not just to test? I haven't Chinese fonts installed...
>
> --
> Cyril Slobin <[EMAIL PROTECTED]> `When I use a word,' Humpty Dumpty said,
>  `it means just what I choose it to mean'




--
Best,
Zhaojun (Joseph)


Re: print hex value of cp936 Chinese char?

2007-03-07 Thread Cyril Slobin

On 3/7/07, A.J.Mechelynck <[EMAIL PROTECTED]> wrote:


> IIUC, the above is for 'encoding' and 'fileencoding' being both 8-bit
> encodings. cp936 is the Microsoft encoding for mainland China: some characters
> (such as ASCII) are 8 bits, others are 16 bits; and UTF-8 (the 'encoding'
> Zhaojun uses) can use anywhere between 1 and 4 bytes to represent the
> "assigned" codepoints. For instance, the highest Unicode codepoint currently
> regarded as "valid" by the Unicode Consortium, U+10FFFD, is represented in
> UTF-8 as F4 8F BF BD.


No! The very goal of this code is that `encoding` is utf-8 while
`fileencoding` is different. But you are right to note that I have not
tested it with a `fileencoding` other than an 8-bit one -- except for the
marginal case when it is utf-8 too. But again -- `encoding` *is* multibyte
and it works. Why not just test it? I don't have Chinese fonts installed...

--
Cyril Slobin <[EMAIL PROTECTED]> `When I use a word,' Humpty Dumpty said,
 `it means just what I choose it to mean'


Re: print hex value of cp936 Chinese char?

2007-03-07 Thread A.J.Mechelynck

Cyril Slobin wrote:

> On 3/7/07, A.J.Mechelynck <[EMAIL PROTECTED]> wrote:
>
> > As long as 'encoding' is set to UTF-8, there is no easy way to get the cp936
> > value of a given character in the buffer.
>
> I don't know if there is something specific with cp936, but for
> encodings I use daily
> (cp866, cp1251, koi8-r) the following works:
>
> set statusline=  %{HexDec()} 
>
> function! HexDec()
>  let char = matchstr(getline("."), ".", col(".") - 1)
>  if g:EXTERNAL
>    let char = iconv(char, &encoding, &fileencoding)
>    let format = "0x%02X <%d>"
>  else
>    let format = "0x%02X (%d)"
>  endif
>  let char = char2nr(char)
>  return printf(format, char, char)
> endfunction
>
> nmap   :let EXTERNAL = !EXTERNAL
> imap  
>
> Code of a character under cursor is always displayed in statusline (in
> both hex and dec),
> and I can switch between `encoding` and `fileencoding`.



IIUC, the above is for 'encoding' and 'fileencoding' being both 8-bit 
encodings. cp936 is the Microsoft encoding for mainland China: some characters 
(such as ASCII) are 8 bits, others are 16 bits; and UTF-8 (the 'encoding' 
Zhaojun uses) can use anywhere between 1 and 4 bytes to represent the 
"assigned" codepoints. For instance, the highest Unicode codepoint currently 
regarded as "valid" by the Unicode Consortium, U+10FFFD, is represented in 
UTF-8 as F4 8F BF BD.
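
The byte counts above can be checked from inside Vim. This is only a
sketch, assuming 'encoding' is set to utf-8; the function name Utf8Bytes
is invented for illustration.

```vim
" Hypothetical helper: the UTF-8 bytes of a codepoint, as hex.
" Assumes 'encoding' is utf-8, so nr2char() yields a UTF-8 string.
function! Utf8Bytes(codepoint)
  let l:ch = nr2char(a:codepoint)
  let l:hex = []
  " Indexing a string with [] yields one byte at a time.
  for l:i in range(strlen(l:ch))
    call add(l:hex, printf('%02X', char2nr(l:ch[l:i])))
  endfor
  return join(l:hex, ' ')
endfunction

echo Utf8Bytes(0x41)       " 41
echo Utf8Bytes(0x10FFFD)   " F4 8F BF BD
```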


Best regards,
Tony.
--
Don't take life so serious, son, it ain't nohow permanent.
-- Walt Kelly


Re: print hex value of cp936 Chinese char?

2007-03-07 Thread Cyril Slobin

On 3/7/07, A.J.Mechelynck <[EMAIL PROTECTED]> wrote:


> As long as 'encoding' is set to UTF-8, there is no easy way to get the cp936
> value of a given character in the buffer.


I don't know if there is something specific with cp936, but for
encodings I use daily
(cp866, cp1251, koi8-r) the following works:

set statusline=  %{HexDec()} 



function! HexDec()
 let char = matchstr(getline("."), ".", col(".") - 1)
 if g:EXTERNAL
   let char = iconv(char, &encoding, &fileencoding)
   let format = "0x%02X <%d>"
 else
   let format = "0x%02X (%d)"
 endif
 let char = char2nr(char)
 return printf(format, char, char)
endfunction



nmap   :let EXTERNAL = !EXTERNAL
imap  

The code of the character under the cursor is always displayed in the
statusline (in both hex and dec), and I can switch between `encoding`
and `fileencoding`.

--
Cyril Slobin <[EMAIL PROTECTED]> `When I use a word,' Humpty Dumpty said,
 `it means just what I choose it to mean'


Re: print hex value of cp936 Chinese char?

2007-03-07 Thread A.J.Mechelynck

Zhaojun WU wrote:

> Hi, Tony,
>
> Thanks for your reply.
>
> I've already tried the xxd solution, but it converts all of the
> characters into hex values, so it is hard to locate a particular
> character's value. It might be possible to copy this character out to
> a new window and use ":%!xxd" to check its hex value, but it takes two
> steps with an additional window. It is acceptable but a little bit
> inconvenient.
> (Is there any shortcut (I mean a key combination or function) for these
> steps?)
>
> [...]

As long as 'encoding' is set to UTF-8, there is no easy way to get the cp936 
value of a given character in the buffer. It would be possible to write the 
character to a cp936-encoded file and convert that to hex; once the required 
function or command has been written in vimscript, you can map them to a key. 
I haven't tried it, but here are the steps I can think of:


1) (This is the big one) Write a function to create a file (probably with the
help of "tempname()") containing a string (given as a parameter) in a given
'fileencoding' (given as another parameter), apply xxd to that file, and
":echo" the result (using ":echo system('xxd ' . tempfilename)"). Conversion
failures should be handled somehow.

2) (optional) Write a command or commands to invoke that function.
3) Map the invocation to a key.
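
The three steps above can be sketched roughly as follows. This is an
untested-in-the-thread illustration, not Tony's code: the names XxdString
and XxdChar and the <F9> key are arbitrary choices, and it assumes that
xxd is on the PATH and that Vim has writefile() and shellescape().

```vim
" Step 1 (hypothetical helper): convert a:str to a:fenc, write the raw
" bytes to a temporary file and return the xxd dump of that file.
function! XxdString(str, fenc)
  " An empty encoding name means "same as 'encoding'".
  let l:fenc = empty(a:fenc) ? &encoding : a:fenc
  let l:converted = iconv(a:str, &encoding, l:fenc)
  if empty(l:converted) && !empty(a:str)
    return 'conversion to ' . l:fenc . ' failed'
  endif
  let l:tmp = tempname()
  " The 'b' flag writes the bytes as they are, without a trailing newline.
  call writefile([l:converted], l:tmp, 'b')
  let l:dump = system('xxd ' . shellescape(l:tmp))
  call delete(l:tmp)
  return l:dump
endfunction

" Step 2: a command that dumps the character under the cursor.
command! XxdChar echo XxdString(matchstr(getline('.'), '.', col('.') - 1), &fileencoding)

" Step 3: map the command to a key.
nnoremap <F9> :XxdChar<CR>
```

Note that iconv() replaces characters it cannot convert with "?", so a
dump showing 3f may indicate a failed conversion rather than a real
question mark.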


Best regards,
Tony.
--
There were in this country two very large monopolies.  The larger of
the two had the following record: the Vietnam War, Watergate, double-
digit inflation, fuel and energy shortages, bankrupt airlines, and the
8-cent postcard.  The second was responsible for such things as the
transistor, the solar cell, lasers, synthetic crystals, high fidelity
stereo recording, sound motion pictures, radio astronomy, negative
feedback, magnetic tape, magnetic "bubbles", electronic switching
systems, microwave radio and TV relay systems, information theory, the
first electrical digital computer, and the first communications
satellite.  Guess which one got to tell the other how to run the
telephone business?


Re: print hex value of cp936 Chinese char?

2007-03-07 Thread Zhaojun WU

Hi, Tony,

Thanks for your reply.

I've already tried the xxd solution, but it converts all of the
characters into hex values, so it is hard to locate a particular
character's value. It might be possible to copy this character out to
a new window and use ":%!xxd" to check its hex value, but it takes two
steps with an additional window. It is acceptable but a little bit
inconvenient.
(Is there any shortcut (I mean a key combination or function) for these steps?)

For the 2nd solution you proposed, it is not suitable for me since I
always need to handle many files with different encodings. That's the
reason why I set the encoding as utf-8.

Anyway, thanks for your idea. :)

Zhaojun




Re: print hex value of cp936 Chinese char?

2007-03-07 Thread A.J.Mechelynck

Zhaojun WU wrote:

> Hi, all,
>
> Is it possible to print the hex value of the cp936-encoded Chinese
> character under the cursor, just like "ga" does for an ASCII char?
>
> I found that "g8" prints the correct hex value of a UTF-8-encoded
> Chinese character in an open UTF-8 encoded file.
>
> But for a cp936 (or GBK) encoded file, although gvim can auto-detect
> it as cp936 and open it successfully, I still cannot figure out how to
> get the hex value of a Chinese character.
>
> In this case (I mean opening a cp936 file in gvim), I tried "ga" and
> "g8" for fun. "ga" does not work as I expected, and "g8" still prints
> the hex value of the Chinese character's UTF-8 encoding, not the cp936
> one that I need. I think "g8" behaves this way because gvim treats all
> characters as UTF-8 internally. Am I right?
>
> Again, is it possible for me to print the hex value of a cp936-encoded
> Chinese char?
>
> BTW, the encoding related settings in my .vimrc are:
>
> set encoding=utf-8
> set fileencodings=ucs-bom,utf-8,cp936,latin1
>
> Thanks,


":set encoding=utf-8" tells Vim to use UTF-8 internally to represent all file 
data. This is usually OK; but the replies to ga and g8 will be based on what 
is in memory, i.e., the UTF-8 equivalents of the character at the cursor.


If you want to examine the actual cp936 data at a given point in the file, I 
can think of two methods:



Method I. Convert to hex display.

This can be done by means of the "xxd" utility, which is normally distributed 
together with Vim, as follows:


xxd < filename.txt > filename.hex

"filename.hex", which is a text file, will then contain the hex values of all
bytes in the file, with the offset within the file (in hex) at the left, and
the "text" at the right, 16 ASCII characters per line, with unprintable bytes
replaced by dots.
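
When only a small part of the file is of interest, xxd's -s (start offset)
and -l (length) options limit the dump. A small sketch of calling it from
Vim follows; the function name XxdRange is hypothetical, and the byte
offset has to be known beforehand, since it refers to the file on disk and
not to the UTF-8 text in the buffer.

```vim
" Hypothetical helper: dump a:length bytes of a file, starting at the
" byte offset a:offset. Assumes xxd is on the PATH.
function! XxdRange(fname, offset, length)
  let l:cmd = printf('xxd -s %d -l %d %s',
        \ a:offset, a:length, shellescape(a:fname))
  return system(l:cmd)
endfunction

" Example: the first 32 bytes of the file in the current buffer.
echo XxdRange(expand('%'), 0, 32)
```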



Method II. Use cp936 as internal encoding. (Untested)

This means setting 'encoding' to cp936 rather than utf-8. Beware! You should 
have no file in any other multibyte encoding in the same instance of Vim, not 
even in a different window, not even hidden-but-modified. Better start afresh 
with a new run of gvim, edit only that single file, use ga as necessary, and 
close Vim when done.



Best regards,
Tony.
--
It's a damn poor mind that can only think of one way to spell a word.
-- Andrew Jackson


print hex value of cp936 Chinese char?

2007-03-07 Thread Zhaojun WU

Hi, all,

Is it possible to print the hex value of the cp936-encoded Chinese
character under the cursor, just like "ga" does for an ASCII char?

I found that "g8" prints the correct hex value of a UTF-8-encoded
Chinese character in an open UTF-8 encoded file.

But for a cp936 (or GBK) encoded file, although gvim can auto-detect
it as cp936 and open it successfully, I still cannot figure out how to
get the hex value of a Chinese character.

In this case (I mean opening a cp936 file in gvim), I tried "ga" and
"g8" for fun. "ga" does not work as I expected, and "g8" still prints
the hex value of the Chinese character's UTF-8 encoding, not the cp936
one that I need. I think "g8" behaves this way because gvim treats all
characters as UTF-8 internally. Am I right?

Again, is it possible for me to print the hex value of a cp936-encoded
Chinese char?

BTW, the encoding related settings in my .vimrc are:

set encoding=utf-8
set fileencodings=ucs-bom,utf-8,cp936,latin1

Thanks,
--
Best,
Zhaojun