Brian McKee wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On 18-Sep-06, at 11:56 AM, David Morel wrote:

> Brian McKee wrote:
>> file Localizable.strings
>> Localizable.strings: Big-endian UTF-16 Unicode C program character data
>> If I open that file in vim I get
>> ??^@/[EMAIL PROTECTED]@ [EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@ [EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]
>> but Text Edit displays it correctly.
>> Can vi handle this type of file?  If so, how?
> in vim, type :h multibyte
> that should get you started :)

Eeeek - started right around the bend I think :-)
Biggest issue from my current point of view is it studiously ignores Mac OS...

Chris Eidhof suggested
set encoding=utf8
set fileencoding=utf8

which works if you set it before you open the file in question.
Interestingly =utf16 'works' too... or at least it shows plain ASCII type lettering ok.

Between those ideas I've decided to leave things alone and just do a
   :e ++enc=utf16
whenever I see lots of alternating @ signs and letters :-)
I think I'd prefer leaving my standard encoding at latin1 to match the linux
boxes I'm often working on at the same time.

Am I right in understanding that Apple's TextEdit must be automatically
detecting UTF16 files and changing its base encoding to match?

And is there some way that vi could do the same?

Brian
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Darwin)

iD8DBQFFDuvUGnOmb9xIQHQRAi6hAJ9858onQRWXR+kByXCcm/Cpk631bACg2cbB
e2JH8drOIyERomjI7zpPTn0=
=Wa4n
-----END PGP SIGNATURE-----



Your example looks like UTF-16 (or UCS-2) text, i.e. Unicode encoded at two bytes per character for most characters. Such text may contain characters (Chinese, Russian, Hebrew, Greek, Arabic, whatever) which cannot be represented in latin1. I suggest the following (in gvim):

        if &termencoding == ""
                let &termencoding = &encoding
        endif
        set encoding=utf-8
        set fileencodings=ucs-bom,utf-8,latin1

Here's an explanation:

'termencoding' defines how your keyboard encodes the data (and what the display understands). The default is empty, which means "fall back to 'encoding'". If you change 'encoding', you should keep 'termencoding' at the _old_ value of 'encoding', the one which was set according to your OS locale.

'encoding' defines how Vim represents the data in memory. For all Unicode encodings, Vim actually uses UTF-8 internally, because the other Unicode encodings use null bytes within the data, and that is incompatible with the null-terminated strings of the C language.
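
To make the point about null bytes concrete, here is a small Python sketch (Python is used only for illustration and has nothing to do with Vim's internals). It encodes an arbitrary ASCII sample string as big-endian UTF-16 and as UTF-8 and compares the bytes:

```python
# Illustration only: compare how UTF-16 and UTF-8 store plain ASCII text.
text = "vim"  # arbitrary sample string

utf16_be = text.encode("utf-16-be")  # two bytes per character, high byte first
utf8 = text.encode("utf-8")          # one byte per ASCII character

print(utf16_be)  # b'\x00v\x00i\x00m'
print(utf8)      # b'vim'

# Every ASCII character in UTF-16 carries a 0x00 byte, which a C string
# would treat as its terminator; the UTF-8 form contains none.
has_nul_utf16 = b"\x00" in utf16_be
has_nul_utf8 = b"\x00" in utf8
```

Those 0x00 bytes are what Vim displays as ^@ when a UTF-16 file is read as an 8-bit encoding.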

'fileencodings' (plural) defines which heuristics Vim will use to "guess" the 'fileencoding' (singular) of an editfile when opening it. "ucs-bom" means "check for a BOM at the start of the file". The BOM is the codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE (which is deprecated except as an encoding marker). It looks like your file has one; each Unicode encoding has a different disk representation for it (here in hex):

UTF-8:       EF BB BF
UTF-16be:    FE FF
UTF-16le:    FF FE
UTF-32be:    00 00 FE FF
UTF-32le:    FF FE 00 00
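
As a rough illustration of what a BOM check involves, the table above can be turned into a lookup. This is a Python sketch, not Vim's code: `detect_bom` and `BOMS` are hypothetical names, and the sample bytes are made up to resemble a big-endian UTF-16 file.

```python
# Hypothetical helper, not Vim's code: identify a BOM from a file's first bytes.
# Longer signatures are listed first because the UTF-32le BOM (FF FE 00 00)
# begins with the UTF-16le BOM (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32be"),
    (b"\xff\xfe\x00\x00", "utf-32le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
]

def detect_bom(head: bytes):
    """Return the encoding name implied by a leading BOM, or None."""
    for signature, name in BOMS:
        if head.startswith(signature):
            return name
    return None

# Made-up sample: a big-endian UTF-16 file starts with FE FF.
sample = b"\xfe\xff" + "/* ".encode("utf-16-be")
print(detect_bom(sample))         # utf-16be
print(detect_bom(b"plain text"))  # None
```

The ordering matters only for the FF FE prefix shared by UTF-16le and UTF-32le; the other signatures cannot be confused with one another.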

The encodings mentioned in 'fileencodings' are tested from left to right. 'ucs-bom', if present, should be first; and since 8-bit encodings never give an "error signal" (every byte is valid in an 8-bit encoding), there should be at most one 8-bit encoding (such as latin1) and, if present, it should come last.

With these settings, Vim should correctly open any Unicode file with a BOM (as yours seems to be) and any UTF-8 file. 7-bit US-ASCII files will be seen as UTF-8 (which is compatible in the 0x00-0x7F range), and Latin1 files that include accented characters or other bytes in the range 0x80-0xFF will be opened as latin1.
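
The left-to-right testing, and the reason an 8-bit encoding has to come last, can be sketched in Python. This only mirrors the idea behind 'fileencodings' and is not how Vim implements it: `guess_encoding` is a hypothetical helper, BOM handling is left out, and the sample strings are arbitrary.

```python
# Sketch of "try each candidate in order; the first one that decodes wins".
# This mirrors the idea behind 'fileencodings', not Vim's real implementation.
def guess_encoding(data: bytes, candidates=("utf-8", "latin-1")):
    for name in candidates:
        try:
            data.decode(name)  # raises UnicodeDecodeError on invalid input
        except UnicodeDecodeError:
            continue
        return name
    return None

ascii_only = b"hello"
latin1_text = "caf\u00e9".encode("latin-1")  # b'caf\xe9', not valid UTF-8
utf8_text = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'

print(guess_encoding(ascii_only))   # utf-8
print(guess_encoding(latin1_text))  # latin-1
print(guess_encoding(utf8_text))    # utf-8

# With the 8-bit encoding first, it accepts everything and UTF-8 is never tried.
print(guess_encoding(utf8_text, candidates=("latin-1", "utf-8")))  # latin-1
```

The last call shows the failure mode: latin-1 accepts any byte sequence, so anything listed after it is never reached.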


Best regards,
Tony.
