Re: [Lynx-dev] Unicode-marking, &c

Thorsten Glaser Fri, 27 Feb 2009 01:34:55 -0800

David Woolley dixit:

>> Here under Windows there are constant references to the character that
>> begins a 16-bit-wide-character file (FF FE) or UTF-8 file (EF BB BF).
>
> These are all valid printable characters in ISO 8859/x.  Although somewhat
> unlikely combinations, they are not reserved sequences.


We are talking about a file that does _begin_ with these byte sequences
here, not a file that solely consists of them.

For UCS-*, things are quite clear: you get <\0h\0t\0m\0l\0>, so it
obviously is not any 8-bit encoding.
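
A minimal sketch of that check, in Python. The function name, the 64-octet
window and the one-quarter NUL threshold are assumptions made for
illustration, not anything Lynx does:

```python
import codecs

def looks_like_ucs(data: bytes) -> bool:
    """Guess whether data is 16- or 32-bit-wide text rather than 8-bit text."""
    # A byte order mark at the very start is the strongest hint.
    boms = (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
            codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
    if data.startswith(boms):
        return True
    # Without a BOM: ASCII-range text in UCS-2/UCS-4 is interleaved with
    # NUL octets, which 8-bit text files practically never contain.
    head = data[:64]  # hypothetical window size
    return len(head) >= 2 and head.count(0) >= len(head) // 4
```

Feeding it "<html>" encoded as big-endian UTF-16 gives the
<\0h\0t\0m\0l\0> pattern above and returns True; the plain 8-bit
b"<html>" returns False.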

For UTF-8, it’s not that easy, but:

• If the file is UTF-8 and uses any nōn-ASCII characters, it almost
  always will contain an octet from the [0x80‥0x9F] range, which
  practically rules it out from being encoded as latin1

• In case of doubt: If the file contains only valid UTF-8 with no
  encoding errors (invalid multibyte sequences), lean towards it,
  as it’s the current standard replacing the 8-bit character sets

• If the file only contains ASCII characters, while point #1 above
  is no longer valid, the difference is moot anyway
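
The three points above could be sketched roughly as follows, in Python.
This is only an illustration under stated assumptions: the function names
and the returned labels are made up here, and falling back to latin1 for
anything that is not valid UTF-8 is a simplification.

```python
def has_c1_octet(data: bytes) -> bool:
    """Point 1: any octet in 0x80..0x9F, i.e. a C1 control if read as latin1."""
    return any(0x80 <= b <= 0x9F for b in data)

def guess_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'latin1' as a best guess (hypothetical labels)."""
    # Point 3: pure ASCII reads the same either way, so the question is moot.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Point 2: only valid UTF-8, no invalid multibyte sequences: lean towards it.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: treat everything else as latin1.
        return "latin1"
    return "utf-8"
```

For example, "nōn" encoded as UTF-8 is the octets 6E C5 8D 6E, so
has_c1_octet() is true for it and guess_encoding() returns 'utf-8'.
"é" in UTF-8 is C3 A9 and has no octet in that range, which is why point
#1 says "almost always"; the validity check still classifies it as UTF-8.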

bye,
//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
        -- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2


_______________________________________________
Lynx-dev mailing list
Lynx-dev@nongnu.org
http://lists.nongnu.org/mailman/listinfo/lynx-dev
