There is no other way to detect UTF-8 without a BOM, so csc.exe has to do
the same. :) But the test could be run on only the first n bytes of a
stream, and the rest of the stream could then be assumed to have the same
encoding.
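
A rough sketch of that prefix check, in Python rather than C# just to show
the idea; it is not the mcs patch. The function name, the default n and the
use of an incremental decoder are my assumptions. The one subtle point is
that cutting at n bytes can split a multi-byte sequence, so an incomplete
sequence at the very end of the prefix must not count as invalid:

```python
import codecs
import io


def looks_like_utf8_prefix(stream: io.BufferedIOBase, n: int = 4096) -> bool:
    """Validate only the first n bytes of a binary stream as UTF-8.

    Hypothetical helper: the name and the default n are illustrative.
    Assumes read(n) returns fewer than n bytes only at end of stream,
    which holds for ordinary files and io.BytesIO.
    """
    prefix = stream.read(n)
    at_eof = len(prefix) < n
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # With final=False, a multi-byte sequence cut off by the n-byte
        # limit is buffered instead of being reported as an error.
        decoder.decode(prefix, final=at_eof)
    except UnicodeDecodeError:
        return False
    return True
```

The obvious limitation: a file whose first non-UTF-8 byte comes after the
first n bytes is still taken as UTF-8, so n trades speed against accuracy.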

Kornél

----- Original Message -----
From: "Atsushi Eno" <[EMAIL PROTECTED]>
To: "Kornél Pál" <[EMAIL PROTECTED]>
Cc: "mono-devel mailing list" <[email protected]>; "Marek
Safar" <[EMAIL PROTECTED]>
Sent: Tuesday, August 23, 2005 11:50 AM
Subject: Re: [Mono-dev] mcs patch for default encoding


I don't think this is acceptable because of its significant
performance loss (reading the entire stream)...

Atsushi Eno

Kornél Pál wrote:
Hi,

Character set detection.

This code uses a UTF8Encoding with throwOnInvalidBytes. StreamReader
detects the BOM (UTF-8, Unicode, Unicode (Big-Endian)). UTF-8 is easy to
validate as it has strict rules regarding the byte representation of
characters. So it's safe to assume that a text is UTF-8 if it can be
parsed as UTF-8. UTF8Encoding (with throwOnInvalidBytes) throws
ArgumentException when the input is not UTF-8. In this case the code
falls back to Encoding.Default.
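
The order of checks described above could be sketched like this. This is
Python, only to illustrate the logic, and not the actual patch: the function
name and the "cp1252" fallback are assumptions standing in for
Encoding.Default, which on .NET depends on the system code page.

```python
import codecs


def detect_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess a text encoding: BOM first, then strict UTF-8, then fallback.

    Hypothetical helper; `fallback` stands in for the platform default
    encoding and "cp1252" is only an example value.
    """
    # 1. A BOM decides immediately (UTF-8, UTF-16 LE, UTF-16 BE).
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. No BOM: accept UTF-8 only if every byte sequence is well-formed.
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Not valid UTF-8: use the default (legacy) encoding.
        return fallback
```

Pure ASCII input is reported as UTF-8 here, which is harmless because the
two agree on those bytes.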

Unicode (16-bit) is not detected by csc.exe without a BOM, so I think we
shouldn't deal with it.

Kornél

_______________________________________________
Mono-devel-list mailing list
[email protected]
http://lists.ximian.com/mailman/listinfo/mono-devel-list

