Markus Scherer wrote:
>> Right, but as I said, those downstream tasks shouldn't be consumers
>> of UTF-16 code units anyway. They should be consumers of Unicode
>> code points, which by definition excludes loose surrogates.
>
> Code points include surrogates. Maybe you mean "UTF-32 code units" or
> "Unico
On Fri, Nov 5, 2010 at 1:56 PM, Doug Ewell wrote:
> Right, but as I said, those downstream tasks shouldn't be consumers of
> UTF-16 code units anyway. They should be consumers of Unicode code
> points, which by definition excludes loose surrogates.
>
Code points include surrogates. Maybe you me
Asmus Freytag wrote:
>> Doing conversion and validation at different stages isn't a great
>> idea; that's how character encodings get involved with security
>> problems.
>
> Note that I am careful not to suggest that (and I'm sure Markus isn't
> either). "Handling" includes much more than code co
I'm in general agreement.
1. A Unicode 16-bit string can contain any sequence of 16-bit code units:
it might or might not be valid UTF-16.
2. Whenever a process is emitting a Unicode string, if it is
*guaranteeing* that it is UTF-16, it must catch any unpaired surrogates
and fix (e
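
As a rough illustration of points 1 and 2, here is a minimal Python sketch that treats a 16-bit string as a plain list of code units, checks whether it is well-formed UTF-16, and repairs it before emission. The function names are hypothetical, and replacing each unpaired surrogate with U+FFFD is only one possible fix, assumed here because the message above is cut off before it names one.

```python
from typing import List

REPLACEMENT = 0xFFFD  # assumed repair policy: substitute U+FFFD


def is_high(u: int) -> bool:
    """True for a high (lead) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF


def is_low(u: int) -> bool:
    """True for a low (trail) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF


def is_well_formed_utf16(units: List[int]) -> bool:
    """Check that every surrogate code unit is part of a high+low pair."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if is_high(u):
            if i + 1 < n and is_low(units[i + 1]):
                i += 2  # valid pair, skip both units
                continue
            return False  # high surrogate with no low surrogate after it
        if is_low(u):
            return False  # low surrogate with no high surrogate before it
        i += 1
    return True


def repair_utf16(units: List[int]) -> List[int]:
    """Return a copy in which each unpaired surrogate becomes U+FFFD."""
    out: List[int] = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if is_high(u) and i + 1 < n and is_low(units[i + 1]):
            out.extend((u, units[i + 1]))  # keep valid pairs untouched
            i += 2
        elif is_high(u) or is_low(u):
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(u)
            i += 1
    return out
```

A process that only passes 16-bit strings along could skip the check; one that promises UTF-16 to its consumers would run something like `repair_utf16` (or raise an error instead) at the point of emission.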
On 11/5/2010 7:02 AM, Doug Ewell wrote:
> Asmus Freytag wrote:
>>> I'm probably missing something here, but I don't agree that it's OK
>>> for a consumer of UTF-16 to accept an unpaired surrogate without
>>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>>> fuss. Unpaired surrogates are
Asmus Freytag wrote:
>> I'm probably missing something here, but I don't agree that it's OK
>> for a consumer of UTF-16 to accept an unpaired surrogate without
>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>> fuss. Unpaired surrogates are ill-formed, and have to be caug
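
The two consumer behaviours named in the quoted text, throwing an error or converting to U+FFFD, can be sketched with Python's built-in UTF-16 codec and its standard `strict` and `replace` error handlers. The helper names and sample byte strings below are made up for illustration; this is not a claim about how any library discussed in the thread behaves.

```python
def decode_strict(data: bytes) -> str:
    """Decode UTF-16LE, raising UnicodeDecodeError on ill-formed input."""
    return data.decode("utf-16-le", errors="strict")


def decode_replacing(data: bytes) -> str:
    """Decode UTF-16LE, turning each ill-formed unit into U+FFFD."""
    return data.decode("utf-16-le", errors="replace")


def is_rejected(data: bytes) -> bool:
    """True if the strict decoder refuses the input."""
    try:
        decode_strict(data)
    except UnicodeDecodeError:
        return True
    return False


# Hypothetical sample inputs (little-endian 16-bit code units):
LONE_HIGH = b"\x3d\xd8\x41\x00"  # unpaired high surrogate D83D, then 'A'
LONE_LOW = b"\x41\x00\x00\xdc"   # 'A', then unpaired low surrogate DC00
PAIRED = b"\x3d\xd8\x00\xde"     # well-formed pair D83D DE00
```

Either policy makes the problem visible at the point of consumption, which is the behaviour argued for above; which one fits depends on whether the caller can do anything useful with an exception.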
On 2010/11/05 8:30, Markus Scherer wrote:
> If the conversion libraries you are using do not support this (I don't
> know), then you could ask for such options. Or use conversion libraries that
> do support such options (like ICU and Java).
The encoding conversion library in Ruby 1.9 also supports t
On 2010/11/05 2:46, Markus Scherer wrote:
> 16-bit Unicode is convenient in that when you find an unpaired surrogate
> (that is, it's not well-formed UTF-16) you can usually just treat it like a
> surrogate code point which normally has default properties much like an
> unassigned code point or nonchara
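
The lenient handling described in the quoted text can be approximated in Python with the `surrogatepass` error handler, which lets an unpaired surrogate through as a surrogate code point instead of rejecting it. This is only a sketch with hypothetical helper names; it is not meant to mirror what ICU or Java do internally.

```python
import unicodedata
from typing import List


def decode_passing_surrogates(data: bytes) -> str:
    """Decode UTF-16LE, keeping unpaired surrogates as code points."""
    return data.decode("utf-16-le", errors="surrogatepass")


def lone_surrogate_positions(text: str) -> List[int]:
    """Indexes of surrogate code points (general category Cs) in text.

    Well-formed pairs decode to a single supplementary character, so
    anything still in category Cs here was unpaired in the input.
    """
    return [i for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Cs"]


# Hypothetical input: 'A', unpaired high surrogate D83D, 'B'
sample = decode_passing_surrogates(b"\x41\x00\x3d\xd8\x42\x00")
```

Internal processing can carry such a string around without trouble; the positions returned by `lone_surrogate_positions` show where a later stage that guarantees well-formed output would still need to intervene.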