Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Doug Ewell
Markus Scherer wrote: Right, but as I said, those downstream tasks shouldn't be consumers of UTF-16 code units anyway. They should be consumers of Unicode code points, which by definition excludes loose surrogates. Code points include surrogates. Maybe you mean "UTF-32 code units" or "Unico

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Markus Scherer
On Fri, Nov 5, 2010 at 1:56 PM, Doug Ewell wrote:
> Right, but as I said, those downstream tasks shouldn't be consumers of
> UTF-16 code units anyway. They should be consumers of Unicode code
> points, which by definition excludes loose surrogates.
>
Code points include surrogates. Maybe you me

RE: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Doug Ewell
Asmus Freytag wrote:
>> Doing conversion and validation at different stages isn't a great
>> idea; that's how character encodings get involved with security
>> problems.
>
> Note that I am careful not to suggest that (and I'm sure Markus isn't
> either). "Handling" includes much more than code co

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Mark Davis ☕
I'm in general agreement.
1. A Unicode 16-bit string can contain any sequence of 16-bit code units: it might or might not be valid UTF-16.
2. Whenever a process is emitting a Unicode string, if it is *guaranteeing* that it is UTF-16, it must catch any unpaired surrogates and fix (e
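A minimal sketch of the "catch and fix" step in point 2, assuming the text is held as a list of 16-bit integers and that replacing each unpaired surrogate with U+FFFD is the chosen repair (the thread also mentions throwing an error as an alternative). The function name `repair_utf16` and its return shape are hypothetical, chosen only for illustration; this is not taken from any poster's utility.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def repair_utf16(units):
    """Return (repaired, offsets) for a sequence of 16-bit code units.

    `repaired` is well-formed UTF-16: every unpaired surrogate has been
    replaced by U+FFFD. `offsets` lists the input indexes that were
    unpaired, so the caller can report them instead of (or as well as)
    repairing. Hypothetical helper, for illustration only.
    """
    repaired = []
    offsets = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_high_surrogate(unit) and i + 1 < n and is_low_surrogate(units[i + 1]):
            # Well-formed pair: copy both code units unchanged.
            repaired.extend((unit, units[i + 1]))
            i += 2
        elif is_high_surrogate(unit) or is_low_surrogate(unit):
            # Lone high or lone low surrogate: record and replace.
            offsets.append(i)
            repaired.append(REPLACEMENT)
            i += 1
        else:
            repaired.append(unit)
            i += 1
    return repaired, offsets
```

For example, `repair_utf16([0x0041, 0xD800, 0x0042])` reports offset 1 and substitutes U+FFFD there, while a proper pair such as `0xD83D 0xDE00` passes through untouched.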

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Asmus Freytag
On 11/5/2010 7:02 AM, Doug Ewell wrote: Asmus Freytag wrote: I'm probably missing something here, but I don't agree that it's OK for a consumer of UTF-16 to accept an unpaired surrogate without throwing an error, or converting it to U+FFFD, or otherwise raising a fuss. Unpaired surrogates are

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Doug Ewell
Asmus Freytag wrote:
>> I'm probably missing something here, but I don't agree that it's OK
>> for a consumer of UTF-16 to accept an unpaired surrogate without
>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>> fuss. Unpaired surrogates are ill-formed, and have to be caug

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Martin J. Dürst
On 2010/11/05 8:30, Markus Scherer wrote: If the conversion libraries you are using do not support this (I don't know), then you could ask for such options. Or use conversion libraries that do support such options (like ICU and Java). The encoding conversion library in Ruby 1.9 also supports t

Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Martin J. Dürst
On 2010/11/05 2:46, Markus Scherer wrote: 16-bit Unicode is convenient in that when you find an unpaired surrogate (that is, it's not well-formed UTF-16) you can usually just treat it like a surrogate code point which normally has default properties much like an unassigned code point or nonchara
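The lenient handling described in this quote can be sketched as a code point iterator over 16-bit code units: a well-formed pair is combined into one supplementary code point, and an unpaired surrogate is passed through as its own surrogate code point rather than rejected. The name `code_points` is hypothetical, and the sketch assumes the input is a sequence of integers in the range 0..0xFFFF.

```python
def code_points(units):
    """Yield code points from 16-bit code units, tolerating ill-formed input.

    Unpaired surrogates are yielded as surrogate code points
    (0xD800..0xDFFF) instead of raising an error. Hypothetical helper,
    for illustration only.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Combine a high/low pair into one supplementary code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is.
            yield unit
            i += 1
```

A caller that must emit guaranteed UTF-16 would still need to repair or reject any value in 0xD800..0xDFFF that this iterator yields.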