Re: Unpaired surrogates

2015-10-20 Thread Asmus Freytag (t)
When it comes to methods operating on buffers there's always the tension between viewing the buffer as text elements vs. as data elements. For some purposes, from error detection to data cleanup, you need to be able to treat the buffer as data elements. For many ot

Re: Unpaired surrogates

2015-10-20 Thread Philippe Verdy
2015-10-20 2:07 GMT+02:00 Richard Wordingham < richard.wording...@ntlworld.com>: > Now, as we know, UTF-32 does not handle the full range of Unicode code > points; ??? All valid UTFs handle the full range of valid Unicode code points. This includes UTF-32 as well as UTF-16 and UTF-8 (and their v

Re: Unpaired surrogates

2015-10-19 Thread Richard Wordingham
On Mon, 19 Oct 2015 13:32:07 -0700 "Doug Ewell" wrote: > Richard Wordingham wrote: > > It was once the > > case that basic Unicode support in regular expressions required a > > regular expression engine to be able to search for specified lone > > surrogates - a real show-stopper for an engin

Re: Unpaired surrogates (was: Re: Why Work at Encoding Level?)

2015-10-19 Thread Markus Scherer
On Mon, Oct 19, 2015 at 1:32 PM, Doug Ewell wrote: > > ICU (but perhaps it's actually Java) seems to have a culture of > > tolerating lone surrogates, and rules for handling lone surrogates are > > strewn across the Unicode standards and annexes. > > I suspect you have an example. I have exampl

Re: Unpaired surrogates (was: Re: Why Work at Encoding Level?)

2015-10-19 Thread Philippe Verdy
2015-10-19 22:32 GMT+02:00 Doug Ewell: > Philippe Verdy wrote: > > > No! The "supplementary code points" (or "supplementary characters" > > when they are assigned to characters) are represented in UTF-16 as two > > **code units**, NOT as two "code points" (even if their binary values > > are rela
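
For readers following the code unit vs. code point distinction in this thread, here is a minimal Python sketch, not taken from any of the messages. It assumes a UTF-16 buffer is modelled as a plain list of 16-bit integers; the function names utf16_code_units and find_unpaired_surrogates are hypothetical and chosen only for illustration.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units that encode one Unicode scalar value."""
    # Surrogate code points (U+D800..U+DFFF) are not scalar values and
    # cannot be encoded by a conformant UTF.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]  # BMP: one code unit
    v = cp - 0x10000  # 20-bit offset into the supplementary planes
    # Supplementary code point: two code units (a surrogate pair),
    # which together still represent ONE code point.
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def find_unpaired_surrogates(units: list[int]) -> list[int]:
    """Return the indices of surrogate code units that are not part of a pair."""
    unpaired = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (lead) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            unpaired.append(i)
        elif 0xDC00 <= u <= 0xDFFF:  # low (trail) surrogate with no lead
            unpaired.append(i)
        i += 1
    return unpaired
```

Treating the buffer as data elements in this way lets a lone surrogate be reported by index instead of being silently replaced or rejected, which is one of the trade-offs the thread is debating.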

Unpaired surrogates (was: Re: Why Work at Encoding Level?)

2015-10-19 Thread Doug Ewell
Richard Wordingham wrote: >> This discussion was originally about how to handle unpaired >> surrogates, as if that were a normal use case. > > And the subject line was changed when the topic changed to > traversing strings. Granted. I've changed it again to reflect t