RE: What does it mean to "not be a valid string in Unicode"?

2013-01-08 Thread Whistler, Ken
> Sorry, but I have to disagree here. If a list of strings contains items with lone surrogates (garbage), then sorting them doesn't make the garbage go away, even if the items may be sorted in "correct" order according to some criterion. Well, yeah, I wasn't claiming that the principled, "co

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-08 Thread Stephan Stiller
>> Wouldn't the clean way be to ensure valid strings (only) when they're built > Of course, the earlier erroneous data gets caught, the better. The problem is that error checking is expensive, both in lines of code and in execution time (I think there is data showing that in any real-lif

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-08 Thread Martin J. Dürst
On 2013/01/08 14:43, Stephan Stiller wrote: > Wouldn't the clean way be to ensure valid strings (only) when they're built? Of course, the earlier erroneous data gets caught, the better. The problem is that error checking is expensive, both in lines of code and in execution time (I think there i

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Mark Davis ☕
In practice and by design, treating isolated surrogates the same as reserved code points in processing, and then cleaning up on conversion to UTFs works just fine. It is a tradeoff that is up to the implementation. It has nothing to do with a "legacy of C pointer arithmetic". It does represent a p

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Stephan Stiller
Things like this are called "garbage in, garbage-out" (GIGO). It may be harmless, or it may hurt you later. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) Wouldn't the clean way be to ensure valid strings (only) when they're

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Whistler, Ken
> > http://www.unicode.org/reports/tr10/#Handline_Illformed Grrr. http://www.unicode.org/reports/tr10/#Handling_Illformed I seem unable to handle ill-formed spelling today. :( --Ken

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Whistler, Ken
Martin, The kind of situation Markus is talking about is illustrated particularly well in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to this issue: http://www.unicode.org/reports/tr10/#Handline_Illformed When weighting Unicode 16-bit strings for collation, you can
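
A minimal Java sketch of the kind of scan Ken is describing, assuming a 16-bit string as input (the weight lookup itself is left out; the point is only that an unpaired surrogate can be detected during the code-point walk and, as one of the options UTS #10 discusses, handled as if it were U+FFFD rather than aborting):

    import java.util.ArrayList;
    import java.util.List;

    final class IllFormedHandling {
        // Returns the code points a collator would go on to weight.
        static List<Integer> codePointsToWeight(String s) {
            List<Integer> out = new ArrayList<>();
            int i = 0;
            while (i < s.length()) {
                int cp = s.codePointAt(i);          // an unpaired surrogate comes back as itself
                int len = Character.charCount(cp);
                if (cp >= 0xD800 && cp <= 0xDFFF) { // ill-formed UTF-16: no matching pair
                    cp = 0xFFFD;                    // treat like U+FFFD for weighting purposes
                }
                out.add(cp);
                i += len;
            }
            return out;
        }

        public static void main(String[] args) {
            // "a", a lone high surrogate, "b"
            System.out.println(codePointsToWeight("a\uD800b"));  // [97, 65533, 98]
        }
    }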

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Mark Davis ☕
That's not the point (see successive messages). Mark — Il meglio è l'inimico del bene — On Mon, Jan 7, 2013 at 4:59 PM, "Martin J. Dürst" wrote: > On 2013/01/08 3:27, Markus Scherer wrote: >> Also, we commonly read code points from 16-

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Martin J. Dürst
On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things li
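
The behavior Markus and Martin are describing is visible with nothing more than java.lang.String; a small illustration:

    public class LoneSurrogateDemo {
        public static void main(String[] args) {
            String s = "a\uD800b";              // 16-bit string containing an unpaired high surrogate
            int cp = s.codePointAt(1);
            System.out.printf("U+%04X%n", cp);  // U+D800: the lone surrogate is returned as itself
            // Its general category is Cs (surrogate); processing can carry it along
            // and clean it up only when converting to a UTF.
            System.out.println(Character.getType(cp) == Character.SURROGATE);  // true
        }
    }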

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Whistler, Ken
Philippe also said: > ... Reserving "UTF-16" for what the standard discusses as a "16-bit string", except that it should still require UTF-16 conformance (no unpaired surrogates and no non-characters) ... For those following along, conformance to UTF-16 does *NOT* require "no non-characters"

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Whistler, Ken
Philippe Verdy said: > Well then I don't know why you need a definition of an "Unicode 16-bit string". For me it just means exactly the same as "16-bit string", and the encoding in it is not relevant given you can put anything in it without even needing to be conformant to Unicode. So a Java

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Mark Davis ☕
Because all well-formed sequences (and subsequences) are interpreted according to the corresponding UTF. That is quite different from a random byte stream with no declared semantics, or a byte stream with a different declared semantic. Thus if you are given a Unicode 8-bit string <61, 62, 80, 63>, y
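
One way to see the point of Mark's <61, 62, 80, 63> example in running code (not necessarily what he goes on to say): with Java's default UTF-8 decoding, which substitutes rather than reporting an error, the well-formed subsequences <61>, <62>, <63> keep their meaning and only the stray 0x80 needs special handling:

    import java.nio.charset.StandardCharsets;

    public class EightBitStringDemo {
        public static void main(String[] args) {
            byte[] unicode8bit = { 0x61, 0x62, (byte) 0x80, 0x63 };   // <61 62 80 63>
            // The well-formed pieces are interpreted as UTF-8; the lone 0x80 is
            // ill-formed and, under the default REPLACE action, becomes U+FFFD.
            String s = new String(unicode8bit, StandardCharsets.UTF_8);
            s.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
            // prints: U+0061 U+0062 U+FFFD U+0063
        }
    }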

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Philippe Verdy
Well then I don't know why you need a definition of an "Unicode 16-bit string". For me it just means exactly the same as "16-bit string", and the encoding in it is not relevant given you can put anything in it without even needing to be conformant to Unicode. So a Java string is exactly the same, a

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Doug Ewell
You're right, and I stand corrected. I read Markus's post too quickly. Mark Davis ☕ wrote: >> But still non-conformant. > That's incorrect. The point I was making above is that in order to say that something is "non-conformant", you have to be very clear what it is "non-conformant" TO.

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Mark Davis ☕
> But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is "non-conformant", you have to be very clear what it is "non-conformant" *TO*. > Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are retu

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Markus Scherer
On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell wrote: > Markus Scherer wrote: >> Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Doug Ewell
Markus Scherer wrote: > Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. But still non-conforman

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Markus Scherer
Unicode libraries commonly provide functions that take a code point and return a value, for example a property value. Such a function normally accepts the whole range 0..10FFFF (and may even return a default value for out-of-range inputs). Also, we commonly read code points from 16-bit Unicode str
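
Plain Java behaves like the libraries Markus describes; a quick check (out-of-range input is not rejected; current JDKs simply fall back to UNASSIGNED, though that detail is an implementation choice rather than a documented guarantee):

    public class PropertyLookupDemo {
        public static void main(String[] args) {
            // The whole code point range is accepted, including surrogates...
            System.out.println(Character.getType(0xD800) == Character.SURROGATE);   // true
            // ...and noncharacters, which simply report the Cn (unassigned) category.
            System.out.println(Character.getType(0xFDD0) == Character.UNASSIGNED);  // true
            // Out-of-range input falls back to a default value instead of failing.
            System.out.println(Character.getType(0x110000) == Character.UNASSIGNED);
        }
    }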

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-06 Thread Stephan Stiller
On Sun, Jan 6, 2013 at 12:34 PM, Mark Davis ☕ wrote: > [...] What you write, and that the UTFs have historical artifacts in their design, makes sense to me. > (There are many, many discussions of this in the Unicode email archives if you have more questions.) Okay. I am fine with ending this

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-06 Thread Mark Davis ☕
Some of this is simply historical: had Unicode been designed from the start with 8 and 16 bit forms in mind, some of this could be avoided. But that is water long under the bridge. Here is a simple example of why we have both UTFs and Unicode Strings. Java uses Unicode 16-bit Strings. The followin
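
A sketch of the kind of example Mark is describing (not necessarily his exact code): a Java String, being a Unicode 16-bit string, is free to hold an unpaired surrogate, and the cleanup only has to happen when the string is converted to an actual UTF:

    import java.nio.charset.StandardCharsets;

    public class UnicodeStringVsUtf {
        public static void main(String[] args) {
            // Legal as a Java (Unicode 16-bit) string, but not well-formed UTF-16,
            // because the high surrogate is unpaired.
            String s = "ab" + '\uD800' + "c";
            System.out.println(s.length());    // 4 code units

            // Conversion to a real UTF has to deal with the ill-formed part;
            // String.getBytes substitutes rather than emitting ill-formed UTF-8.
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) System.out.printf("%02X ", b);
            System.out.println();              // 'a', 'b', a replacement, then 'c'
        }
    }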

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-05 Thread Stephan Stiller
> If for example I sit on a committee that devises a new encoding form, I would need to be concerned with the question which *sequences of Unicode code points* are sound. If this is the same as "sequences of Unicode scalar values", I would need to exclude surrogates, if I read the standard
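
Since "Unicode scalar value" excludes exactly the surrogate range (D76 in the standard: U+0000..U+D7FF and U+E000..U+10FFFF), the check Stephan is after is a one-liner; for example:

    public class ScalarValueCheck {
        // D76: a Unicode scalar value is any code point except the surrogates.
        static boolean isScalarValue(int cp) {
            return (cp >= 0x0000 && cp <= 0xD7FF) || (cp >= 0xE000 && cp <= 0x10FFFF);
        }

        public static void main(String[] args) {
            System.out.println(isScalarValue(0x0041));   // true
            System.out.println(isScalarValue(0xD800));   // false (surrogate code point)
            System.out.println(isScalarValue(0xFDD0));   // true (a noncharacter, but still a scalar value)
        }
    }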

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Stephan Stiller
> If you are concerned with computer security If for example I sit on a committee that devises a new encoding form, I would need to be concerned with the question which /sequences of Unicode code points/ are sound. If this is the same as "sequences of Unicode scalar values", I would need to e

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Markus Scherer
On Fri, Jan 4, 2013 at 6:08 PM, Stephan Stiller wrote: > Is there a most general sense in which there are constraints beyond all characters being from within the range U+0000 ... U+10FFFF? If one is concerned with computer security, oddities that are absolute should raise a flag; somebody co

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Stephan Stiller
Thanks for all the information. Is there a most general sense in which there are constraints beyond all characters being from within the range U+0000 ... U+10FFFF? If one is concerned with computer security, oddities that are absolute should raise a flag; somebody could be messing with my syst

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Mark Davis ☕
To assess whether a string is invalid, it all depends on what the string is supposed to be. 1. As Ken says, if a string is supposed to be in a given encoding form (UTF), but it consists of an ill-formed sequence of code units for that encoding form, it would be invalid. So an isolated surrogate (e

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Whistler, Ken
operation, say a display algorithm which detects that as an unacceptable edge condition and inserts a virtual base for the combining mark in order not to break the display. --Ken What does it mean to not be a valid string in Unicode? Is there a concise answer in one place? For example, if

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Stephan Stiller
A Unicode string in UTF-8 encoding form could be ill-formed if the bytes don't follow the specification for UTF-8, for example. Given that answer, add "in UTF-32" to my email just now, for simplicity's sake. Or let's simply assume we're dealing with some sort of sequence of abstract integers
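
Concretely, "the bytes don't follow the specification for UTF-8" is mechanically checkable; a small sketch using Java's decoder in strict (REPORT) mode, with substitution being the other common option:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class Utf8WellFormedCheck {
        static boolean isWellFormedUtf8(byte[] bytes) {
            try {
                StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                return true;
            } catch (CharacterCodingException e) {
                return false;                     // some byte sequence was ill-formed
            }
        }

        public static void main(String[] args) {
            System.out.println(isWellFormedUtf8(new byte[]{0x61, (byte) 0xC3, (byte) 0xA9}));        // "aé": true
            System.out.println(isWellFormedUtf8(new byte[]{(byte) 0xC0, (byte) 0xAF}));              // overlong sequence: false
            System.out.println(isWellFormedUtf8(new byte[]{(byte) 0xED, (byte) 0xA0, (byte) 0x80})); // encoded surrogate: false
        }
    }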

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Stephan Stiller
What does it mean to not be a valid string in Unicode? Is there a concise answer in one place? For example, if one uses the noncharacters just mentioned by Ken Whistler ("intended for process-internal uses, but [...] not permitted for interchange"), what precisely does that mean

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Whistler, Ken
> with a combining character, this new string will not be a valid string in Unicode. > What does it mean to not be a valid string in Unicode? > /Roger

What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Costello, Roger L.
Hi Folks, In the book, Fonts & Encodings (p. 61, first paragraph) it says: ... if we select a substring that begins with a combining character, this new string will not be a valid string in Unicode. What does it mean to not be a valid string in Unicode? /Roger
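
To make the book's scenario concrete: producing such a substring is trivially easy, and the result is what the standard calls a defective combining character sequence. Every code point in it is still a valid scalar value, which is part of why the question needs asking; for example:

    public class DefectiveSubstring {
        public static void main(String[] args) {
            String s = "e\u0301";             // 'e' followed by U+0301 COMBINING ACUTE ACCENT
            String sub = s.substring(1);      // begins with the combining mark:
                                              // a "defective combining character sequence"
            System.out.printf("U+%04X%n", sub.codePointAt(0));   // U+0301
            // Nothing here is ill-formed, so most APIs accept the string; a display
            // algorithm has to cope, e.g. by supplying a dotted-circle virtual base.
        }
    }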