Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Philippe Verdy
2015-05-09 6:37 GMT+02:00 Markus Scherer :

> Where did you find that definition of "plain text"?
>

I have not said that Unicode defines what plain text is. It is defined in
the RFC describing the MIME type and giving it the name "plain text".

> Unicode just defines "plain text" by contrast with "rich text" which is
> text with markup or other such structure. There is no limitation of code
> points associated with that term.
> http://unicode.org/glossary/#plain_text
>

This is not a definition, or it is merely a definition of "Unicode plain
text" (i.e. more restrictive than "plain text"). Please don't add
restricting/qualifying words ("Unicode") that I left out of my sentence
**on purpose**.

Plain text was defined long before Unicode wrote its informative
glossary.


Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Philippe Verdy
Note: I used "16-bit string" in my sentence, NOT "Unicode 16-bit string"
which I used in the later part of my sentence (but also including 8-bit and
32-bit for the same restrictions in "Unicode strings")... So no
contradiction.


2015-05-09 7:55 GMT+02:00 Philippe Verdy :

>
>
> 2015-05-09 6:37 GMT+02:00 Markus Scherer :
>
>> On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy 
>> wrote:
>>
>>> 2015-05-09 5:13 GMT+02:00 Richard Wordingham <
>>> richard.wording...@ntlworld.com>:
>>>
>>>> I can't think of a practical use for the specific concepts of Unicode
>>>> 8-bit, 16-bit and 32-bit strings.  Unicode 16-bit strings are
>>>> essentially the same as 16-bit strings, and Unicode 32-bit strings are
>>>> UTF-32 strings.   'Unicode 8-bit string' strikes me as an exercise in
>>>> pedantry; there are more useful categories of 8-bit strings that are
>>>> not UTF-8 strings.
>>>>
>>>
>>> And here you're wrong: a 16-bit string is just a sequence of arbitrary
>>> 16-bit code units, but an Unicode string (whatever the size of its code
>>> units) adds restrictions for validity (the only restriction being in fact
>>> that surrogates (when present in 16-bit strings, i.e. UTF-16) must be
>>> paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are
>>> forbidden.
>>>
>>
>> No, Richard had it right. See for example definition D82 "Unicode 16-bit
>> string" in the standard. (Section 3.9 Unicode Encoding Forms,
>> http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf)
>>
>
> I was right: D82 refers to "UTF-16", which implies the restriction of
> validity, i.e. NO isolated/unpaired surrogates (but no exclusion of
> non-characters).
>
> I was right; you and Richard were wrong.
>
>


Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Philippe Verdy
2015-05-09 6:37 GMT+02:00 Markus Scherer :

> On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy  wrote:
>
>> 2015-05-09 5:13 GMT+02:00 Richard Wordingham <
>> richard.wording...@ntlworld.com>:
>>
>>> I can't think of a practical use for the specific concepts of Unicode
>>> 8-bit, 16-bit and 32-bit strings.  Unicode 16-bit strings are
>>> essentially the same as 16-bit strings, and Unicode 32-bit strings are
>>> UTF-32 strings.   'Unicode 8-bit string' strikes me as an exercise in
>>> pedantry; there are more useful categories of 8-bit strings that are
>>> not UTF-8 strings.
>>>
>>
>> And here you're wrong: a 16-bit string is just a sequence of arbitrary
>> 16-bit code units, but an Unicode string (whatever the size of its code
>> units) adds restrictions for validity (the only restriction being in fact
>> that surrogates (when present in 16-bit strings, i.e. UTF-16) must be
>> paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are
>> forbidden.
>>
>
> No, Richard had it right. See for example definition D82 "Unicode 16-bit
> string" in the standard. (Section 3.9 Unicode Encoding Forms,
> http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf)
>

I was right: D82 refers to "UTF-16", which implies the restriction of
validity, i.e. NO isolated/unpaired surrogates (but no exclusion of
non-characters).

I was right; you and Richard were wrong.


Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Markus Scherer
On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy  wrote:

> 2015-05-09 5:13 GMT+02:00 Richard Wordingham <
> richard.wording...@ntlworld.com>:
>
>> I can't think of a practical use for the specific concepts of Unicode
>> 8-bit, 16-bit and 32-bit strings.  Unicode 16-bit strings are
>> essentially the same as 16-bit strings, and Unicode 32-bit strings are
>> UTF-32 strings.   'Unicode 8-bit string' strikes me as an exercise in
>> pedantry; there are more useful categories of 8-bit strings that are
>> not UTF-8 strings.
>>
>
> And here you're wrong: a 16-bit string is just a sequence of arbitrary
> 16-bit code units, but an Unicode string (whatever the size of its code
> units) adds restrictions for validity (the only restriction being in fact
> that surrogates (when present in 16-bit strings, i.e. UTF-16) must be
> paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are
> forbidden.
>

No, Richard had it right. See for example definition D82 "Unicode 16-bit
string" in the standard. (Section 3.9 Unicode Encoding Forms,
http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf)

I agree that the definitions for Unicode 8-bit and 32-bit strings are not
particularly useful.

> For being "plain-text" there are additional restrictions: non-characters
> are also excluded, and only a small subset of controls (basically tabs and
> newlines) is allowed (the other controls, including U+0000, are restricted
> for private protocols and not designed for plain text... except
> specifically in a few legacy encoded 8-bit "charsets" like VISCII or ISO
> 2022 or Videotext which need these controls in fact to represent characters
> into sequences, possibly with contextual encoding).
>

Where did you find that definition of "plain text"?
Unicode just defines "plain text" by contrast with "rich text" which is
text with markup or other such structure. There is no limitation of code
points associated with that term.
http://unicode.org/glossary/#plain_text

markus


Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Philippe Verdy
2015-05-09 3:27 GMT+02:00 Daniel Bünzli :

> Le samedi, 9 mai 2015 à 02:33, Philippe Verdy a écrit :
> > 2015-05-08 14:32 GMT+02:00 Daniel Bünzli  (mailto:daniel.buen...@erratique.ch)>:
> > > Well did you test them all ? There's quite a big list here
> http://www.json.org. Taking a random one mentioned on that page leads me
> to http://golang.org/pkg/encoding/json/ in which they say that they
> replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very
> surprising since apparently go's strings as text are UTF-8 encoded so when
> you need to produce your results as UTF-8 then you don't have a lot of
> solutions... error and/or U+FFFD.
> >
> >
> > I've already said that JSON is UTF-8 encoded by default, but this does
> not mean that JSON invalidates the escape sequence '\uD800' isolated in a
> string.
>
> You didn't get what I said. When a parser returns a JSON string it just
> parsed and that it wants to give it back to the programmer using the native
> string of the language and that these strings happen to be UTF-8 encoded in
> this language, then in presence of such lone surrogates you are stuck and
> need to do something as you cannot encode them in the UTF-8 string.
>

You are not stuck! You can still regenerate a valid JSON output encoded in
UTF-8: it will once again use escape sequences (which are also needed if
your text contains the quotation marks used to delimit JSON strings in its
syntax).
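
As an illustration of the round trip described above, here is a minimal
sketch using Python's standard json module. It only shows how that one
implementation behaves; the variable names are made up for the example, and
other JSON libraries may behave differently.

```python
import json

# A JSON text containing an escaped, unpaired high surrogate.
text = '"\\ud800"'

# Python's json module keeps the lone surrogate as a code point in the str.
value = json.loads(text)

# With the default ensure_ascii=True, the encoder writes the surrogate back
# as an escape sequence, so the serialized text is pure ASCII and can be
# encoded as UTF-8 without any problem.
round_tripped = json.dumps(value)
encoded = round_tripped.encode("utf-8")
```

The serialized form never contains the surrogate itself, only the six ASCII
characters of its escape sequence.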

Unlike UTF-8, JSON was never designed to restrict the values represented by
its strings to plain text only; it is only a serialization of "strings" to
valid plain text using a custom syntax.

There's absolutely no need to restrict string values to the same
validation rules and the same subset as the set of acceptable plain text:
these are not the same layer. One is the string level (in fact not bound to
any character encoding and not restricted to text), the other is the
plain text, and JSON is the adapter/converter between these two
representations. Do not mix these two distinct layers.

(this is also the case when someone confuses an XML document with its DOM:
not the same layer)


Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Philippe Verdy
2015-05-09 5:13 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> I can't think of a practical use for the specific concepts of Unicode
> 8-bit, 16-bit and 32-bit strings.  Unicode 16-bit strings are
> essentially the same as 16-bit strings, and Unicode 32-bit strings are
> UTF-32 strings.   'Unicode 8-bit string' strikes me as an exercise in
> pedantry; there are more useful categories of 8-bit strings that are
> not UTF-8 strings.
>

And here you're wrong: a 16-bit string is just a sequence of arbitrary
16-bit code units, but a Unicode string (whatever the size of its code
units) adds restrictions for validity: the only restriction is in fact
that surrogates, when present in 16-bit strings (i.e. UTF-16), must be
paired, and that in 32-bit (UTF-32) and 8-bit (UTF-8) strings surrogates
are forbidden.

So the concept of "Unicode string" is in fact the same as valid Unicode
text: it is a subset of possible strings, restricted by validation rules:
- for 8-bit strings (UTF-8) there are other constraints (not all bytes are
acceptable and some pairs of bytes are also restricted, and final bytes
cannot occur alone)
- for 16-bit strings (UTF-16), the only constraint is on isolated/unpaired
surrogates
- for 32-bit strings (UTF-32), the only constraint is on the two allowed
ranges of encoded code points (U+0000..U+D7FF and U+E000..U+10FFFF).
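
The 16-bit and 32-bit rules in the list above can be sketched as follows.
This is only an illustration in Python; the function names are hypothetical,
and the inputs are assumed to be sequences of integers that already fit the
stated code unit size.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate code unit is part of a high+low pair."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return False
        i += 1
    return True


def is_well_formed_utf32(units):
    """Return True if every code unit lies in one of the two allowed ranges."""
    return all(0 <= u <= 0xD7FF or 0xE000 <= u <= 0x10FFFF for u in units)
```

Note that a noncharacter such as 0xFFFF passes both checks; only surrogates
and out-of-range values are rejected.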

For being "plain-text" there are additional restrictions: non-characters
are also excluded, and only a small subset of controls (basically tabs and
newlines) is allowed (the other controls, including U+0000, are restricted
for private protocols and not designed for plain text... except
specifically in a few legacy encoded 8-bit "charsets" like VISCII or ISO
2022 or Videotext which need these controls in fact to represent characters
into sequences, possibly with contextual encoding).


Re: Surrogates and noncharacters

2015-05-08 Thread Richard Wordingham
On Sat, 9 May 2015 02:26:59 +0200
Daniel Bünzli  wrote:

> Le samedi, 9 mai 2015 à 00:37, Doug Ewell a écrit :

> > This means noncharacters may appear in a well-formed UTF-8, -16, or
> > -32 string,

> I take "appear" to mean "be encoded". Yes, any Unicode encoding
> form allows all scalar values to be interchanged, by D79.

> (However noncharacters are not designed to be openly interchanged see
> "Restricted interchange" on p. 31. of 7.0.0)

That is irrelevant, for JSON is not restricted to open interchange.

Richard.



Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Richard Wordingham
On Sat, 9 May 2015 02:26:59 +0200
Daniel Bünzli  wrote:

> Le samedi, 9 mai 2015 à 00:37, Doug Ewell a écrit :
> > Noncharacters are Unicode scalar values,

> (However noncharacters are not designed to be openly interchanged see
> "Restricted interchange" on p. 31. of 7.0.0)

That didn't stop their being openly interchanged.

> > They may both be part of a "Unicode string" which does not claim to
> > be in any given encoding form.

> Not sure what you mean by that. So I let someone else answer.  

There are a number of phrases whose declared meanings cannot be
deduced from the individual words.  A UTF-8, UTF-16 or UTF-32 string
defines a sequence of scalar values.  However, a Unicode 8-bit, 16-bit
or 32-bit string is merely a sequence of 8-bit, 16-bit or 32-bit
values that may occur in a UTF-8, UTF-16 or UTF-32 string
respectively.  This definition has some odd consequences:

A Unicode 32-bit string is a UTF-32 string, for UTF-32 is not a
multi-word encoding.  An arbitrary string of unsigned 32-bit values is
not in general a Unicode 32-bit string.

All strings of unsigned 16-bit values are Unicode 16-bit strings.  Not
all (Unicode) 16-bit strings are UTF-16 strings.

Not all strings of unsigned 8-bit values are Unicode 8-bit strings, and
not all Unicode 8-bit strings are UTF-8 strings.
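
The 8-bit case can be sketched in Python as follows, under the reading given
above that a Unicode 8-bit string is any sequence of bytes each of which may
occur somewhere in a UTF-8 string. The function names are hypothetical, and
the well-formedness check simply relies on Python's strict UTF-8 decoder.

```python
# Byte values that never occur anywhere in well-formed UTF-8.
NEVER_IN_UTF8 = frozenset({0xC0, 0xC1} | set(range(0xF5, 0x100)))


def is_unicode_8bit_string(data: bytes) -> bool:
    """Every byte is a value that can occur somewhere in a UTF-8 string."""
    return not any(b in NEVER_IN_UTF8 for b in data)


def is_utf8_string(data: bytes) -> bool:
    """The whole byte sequence is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A lone continuation byte such as b'\x80' passes the first check but not the
second, while b'\xff' fails both.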

I can't think of a practical use for the specific concepts of Unicode
8-bit, 16-bit and 32-bit strings.  Unicode 16-bit strings are
essentially the same as 16-bit strings, and Unicode 32-bit strings are
UTF-32 strings.   'Unicode 8-bit string' strikes me as an exercise in
pedantry; there are more useful categories of 8-bit strings that are
not UTF-8 strings.

Richard.



Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Daniel Bünzli
Le samedi, 9 mai 2015 à 02:33, Philippe Verdy a écrit :
> 2015-05-08 14:32 GMT+02:00 Daniel Bünzli  (mailto:daniel.buen...@erratique.ch)>:
> > Well did you test them all ? There's quite a big list here 
> > http://www.json.org. Taking a random one mentioned on that page leads me to 
> > http://golang.org/pkg/encoding/json/ in which they say that they replace 
> > invalid UTF-16 surrogate pairs by U+FFFD. This is really not very 
> > surprising since apparently go's strings as text are UTF-8 encoded so when 
> > you need to produce your results as UTF-8 then you don't have a lot of 
> > solutions... error and/or U+FFFD.
>  
>  
> I've already said that JSON is UTF-8 encoded by default, but this does not 
> mean that JSON invalidates the escape sequence '\uD800' isolated in a string.

You didn't get what I said. When a parser has just parsed a JSON string and 
wants to hand it back to the programmer as a native string of the language, 
and strings in that language happen to be UTF-8 encoded, then in the presence 
of such lone surrogates you are stuck and need to do something, as you cannot 
encode them in the UTF-8 string.  
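
A minimal sketch of that situation, using Python's json module as a stand-in
for a parser whose result must end up as UTF-8. The helper name
to_utf8_or_replace is hypothetical; it shows the U+FFFD option, while the
try/except shows the error option.

```python
import json


def to_utf8_or_replace(s: str) -> bytes:
    """Encode to UTF-8, substituting U+FFFD for any surrogate code point."""
    cleaned = "".join("\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s)
    return cleaned.encode("utf-8")


# A lone high surrogate between 'a' and 'b'.
value = json.loads('"a\\ud800b"')

# Option 1: report an error, since strict UTF-8 cannot encode a surrogate.
try:
    value.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Option 2: replace the surrogate by U+FFFD before encoding.
replaced = to_utf8_or_replace(value)
```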

(I understand that in *your* interpretation this should not happen since I 
should define a special data type to represent these JSON strings so that they 
behave like JavaScript strings; that would be indeed very practical, none of my 
language native string tools can be used on that…)
  
Anyways, we are largely OT at this point.  

Best,

Daniel





Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Philippe Verdy
2015-05-08 14:32 GMT+02:00 Daniel Bünzli :

> Le vendredi, 8 mai 2015 à 13:48, Philippe Verdy a écrit :
> > JSON came initially from Javascript, and it is used extensively with
> Javascript.
>
> But not *only* for a long time now.
>
> > The RFC is deviating from the currently running implementations.
>
> Well did you test them all ? There's quite a big list here
> http://www.json.org. Taking a random one mentioned on that page leads me
> to http://golang.org/pkg/encoding/json/ in which they say that they
> replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very
> surprising since apparently go's strings as text are UTF-8 encoded so when
> you need to produce your results as UTF-8 then you don't have a lot of
> solutions... error and/or U+FFFD.
>

I've already said that JSON is UTF-8 encoded by default, but this does not
mean that JSON invalidates the escape sequence '\uD800' isolated in a
string.
For this reason JSON strings are not restricted by the textual encoding of
their syntactic representation.

So no error returned, no replacement by U+FFFD and even unpaired surrogates
are possible, provided that they are escaped.
Basically JSON strings remain equivalent to Javascript strings where
'\uD800' is also a perfectly valid "string".

I make the difference between a "string" and plain-text.

And if the RFC had not been so confusing in mixing terms (notably the term
"code point"), it might have become a standard. For now it is just a
tentative attempt to standardize JSON, and it does not match existing
implementations, which started from the beginning as a data serialization
format based on Javascript syntax. That format only removed the items that
are not pure data (functions/methods, more complex objects such as
Javascript regexp literals, which are functionally equivalent to an object
constructor, object references...) and kept only strings, numbers, and two
structures: ordered arrays and unordered associative arrays. The latter are
also called dictionaries, and they subsume ordered arrays if these are
treated as associative with number keys; this reduces JSON to only one
effective structure, even though ordered arrays also have simpler syntactic
sugar that represents them more compactly.

If you mean that the JSON string "\uD800" is invalid, then JSON is no longer
a data serialization for Javascript, or for the other languages that also
use JSON as a possible syntax for serializing data into plain text. JSON was
created because XML (the alternative) was too verbose and had restrictions
in its "text" elements. It seems that the RFC just wants to apply to JSON
the same restrictions as found in XML, but this makes JSON deviate from its
objective, and I'm convinced that such restrictions are not enforced at all
in many JSON implementations, which do not attempt to validate whether the
value of the represented string is valid plain text. JSON only transforms
strings into a valid plain-text representation using an encoding syntax with
separators and escape sequences, nothing else.

If the RFC wants to add such restrictions, it is mixing two layers: the
syntactic (plain text) layer and the lower layer for the internally
represented values, which are just a stream of code units.

And the only difference in that case is the behavior for isolated/unpaired
surrogates: they are not restricted in Javascript or in many languages
defining "strings", but they are restricted in plain text, and JSON is there
to offer the serialization scheme that allows strings to be safely converted
to plain text.


Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Daniel Bünzli
Le samedi, 9 mai 2015 à 00:37, Doug Ewell a écrit :
> Noncharacters are Unicode scalar values,

Noncharacters are Unicode scalar values by definitions D14 and D76.
  
> while unpaired surrogates are not.

No surrogate code point is a Unicode scalar value, by D71, D73 and D76.
  
> This means noncharacters may appear in a well-formed UTF-8, -16, or
> -32 string,

I take "appear" to mean "be encoded". Yes, any Unicode encoding form allows 
all scalar values to be interchanged, by D79.

(However noncharacters are not designed to be openly interchanged see 
"Restricted interchange" on p. 31. of 7.0.0)

> while unpaired surrogates may not.

No surrogate code point, *paired or not*, can be encoded in UTF-{8,16,32}, by 
D92, D91, D90. All these encoding forms, by definition, assign only Unicode 
scalar values to code unit sequences (see also the already mentioned p. 31, 
which clarifies this).

However in UTF-16 code unit sequences may contain surrogate pairs (that taken 
together represent a Unicode scalar value).
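
The distinctions drawn above can be sketched with three small predicates on
code points. This is an illustration in Python with hypothetical function
names; the noncharacter ranges (U+FDD0..U+FDEF and the last two code points
of each plane) are taken from the standard, not from this thread.

```python
def is_surrogate(cp: int) -> bool:
    """Surrogate code points: U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp: int) -> bool:
    """Unicode scalar values: every code point except the surrogates."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for each plane n."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```

A noncharacter such as U+FFFF is a scalar value and can be encoded, for
example with chr(0xFFFF).encode("utf-8"), whereas the same call on
chr(0xD800) raises UnicodeEncodeError in Python.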

> They may both be part of a "Unicode string" which does not claim to be in any 
> given encoding
> form.

Not sure what you mean by that. So I let someone else answer.  

Best,

Daniel  





Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Doug Ewell
Richard Wordingham  wrote:

>> Try by yourself, you can perfectly send JSON text containing '\uFFFF'
>> (non-character) or '\uF800' (unpaired surrogate) and I've not seen
>> any JSON implementation complaining about one or the other, when
>> receiving the JSON stream and using it in Javascript, you'll see no
>> missing code unit or replaced code units and no exception as well.
>
> Unicode Consortium standards and recommendations allow non-characters
> to be sent; as far as I can make out, they are just not to be thought
> of as unstandardised graphic characters.

As I understand it, from a purely Unicode standpoint, there are
differences here between noncharacters and unpaired surrogates.

Noncharacters are Unicode scalar values, while unpaired surrogates are
not. This means noncharacters may appear in a well-formed UTF-8, -16, or
-32 string, while unpaired surrogates may not. They may both be part of
a "Unicode string" which does not claim to be in any given encoding
form.

Authoritative corrections are welcome to help solidify my understanding.

I don't wish to get involved in debates over JSON. I've read RFC 7159
and I know what it says.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Richard Wordingham
On Fri, 8 May 2015 05:08:21 +0200
Philippe Verdy  wrote:

> Try by yourself, you can perfectly send JSON text containing '\uFFFF'
> (non-character) or '\uF800' (unpaired surrogate) and I've not seen
> any JSON implementation complaining about one or the other, when
> receiving the JSON stream and using it in Javascript, you'll see no
> missing code unit or replaced code units and no exception as well.

Unicode Consortium standards and recommendations allow non-characters
to be sent; as far as I can make out, they are just not to be thought of
as unstandardised graphic characters.

Richard.


Re: Script / font support in Windows 10

2015-05-08 Thread Richard Wordingham
On Fri, 8 May 2015 17:16:01 +0000
"Andrew Glass (WINDOWS)"  wrote:

> I agree that there is some work to be done to ensure correct display
> of Tai Tham. That work may involve changes to USE in a future update.

That's as I understood it, which is why I was surprised by the
degree of commitment in the overview.  I did wonder if the overview had
been written long ago, so its author was unaware of there being issues
with USE and Tai Tham.

For example, I got the impression that you had contemplated cloning USE
and modifying that clone for Tai Tham, so as to keep the USE simpler.
(In the meantime, it may make sense to use the USE for Tai Tham, and let
the font clean up the inappropriate dotted circles.  I currently do
that for applications that use old versions of HarfBuzz.) Also, I
hadn't expected you to commit to a timetable.

Richard.


RE: Script / font support in Windows 10

2015-05-08 Thread Andrew Glass (WINDOWS)
Hi Richard,

I agree that there is some work to be done to ensure correct display of Tai 
Tham. That work may involve changes to USE in a future update. We will have a 
panel on Universal Shaping at the upcoming IUC conference. That will be a good 
opportunity for a discussion between implementers and font developers. If you 
are able to attend that would be great. If not, we can certainly go through the 
proposed changes you have sent.

Cheers,

Andrew


-----Original Message-----
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham
Sent: Friday, May 8, 2015 9:50 AM
To: unicode@unicode.org
Subject: Re: Script / font support in Windows 10

On Fri, 8 May 2015 14:15:55 +0000
Peter Constable  wrote:

> I think this is the right public link:
> 
> https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx

Does this confirm the intention of Microsoft that at some stage the Universal 
Shaping Engine (USE) in Windows 10 will support the Tai Tham script?  In 
February we discovered that the USE didn't support syllable-final 
SAKOT+consonant - the commonest and eponymous use of
U+1A60 TAI THAM SIGN SAKOT, which may well be the commonest character
in the Tai Tham script.  For example, we can't write the name of the city of 
'Chiang Rai' in the Tai Tham script using the USE.

Richard.



Re: Script / font support in Windows 10

2015-05-08 Thread Richard Wordingham
On Fri, 8 May 2015 14:15:55 +0000
Peter Constable  wrote:

> I think this is the right public link:
> 
> https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx

Does this confirm the intention of Microsoft that at some stage the
Universal Shaping Engine (USE) in Windows 10 will support the Tai Tham
script?  In February we discovered that the USE didn't support
syllable-final SAKOT+consonant - the commonest and eponymous use of
U+1A60 TAI THAM SIGN SAKOT, which may well be the commonest character
in the Tai Tham script.  For example, we can't write the name of the
city of 'Chiang Rai' in the Tai Tham script using the USE.

Richard.


Re: Script / font support in Windows 10

2015-05-08 Thread Mark Davis ☕️
Thanks!


Mark 

*— Il meglio è l’inimico del bene —*

On Fri, May 8, 2015 at 7:15 AM, Peter Constable 
wrote:

>  I think this is the right public link:
>
>
>
> https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx
>
>
>
>
>
> *From:* Peter Constable
> *Sent:* Thursday, May 7, 2015 10:29 PM
> *To:* Peter Constable; unicode@unicode.org
> *Subject:* RE: Script / font support in Windows 10
>
>
>
> Oops… my bad: maybe it isn’t on live servers yet. It will be soon. I’ll
> update with the public link when it is.
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org
> ] *On Behalf Of *Peter Constable
> *Sent:* Thursday, May 7, 2015 10:15 PM
> *To:* unicode@unicode.org
> *Subject:* Script / font support in Windows 10
>
>
>
> This page on MSDN that provides an overview of Windows support for
> different scripts has now been updated for Windows 10:
>
>
>
> https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099
>
>
>
>
>
>
>
> Peter
>


Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Doug Ewell
I interpreted Roger Costello's original question literally, that he
wanted to find instances of '\uXXXX' that do not represent an ASSIGNED
Unicode character. Apologies if this discussion is really about
something else.
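
Under that literal reading, a minimal sketch in Python could look like the
following. The function name unassigned_escapes is hypothetical, "assigned"
is approximated by General_Category other than Cn in the Unicode version
bundled with the running Python, and each \uXXXX escape is looked at on its
own, so the two halves of a surrogate pair are seen as surrogates (Cs) and
not as the supplementary character they encode.

```python
import re
import unicodedata

# Matches any backslash escape; group 1 is set only for \uXXXX escapes.
# Consuming other escapes (such as \\) avoids misreading a literal
# backslash followed by "u0041" as a \u escape.
_ESCAPE = re.compile(r'\\(?:u([0-9A-Fa-f]{4})|.)')


def unassigned_escapes(json_text):
    """Return the \\uXXXX escapes whose code point has General_Category Cn."""
    found = []
    for m in _ESCAPE.finditer(json_text):
        if m.group(1) is None:
            continue  # some other escape, e.g. a newline or quote escape
        cp = int(m.group(1), 16)
        if unicodedata.category(chr(cp)) == 'Cn':
            found.append(m.group(0))
    return found
```

Noncharacters such as U+FFFF are reported by this check because their
General_Category is Cn, while private-use code points (Co) and surrogates
(Cs) are not.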

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸



RE: Script / font support in Windows 10

2015-05-08 Thread Peter Constable
I think this is the right public link:

https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx


From: Peter Constable
Sent: Thursday, May 7, 2015 10:29 PM
To: Peter Constable; unicode@unicode.org
Subject: RE: Script / font support in Windows 10

Oops... my bad: maybe it isn't on live servers yet. It will be soon. I'll 
update with the public link when it is.

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Peter Constable
Sent: Thursday, May 7, 2015 10:15 PM
To: unicode@unicode.org
Subject: Script / font support in Windows 10

This page on MSDN that provides an overview of Windows support for different 
scripts has now been updated for Windows 10:

https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099



Peter


Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Daniel Bünzli
Le vendredi, 8 mai 2015 à 13:48, Philippe Verdy a écrit :
> JSON came initially from Javascript, and it is used extensively with 
> Javascript.  

But not *only* for a long time now.
  
> The RFC is deviating from the currently running implementations.

Well, did you test them all? There's quite a big list here: http://www.json.org. 
Taking a random one mentioned on that page leads me to 
http://golang.org/pkg/encoding/json/ in which they say that they replace 
invalid UTF-16 surrogate pairs by U+FFFD. This is really not very surprising 
since apparently Go's strings as text are UTF-8 encoded, so when you need to 
produce your results as UTF-8 you don't have a lot of solutions: error 
and/or U+FFFD.   

In any case, deviating or not, that's for the best, since it would be insane 
to impose JavaScript's string as a data structure for an interchange format 
that intends to be universal and *textual*.
  
Best,

Daniel



Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Philippe Verdy
2015-05-08 11:27 GMT+02:00 Costello, Roger L. :

>  Okay, I gave it a try. I created this string which contains binary data
> (sequence of arbitrary unsigned integers):
>
>
>
> "
> --
> æä}gõ› "
>
>

I did not say that this data did not have to be properly escaped. With
escaping (\uXXXX) it works with arbitrary sequences of 16-bit code units.


Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Philippe Verdy
JSON came initially from Javascript, and it is used extensively with
Javascript. My tests with their JSON parser show that any string that is
valid for Javascript is also valid in JSON (no exception raised, no
replaced characters, no deleted characters, even if there are unpaired
surrogates or non-characters like '\uFFFF').
The RFC is deviating from the currently running implementations.


2015-05-08 13:04 GMT+02:00 Daniel Bünzli :

> Le vendredi, 8 mai 2015 à 05:08, Philippe Verdy a écrit :
> > The RFC is just informative not normative,
>
> RFC 7159 is not informational, it is a proposed standard.
>
> > Try by yourself, you can perfectly send JSON text containing '\uFFFF'
> (non-character) or '\uF800' (unpaired surrogate) and I've not seen any JSON
> implementation complaining about one or the other,
> Well now you have (mine). The RFC is very clear that we are dealing with
> *text-based* data not *binary* data. Maybe programming languages that
> represent their Unicode strings as possibly invalid UTF-16 sequences will
> happily input this but as section 8.2 mentions that may not be the case
> everywhere, software receiving these values  "might return different values
> for the length of a string value or even suffer fatal runtime exceptions".
>
> Best,
>
> Daniel
>
>
>


Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Daniel Bünzli
Le vendredi, 8 mai 2015 à 05:08, Philippe Verdy a écrit :
> The RFC is just informative not normative,  

RFC 7159 is not informational, it is a proposed standard.  

> Try by yourself, you can perfectly send JSON text containing '\uFFFF' 
> (non-character) or '\uF800' (unpaired surrogate) and I've not seen any JSON 
> implementation complaining about one or the other,  

Well now you have (mine). The RFC is very clear that we are dealing with 
*text-based* data not *binary* data. Maybe programming languages that represent 
their Unicode strings as possibly invalid UTF-16 sequences will happily input 
this but as section 8.2 mentions that may not be the case everywhere, software 
receiving these values  "might return different values for the length of a 
string value or even suffer fatal runtime exceptions".  

Best,

Daniel





RE: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Costello, Roger L.
Philippe Verdy wrote:


> implementations just support JSON as plain 16-bit streams
> Try by yourself, you can perfectly send JSON text containing
> '\uFFFF' (non-character) or '\uD800' (unpaired surrogate) and
> I've not seen any JSON implementation complaining about one
> or the other

Okay, I gave it a try. I created this string which contains binary data 
(sequence of arbitrary unsigned integers):

"

æä}gõ› "

When I validated that string against this JSON Schema:

{
   "type" : "string"
}

using this online validator: https://json-schema-validator.herokuapp.com/

I got an error: Invalid JSON: parse error, line 1

I am pretty sure that Daniel is correct, JSON cannot contain arbitrary bit 
streams.


> The RFC is just informative not normative

Interesting! What does that mean? JSON vendors are free to ignore the JSON RFC 
and do as they see fit?

/Roger

From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy
Sent: Thursday, May 07, 2015 11:08 PM
To: Daniel Bünzli
Cc: Unicode@unicode.org; Costello, Roger L.; Markus Scherer
Subject: Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a 
Unicode character?

The RFC is just informative not normative, and the effective usage and 
implementations just support JSON as plain 16-bit streams, even if the 
transport syntax requires encoding it in plain-text (using some UTF, not 
necessarily UTF-8 even if this is the default).
Try by yourself, you can perfectly send JSON text containing '\uFFFF' 
(non-character) or '\uF800' (unpaired surrogate) and I've not seen any JSON 
implementation complaining about one or the other; when receiving the JSON 
stream and using it in Javascript, you'll see no missing code unit or replaced 
code units and no exception as well.

2015-05-08 3:22 GMT+02:00 Daniel Bünzli <daniel.buen...@erratique.ch>:
Le vendredi, 8 mai 2015 à 02:16, Philippe Verdy a écrit :
> It would be more exact to say that JSON strings, just like strings in 
> Javascript and Java or many programming languages are just binary streams of 
> 16-bit code units.

I suggest you have a careful read at RFC 7159 as it specifically implies that 
this is not the model it supports (albeit using broken or let's say 
ambiguous/imprecise Unicode terminology).

> Then the JSON processor will decode this text and will remap it to an 
> internal UTF-16 encoding (for characters that are not escaped) and the 
> "\uXXXX" will be decoded as plain 16-bit code units. The result will be a 
> stream of 16-bit code units, which can then externally be output and encoded 
> or stored in any convenient encoding that preserves this stream, EVEN if this 
> is not valid UTF-16.

I don't know where you get this from but you won't find any mention of this in 
the standard. We are dealing with text, Unicode scalar values, not encodings. 
At the risk of repeating myself, read section 8.2 of RFC 7159.

Best,

Daniel