Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.

The proposal is to make ICU's spec violation conforming. I think there
is both a technical and a political reason why the proposal is a bad
idea.

First, the technical reason:

ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
representative of implementation concerns of implementations that use
UTF-8 as their in-memory Unicode representation.

Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes the concerns of such implementations very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake (since Unicode grew past 16 bits
anyway, making UTF-16 both variable-width *and* ASCII-incompatible;
i.e. widening the code units to be ASCII-incompatible didn't buy a
constant-width encoding after all) and that when the legacy
constraints of Win32, Java, C#, JavaScript, ICU, etc. don't force
UTF-16 as the internal Unicode representation, using UTF-8 as the
internal Unicode representation is the technically superior design:
using UTF-8 internally is memory-efficient and cache-efficient when
dealing with data formats whose syntax is mostly ASCII (e.g. HTML),
forces developers to handle variable-width issues right away, makes
input decoding a matter of mere validation without a copy when the
input is conforming, and makes output encoding infinitely fast (no
encode step is needed).

Therefore, despite UTF-16 being widely used as an in-memory
representation of Unicode and in no way going away, I think the
Unicode Consortium should be *very* sympathetic to technical
considerations for implementations that use UTF-8 as the in-memory
representation of Unicode.

When looking at this issue from the ICU perspective of using UTF-16 as
the in-memory representation of Unicode, it's easy to consider the
proposed change the easier option for implementations (after all, no
change to the ICU implementation is involved!). However, when UTF-8 is
the in-memory representation of Unicode and "decoding" UTF-8 input is
a matter of *validating* UTF-8, a state machine that rejects a
sequence as soon as it's impossible for the sequence to be valid UTF-8
(under the definition that excludes surrogate code points and code
points beyond U+10FFFF) makes a whole lot of sense. If the proposed
change was adopted, while Draconian decoders (that fail upon first
error) could retain their current state machine, implementations that
emit U+FFFD for errors and continue would have to add more state
machine states (i.e. more complexity) to consolidate more input bytes
into a single U+FFFD even after a valid sequence is obviously
impossible.
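
To make the difference concrete, here is a minimal sketch (plain
illustrative Rust, not the encoding_rs code) of the kind of fail-fast lossy
decoding the current recommendation permits: one U+FFFD per maximal
subpart, rejecting as soon as a byte can no longer continue a valid
sequence.

// Illustrative sketch only: one U+FFFD per maximal subpart, per the
// current recommendation. Not taken from ICU or encoding_rs.
fn decode_lossy(input: &[u8]) -> String {
    let mut out = String::new();
    let mut i = 0;
    while i < input.len() {
        let b = input[i];
        // trail = number of continuation bytes the lead byte calls for;
        // (lo, hi) = valid range of the *first* continuation byte, which is
        // what excludes overlongs, surrogates and values above U+10FFFF.
        let (trail, lo, hi, mut cp) = match b {
            0x00..=0x7F => { out.push(b as char); i += 1; continue; }
            0xC2..=0xDF => (1, 0x80, 0xBF, u32::from(b & 0x1F)),
            0xE0 => (2, 0xA0, 0xBF, u32::from(b & 0x0F)),
            0xE1..=0xEC | 0xEE..=0xEF => (2, 0x80, 0xBF, u32::from(b & 0x0F)),
            0xED => (2, 0x80, 0x9F, u32::from(b & 0x0F)), // no surrogates
            0xF0 => (3, 0x90, 0xBF, u32::from(b & 0x07)),
            0xF1..=0xF3 => (3, 0x80, 0xBF, u32::from(b & 0x07)),
            0xF4 => (3, 0x80, 0x8F, u32::from(b & 0x07)), // not above U+10FFFF
            // 0x80..=0xC1 and 0xF5..=0xFF can never start a valid sequence.
            _ => { out.push('\u{FFFD}'); i += 1; continue; }
        };
        i += 1;
        let mut ok = true;
        for t in 0..trail {
            let (lo, hi) = if t == 0 { (lo, hi) } else { (0x80, 0xBF) };
            match input.get(i) {
                Some(&c) if (lo..=hi).contains(&c) => {
                    cp = (cp << 6) | u32::from(c & 0x3F);
                    i += 1;
                }
                // Reject as soon as this can no longer be valid UTF-8; the
                // offending byte is *not* consumed and is examined next.
                _ => { ok = false; break; }
            }
        }
        out.push(if ok {
            char::from_u32(cp).expect("lead/trail ranges exclude invalid scalars")
        } else {
            '\u{FFFD}'
        });
    }
    out
}

With this structure the error arms never look ahead. Under the proposed
change, inputs such as the overlong C0 AF or the sequence E0 80 80 would,
as I read the proposal, have to be consolidated into a single U+FFFD, which
means carrying an error state plus a count of how many more trail-range
bytes to skip even though validity has already been ruled out.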

When the decision can easily go either way for implementations that
use UTF-16 internally but the options are not equal when using UTF-8
internally, the "UTF-8 internally" case should be decisive.
(Especially when, spec-wise, that decision involves no change. I
further note that the proposal PDF argues on the level of "feels
right" without even discussing the impact on implementations that use
UTF-8 internally.)

As a matter of implementation experience, the implementation I've
written (https://github.com/hsivonen/encoding_rs) supports both the
UTF-16-in-memory and the UTF-8-in-memory Unicode representation
scenarios, and the fail-fast requirement wasn't onerous even in the
UTF-16 scenario.

Second, the political reason:

Now that ICU is a Unicode Consortium project, I think the Unicode
Consortium should be particularly sensitive to biases arising from
being both the source of the spec and the source of a popular
implementation. If the way the Unicode Consortium resolves a
discrepancy between ICU behavior and a well-known spec provision (this
isn't some little-known corner case, after all) is by changing the
spec instead of changing ICU, it looks *really bad* both for the equal
footing of ICU vs. other implementations in how the standard is
developed and for the reliability of the standard text vs. ICU source
code as the source of truth that other implementors need to pay
attention to. That is *especially* so when the change is not neutral
for implementations that have made different but completely valid (per
the then-existing spec) and, in the absence of legacy constraints,
superior architectural choices compared to ICU (i.e. UTF-8 internally
instead of UTF-16 internally).

I can see the irony of this viewpoint coming from a WHATWG-aligned
browser developer, but I note that even browsers that use ICU for
legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior
isn't, in fact, the dominant browser UTF-8 behavior. That is, even
Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the
environment that's the most sensitive to how issues like this are
handled, so it would be appropriate for the proposal to survey current
browser behavior instead of just saying that ICU "feels right" or is
"natural".

Are Emoji ZWJ sequences characters?

2017-05-15 Thread William_J_G Overington via Unicode
I am concerned about emoji ZWJ sequences being encoded without going through 
the ISO process and whether Unicode will therefore lose synchronization with 
ISO/IEC 10646.

I have raised this by email and a very helpful person has advised me that 
encoding emoji sequences does not mean that Unicode and ISO/IEC 10646 fall out 
of synchronization, because ZWJ sequences are not *characters* and have no 
implications for ISO/IEC 10646, noting that ISO/IEC 10646 does not define ZWJ 
sequences. 

Now I have great respect for the person who advised me. However I am a 
researcher and I opine that I need evidence.

Thus I am writing to the mailing list in the hope that there will be a 
discussion please.

http://www.unicode.org/reports/tr51/tr51-11.html (A proposed update document)

http://www.unicode.org/Public/emoji/5.0/emoji-zwj-sequences.txt

http://www.unicode.org/charts/PDF/U1F300.pdf

http://www.unicode.org/charts/PDF/U1F680.pdf

In tr51-11.html at 2.3 Emoji ZWJ Sequences

quote

To the user of such a system, these behave like single emoji characters, even 
though internally they are sequences.

end quote

In emoji-zwj-sequences.txt there is the following line.

1F468 200D 1F680 ; Emoji_ZWJ_Sequence ; man astronaut 

From U1F300.pdf, 1F468 is MAN

200D is ZWJ

From U1F680.pdf, 1F680 is ROCKET

The reasoning upon which I base my concern is as follows.

0063 is c

0070 is p

0074 is t

If 0063 200D 0074 is used to specifically request a ct ligature in a display of 
some text, then the meaning of 0063 200D 0074 is the same as the meaning of 
0063 0074; indeed, a font with an OpenType table could cause a ct ligature to 
be displayed even if the sequence is 0063 0074 rather than the sequence 0063 
200D 0074 that is used where the ligature glyph is specifically requested. Thus 
the meaning of ct is not changed by using the ZWJ character.

Now the use of the ct ligature is well-known and frequent.

Suppose now that a fontmaker is making a font of his or her own and decides to 
include a glyph for a pp ligature, with a swash flourish joining and going 
beyond the lower ends of the descenders both to the left and to the right.

The fontmaker could note that the ligature might be good in a word like copper 
but might look wrong in a word like happy due to the tail on the letter y 
clashing with the rightward side of the swash flourish. So the fontmaker 
encodes 0070 200D 0070 as a pp ligature but does not encode 0070 0070 as a pp 
ligature, so that the ligature glyph is only used when specifically requested 
using a ZWJ character.

However, when the ZWJ character is used, the meaning of the pp sequence is not 
changed from its meaning when the ZWJ character is not used.

Yet when 1F468 200D 1F680 is used, the meaning of the sequence is different 
from the meaning of the sequence 1F468 1F680 such that the meaning of 1F468 
200D 1F680 is listed in a file available from the Unicode website.
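
Writing the sequences side by side may make the contrast clearer (Rust
string literals, purely as an illustration of the code points quoted
above):

fn main() {
    // ZWJ here only requests a joined presentation; c and t keep their meaning.
    let ct_request = "c\u{200D}t";                     // 0063 200D 0074
    let pp_request = "p\u{200D}p";                     // 0070 200D 0070
    // Here the joined form carries a meaning of its own ("man astronaut").
    let man_astronaut = "\u{1F468}\u{200D}\u{1F680}";  // 1F468 200D 1F680
    let man_then_rocket = "\u{1F468}\u{1F680}";        // 1F468 1F680
    println!("{ct_request} {pp_request} {man_astronaut} {man_then_rocket}");
}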

From where do the astronaut's spacesuit and helmet come?

I am reminded that in chemistry if one mixes two chemicals, sometimes one just 
gets a mixture of two chemicals and sometimes one gets a chemical reaction such 
that another chemical is produced.

Repeating the quote from earlier in this post.

In tr51-11.html at 2.3 Emoji ZWJ Sequences

quote

To the user of such a system, these behave like single emoji characters, even 
though internally they are sequences.

end quote

I am concerned that in the future a user of ISO/IEC 10646 will not be able to 
find, from ISO/IEC 10646, the meaning of an emoji that he or she observes being 
displayed, even if he or she is able to discover what sequence of characters is 
being used.

So I ask that this matter be discussed please.

William Overington

Monday 15 May 2017



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Alastair Houghton via Unicode
On 15 May 2017, at 11:21, Henri Sivonen via Unicode  wrote:
> 
> In reference to:
> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
> 
> I think Unicode should not adopt the proposed change.

Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
multiple errors there makes no sense.

> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
> representative of implementation concerns of implementations that use
> UTF-8 as their in-memory Unicode representation.
> 
> Even though there are notable systems (Win32, Java, C#, JavaScript,
> ICU, etc.) that are stuck with UTF-16 as their in-memory
> representation, which makes the concerns of such implementations very
> relevant, I think the Unicode Consortium should acknowledge that
> UTF-16 was, in retrospect, a mistake

You may think that.  There are those of us who do not.  The fact is that UTF-16 
makes sense as a default encoding in many cases.  Yes, UTF-8 is more efficient 
for primarily ASCII text, but that is not the case for other situations and the 
fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 
usually focus on) is no more complicated than handling combining characters, 
which you have to do anyway.

> Therefore, despite UTF-16 being widely used as an in-memory
> representation of Unicode and in no way going away, I think the
> Unicode Consortium should be *very* sympathetic to technical
> considerations for implementations that use UTF-8 as the in-memory
> representation of Unicode.

I don’t think the Unicode Consortium should be unsympathetic to people who use 
UTF-8 internally, for sure, but I don’t see what that has to do with either the 
original proposal or with your criticism of UTF-16.

[snip]

> If the proposed
> change was adopted, while Draconian decoders (that fail upon first
> error) could retain their current state machine, implementations that
> emit U+FFFD for errors and continue would have to add more state
> machine states (i.e. more complexity) to consolidate more input bytes
> into a single U+FFFD even after a valid sequence is obviously
> impossible.

“Impossible”?  Why?  You just need to add some error states (or *an* error 
state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the 
only library that already did just that *because it’s clearly the right thing 
to do*.
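
Concretely, the counter-based recovery could be as small as the following
sketch (a hypothetical helper, not ICU code), under one reading of the
proposal: after an invalid lead/trail combination, count how many of the
structurally announced trail bytes actually follow and consume them
together with the lead byte as a single error, so the caller emits one
U+FFFD for the lot.

// Hypothetical helper (not ICU code): the byte length of one "error" when
// the lead byte and the trail bytes it structurally announces are consumed
// together, so the caller replaces the whole run with a single U+FFFD.
fn error_len(input: &[u8]) -> usize {
    // Number of continuation bytes the lead byte announces (structure only,
    // ignoring overlong/surrogate/range restrictions).
    let expected = match input[0] {
        0xC0..=0xDF => 1,
        0xE0..=0xEF => 2,
        0xF0..=0xF7 => 3,
        _ => 0, // stray continuation byte or 0xF8..=0xFF: just this byte
    };
    let mut len = 1;
    while len <= expected
        && input.get(len).map_or(false, |&b| (0x80..=0xBF).contains(&b))
    {
        len += 1;
    }
    len
}

That counter is the whole of the extra state in question.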

Kind regards,

Alastair.

--
http://alastairs-place.net




RE: Are Emoji ZWJ sequences characters?

2017-05-15 Thread Peter Constable via Unicode
Emoji sequences are not _encoded_, per se, in either Unicode or ISO/IEC 10646. 
The act of "encoding" in either of these coding standards is to assign an 
encoded representation in the encoding method of the standards for a given 
entity. In this case, that means to assign a code point. 

Specifying ZWJ sequences for representation of text elements is not encoding in 
the standard; it is simply defining an encoded representation for those text 
elements. Unicode gives some attention to this kind of thing, but ISO/IEC 
10646, not so much. For instance, you won't find anything in ISO/IEC 10646 
specifying that the encoded representation for a rakaar is < VIRAMA, RA >.

So, your helpful person was, indeed, helpful, giving you correct information: 
ZWJ sequences are not _characters_ and have no implications for ISO/IEC 10646.


Peter


Re: Are Emoji ZWJ sequences characters?

2017-05-15 Thread Richard Wordingham via Unicode
On Mon, 15 May 2017 16:14:23 +
Peter Constable via Unicode  wrote:

> So, your helpful person was, indeed, helpful, giving you correct
> information: ZWJ sequences are not _characters_ and have no
> implications for ISO/IEC 10646.

Except in so far as the claimed ligature changes the meaning of the
ligated elements.  For example, using <'a', ZWJ, 'e'> for an a-umlaut
that was clearly not a-diaeresis would probably be on the edge of what
is permissible.  Returning to the example, shouldn't 1F468 200D 1F680
mean 'male rocket maker'?

Richard.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Asmus Freytag via Unicode

On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:

On 15 May 2017, at 11:21, Henri Sivonen via Unicode  wrote:

In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.

Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
multiple errors there makes no sense.


Changing a specification as fundamental as this is something that should 
not be undertaken lightly.


Apparently we have a situation where implementations disagree, and have 
done so for a while. This normally means not only that the 
implementations differ, but that data exists in both formats.


Even if it were true that all data is only stored in UTF-8, any data 
converted from UTF-8 back to UTF-8 going through an interim stage that 
requires UTF-8 conversion would then be different based on which 
converter is used.


Implementations working in UTF-8 natively would potentially see three 
formats:

1) the original ill-formed data
2) data converted with single FFFD
3) data converted with multiple FFFD

These forms cannot be compared for equality by binary matching.

The best that can be done is to convert (1) into one of the other forms 
and then compare treating any run of FFFD code points as equal to any 
other run, irrespective of length.
(For security-critical applications, the presence of any FFFD should 
render the data invalid, so the comparisons we'd be talking about here 
would be for general purpose, like search).
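
For such general-purpose matching, the comparison could look like the
following sketch (a hypothetical helper, not from any particular library):
collapse every maximal run of U+FFFD to a single U+FFFD before comparing.

// Hypothetical helper: compare two strings treating any run of U+FFFD as
// equal to any other run, irrespective of length (general-purpose matching
// only; not for security-sensitive code).
fn fffd_tolerant_eq(a: &str, b: &str) -> bool {
    fn collapse(s: &str) -> String {
        // Collapse each maximal run of U+FFFD into a single U+FFFD.
        let mut out = String::new();
        let mut in_run = false;
        for c in s.chars() {
            if c == '\u{FFFD}' {
                if !in_run {
                    out.push(c);
                }
                in_run = true;
            } else {
                out.push(c);
                in_run = false;
            }
        }
        out
    }
    collapse(a) == collapse(b)
}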


Because we've had years of multiple implementations, it would be 
expected that copious data exists in all three formats, and that data 
will not go away. Changing the specification to pick one of these 
formats as solely conformant is IMHO too late.


A./



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Asmus Freytag via Unicode

On 5/15/2017 3:21 AM, Henri Sivonen via Unicode wrote:

Second, the political reason:

Now that ICU is a Unicode Consortium project, I think the Unicode
Consortium should be particularly sensitive to biases arising from
being both the source of the spec and the source of a popular
implementation. If the way the Unicode Consortium resolves a
discrepancy between ICU behavior and a well-known spec provision (this
isn't some little-known corner case, after all) is by changing the
spec instead of changing ICU, it looks *really bad* both for the equal
footing of ICU vs. other implementations in how the standard is
developed and for the reliability of the standard text vs. ICU source
code as the source of truth that other implementors need to pay
attention to. That is *especially* so when the change is not neutral
for implementations that have made different but completely valid (per
the then-existing spec) and, in the absence of legacy constraints,
superior architectural choices compared to ICU (i.e. UTF-8 internally
instead of UTF-16 internally).

I can see the irony of this viewpoint coming from a WHATWG-aligned
browser developer, but I note that even browsers that use ICU for
legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior
isn't, in fact, the dominant browser UTF-8 behavior. That is, even
Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the
environment that's the most sensitive to how issues like this are
handled, so it would be appropriate for the proposal to survey current
browser behavior instead of just saying that ICU "feels right" or is
"natural".


I think this political reason should be taken very seriously. There are 
already too many instances where ICU can be seen "driving" the 
development of properties and algorithms.


Those involved in the ICU project may not see the problem, but I agree 
with Henri that it requires a bit more sensitivity from the UTC.


A./



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Alastair Houghton via Unicode
On 15 May 2017, at 18:52, Asmus Freytag  wrote:
> 
> On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:
>> On 15 May 2017, at 11:21, Henri Sivonen via Unicode  
>> wrote:
>>> In reference to:
>>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>> 
>>> I think Unicode should not adopt the proposed change.
>> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
>> multiple errors there makes no sense.
> 
> Changing a specification as fundamental as this is something that should not 
> be undertaken lightly.

Agreed.

> Apparently we have a situation where implementations disagree, and have done 
> so for a while. This normally means not only that the implementations differ, 
> but that data exists in both formats.
> 
> Even if it were true that all data is only stored in UTF-8, any data 
> converted from UTF-8 back to UTF-8 going through an interim stage that 
> requires UTF-8 conversion would then be different based on which converter is 
> used.
> 
> Implementations working in UTF-8 natively would potentially see three formats:
> 1) the original ill-formed data
> 2) data converted with single FFFD
> 3) data converted with multiple FFFD
> 
> These forms cannot be compared for equality by binary matching.

But that was always true, even if you were under the impression that only one of 
(2) and (3) existed; and indeed claiming equality between two instances of U+FFFD 
might itself be problematic in some circumstances (you don’t know why the 
U+FFFDs were inserted - they may not replace the same original data).

> The best that can be done is to convert (1) into one of the other forms and 
> then compare treating any run of FFFD code points as equal to any other run, 
> irrespective of length.

It’s probably safer, actually, to refuse to compare U+FFFD as equal to anything 
(even itself) unless a special flag is passed.  For “general purpose” 
applications, you could set that flag and then a single U+FFFD would compare 
equal to another single U+FFFD; no need for the complicated “any string of 
U+FFFD” logic (which in any case makes little sense - it could just as easily 
generate erroneous comparisons as fix the case we’re worrying about here).
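
At the level of individual characters, the flag-gated rule might look like
this sketch (a hypothetical API, purely illustrative):

// Hypothetical API: U+FFFD never compares equal (even to itself) unless the
// caller passes the flag explicitly.
fn chars_equal(a: char, b: char, allow_fffd_match: bool) -> bool {
    if a == '\u{FFFD}' || b == '\u{FFFD}' {
        allow_fffd_match && a == b
    } else {
        a == b
    }
}

With the flag off, any U+FFFD poisons the comparison; with it on, a U+FFFD
only matches another U+FFFD, with no run-length logic.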

> Because we've had years of multiple implementations, it would be expected 
> that copious data exists in all three formats, and that data will not go 
> away. Changing the specification to pick one of these formats as solely 
> conformant is IMHO too late.

I don’t think so.  Even if we acknowledge the possibility of data in the other 
form, I think it’s useful guidance to implementers, both now and in the future. 
 One might even imagine that the other, non-favoured form would eventually 
fall out of use.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton
 wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode  
> wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>
> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
> multiple errors there makes no sense.

The currently-specced behavior makes perfect sense when you add error
emission on top of a fail-fast UTF-8 validation state machine.

>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>>
>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes the concerns of such implementations very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake
>
> You may think that.  There are those of us who do not.

My point is:
The proposal seems to arise from the "UTF-16 as the in-memory
representation" mindset. While I don't expect that case to go away in
any way, I think the Unicode Consortium should recognize that the
"UTF-8 as the in-memory representation" case has enough technical
merit that proposals like this should consider the impact on both
cases equally, despite the "UTF-8 as the in-memory representation"
case at present appearing to be the minority case.
That is, I think it's wrong to view things only or even primarily
through the lens of the "UTF-16 as the in-memory representation" case
that ICU represents.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Shawn Steele via Unicode
>> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
>> multiple errors there makes no sense.
> 
> Changing a specification as fundamental as this is something that should not 
> be undertaken lightly.

IMO, the only thing that can be agreed upon is that "something's bad with this 
UTF-8 data".  I think that whether it's treated as a single group of corrupt 
bytes or each individual byte is considered a problem should be up to the 
implementation.

#1 - This data should "never happen".  In a system behaving normally, this 
condition should never be encountered.  
  * At this point the data is "bad" and all bets are off.
  * Some applications may have a clue how the bad data could have happened and 
want to do something in particular.
  * It seems odd to me to spend much effort standardizing a scenario that 
should be impossible.
#2 - Depending on implementation, either behavior, or some combination, may be 
more efficient.  I'd rather allow apps to optimize for the common case, not the 
case-that-shouldn't-ever-happen
#3 - We have no clue if this "maximal" sequence was a single error, 2 errors, 
or even more.  The lead byte says how many trail bytes should follow, and those 
should be in a certain range.  Values outside of those conditions are illegal, 
so we shouldn't ever encounter them.  So if we did, then something really weird 
happened.  
  * Did a single character get misencoded?
  * Was an illegal sequence illegally encoded?
  * Perhaps a byte got corrupted in transmission?
  * Maybe we dropped a packet/block, so this is really the beginning of a valid 
sequence and the tail of another completely valid sequence?

In practice, all that most apps would be able to do would be to say "You have 
bad data, how bad I have no clue, but it's not right".  A single bit could've 
flipped, or you could have only 3 pages of a 4000 page document.  No clue at 
all.  At that point it doesn't really matter how many FFFD's the error(s) are 
replaced with, and no assumptions should be made about the severity of the 
error.

-Shawn



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Asmus Freytag via Unicode

On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote:



ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
representative of implementation concerns of implementations that use
UTF-8 as their in-memory Unicode representation.

Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes the concerns of such implementations very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake

You may think that.  There are those of us who do not.

My point is:
The proposal seems to arise from the "UTF-16 as the in-memory
representation" mindset. While I don't expect that case to go away in
any way, I think the Unicode Consortium should recognize that the
"UTF-8 as the in-memory representation" case has enough technical
merit that proposals like this should consider the impact on both
cases equally, despite the "UTF-8 as the in-memory representation"
case at present appearing to be the minority case.
That is, I think it's wrong to view things only or even primarily
through the lens of the "UTF-16 as the in-memory representation" case
that ICU represents.

UTF-16 has some nice properties and there's no need to brand it a 
"mistake". UTF-8 has different nice properties, but there's equally no 
reason to treat it as more special than UTF-16.


The UTC should adopt a position of perfect neutrality when it comes to 
the assumed in-memory representation; in other words, it should not assume 
that optimizing for any one encoding form will benefit implementers.


The UTC, where ICU is strongly represented, needs to guard against basing 
encoding/property/algorithm decisions (edge cases, mostly) solely or 
primarily on the needs of a particular implementation that happens to be 
chosen by the ICU project.


A./



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread David Starner via Unicode
On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode <
unicode@unicode.org> wrote:

> Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the
> case for other situations


UTF-8 is clearly more space-efficient for any text that includes more ASCII
characters than characters between U+0800 and U+FFFF. Given the prevalence
of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew and Arabic
will pretty much always be smaller in UTF-8.

Even for scripts that go from 2 bytes to 3, webpages can get much smaller
in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a
factor of 1.8). The max change in the other direction is 1.5, as two bytes
go to three.
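
A quick way to check those factors (illustrative Rust with made-up sample
strings; the exact numbers depend on the text, of course):

// Byte counts of the same &str in UTF-8 vs. UTF-16.
fn utf8_vs_utf16_bytes(s: &str) -> (usize, usize) {
    (s.len(), s.encode_utf16().count() * 2)
}

fn main() {
    // ASCII-heavy markup: UTF-16 doubles it.
    println!("{:?}", utf8_vs_utf16_bytes("<html lang=\"en\">")); // (16, 32)
    // Purely CJK text: UTF-8 is larger, but only by the 1.5 factor.
    println!("{:?}", utf8_vs_utf16_bytes("统一码"));             // (9, 6)
}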


> and the fact is that handling surrogates (which is what proponents of
> UTF-8 or UCS-4 usually focus on) is no more complicated than handling
> combining characters, which you have to do anyway.
>

Not necessarily; you can legally process Unicode text without worrying
about combining characters, whereas you cannot process UTF-16 without
handling surrogates.


RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Shawn Steele via Unicode
I’m not sure how the discussion of “which is better” relates to the discussion 
of ill-formed UTF-8 at all.

And to the last, saying “you cannot process UTF-16 without handling surrogates” 
seems to me to be the equivalent of saying “you cannot process UTF-8 without 
handling lead & trail bytes”.  That’s how the respective encodings work.

One could look at it and think “there are 128 Unicode characters that have the 
same value in UTF-8 as in UTF-32,” and “there are xx thousand Unicode characters 
that have the same value in UTF-16 as in UTF-32.”

-Shawn



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Richard Wordingham via Unicode
On Mon, 15 May 2017 21:38:26 +
David Starner via Unicode  wrote:

> > and the fact is that handling surrogates (which is what proponents
> > of UTF-8 or UCS-4 usually focus on) is no more complicated than
> > handling combining characters, which you have to do anyway.

> Not necessarily; you can legally process Unicode text without worrying
> about combining characters, whereas you cannot process UTF-16 without
> handling surrogates.

The problem with surrogates is inadequate testing.  They're sufficiently
rare for many users that it may be a long time before an error is
discovered.  It's not always obvious that code is designed for UCS-2
rather than UTF-16.

Richard.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Philippe Verdy via Unicode
2017-05-15 19:54 GMT+02:00 Asmus Freytag via Unicode :

> I think this political reason should be taken very seriously. There are
> already too many instances where ICU can be seen "driving" the development
> of property and algorithms.
>
> Those involved in the ICU project may not see the problem, but I agree
> with Henri that it requires a bit more sensitivity from the UTC.
>
I don't think that the fact that ICU originally used UTF-16 internally has
ANY effect on the decision to represent ill-formed sequences as single or
multiple U+FFFD.
The internal encoding has nothing in common with the external encoding used
when processing input data (which may be UTF-8, UTF-16, or UTF-32, and could
in all cases present ill-formed sequences). That internal encoding will play
no role in how the ill-formed input is converted, or in whether it is
converted at all.
So yes, independently of the internal encoding, we'll still have to choose
between:
- not converting the input and returning an error or throwing an exception
- converting the input using a single U+FFFD (in its internal
representation, this does not matter) to replace the complete sequence of
ill-formed code units in the input data, and preferably returning an error
status
- converting the input using as many U+FFFD (in its internal
representation, this does not matter) as needed to replace every occurrence
of ill-formed code units in the input data, and preferably returning an
error status.
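
Expressed as a decoder configuration, the choice is just the following
(sketch of a hypothetical API, not any particular library):

// The three policies listed above, independent of the internal encoding.
enum IllFormedHandling {
    // Stop at the first ill-formed sequence and return an error / throw.
    Fail,
    // Replace the complete ill-formed sequence with a single U+FFFD
    // (and preferably still return an error status).
    SingleReplacement,
    // Replace every occurrence of ill-formed code units with its own U+FFFD
    // (and preferably still return an error status).
    ReplacementPerOccurrence,
}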


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Philippe Verdy via Unicode
Software designed with only UCS-2 and not real UTF-16 support is still
used today.

For example, MySQL with its broken "UTF-8" encoding, which in fact encodes
supplementary characters as two separate 16-bit code units for surrogates,
each one blindly encoded as a 3-byte sequence that would be ill-formed in
standard UTF-8; it also does not distinguish invalid pairs of surrogates,
and offers no collation support for supplementary characters.

In this case some other software will break silently on these sequences.
For example, MediaWiki installed with a MySQL backend server whose
datastore was created with that broken "UTF-8" will silently discard any
text starting at the first supplementary character found in the wikitext.
This is not a problem in MediaWiki itself: MediaWiki does NOT support a
MySQL server installed with that "UTF-8" datastore, and only supports
MySQL if the storage encoding declared for the database is "binary" (but
in that case there's no collation support in MySQL; texts are just
arbitrary sequences of bytes, and internationalization is then done in the
client software, here MediaWiki and its PHP, ICU, or Lua libraries, and
other tools written in Perl and other languages).

Note that this does not affect Wikimedia's own wikis, because they were
initially installed correctly with the binary encoding in MySQL, and
Wikimedia wikis now use another database engine with native UTF-8 support
and full coverage of the UCS. Other wikis using MediaWiki will need to
upgrade their MySQL version if they want to keep it for administrative
reasons (and not convert their datastore to the binary encoding again).

Software running with only UCS-2 is exposed to risks similar to the one
seen in MediaWiki on incorrect MySQL installations, where any user may
edit a page to insert a supplementary character (supplementary sinograms,
emoji, Gothic letters, supplementary symbols...) that looks correct when
previewing, parses fine, and is accepted silently by MySQL, but is then
silently truncated because of the encoding error: when reloading the data
from MySQL, data will unexpectedly have been discarded.

How should one react to the risks of data loss or truncation? Throwing an
exception or just returning an error is in fact more dangerous than just
replacing the ill-formed sequences with one or more U+FFFD: we preserve as
much as possible. In any case, software should be able to perform some
tests on its datastore to see whether it handles the encoding correctly.
This could be done when starting the software, emitting log messages when
the backend does not support the encoding: all that is needed is to send a
single supplementary character to the remote datastore in a junk table or
field and then retrieve it immediately in another transaction to make sure
it is preserved. Similar tests can be done to see whether the remote
datastore also preserves the encoding form, or normalizes it, or alters it
(such an alteration could happen with a leading BOM, and other silent
alterations could be made to NULs and trailing spaces if the datastore
uses fixed-length rather than variable-length text fields). Similar tests
could be done to check the maximum length accepted: a VARCHAR(256) on a
binary-encoded database will not always store 256 Unicode characters, but
in a database encoded with non-broken UTF-8 it should store 256 code
points independently of their values, even if their UTF-8 encoding takes
up to 1024 bytes.
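
A sketch of such a start-up check, with a hypothetical Datastore trait
standing in for the real backend (this is not MySQL client code, just the
shape of the round-trip test):

// Hypothetical Datastore trait standing in for the real backend; the point
// is only the round-trip of one supplementary character at start-up.
trait Datastore {
    fn put(&mut self, key: &str, value: &str);
    fn get(&self, key: &str) -> Option<String>;
}

// Returns false (so the caller can log a warning) if the backend truncates
// or alters text containing a character outside the BMP.
fn backend_preserves_supplementary(store: &mut dyn Datastore) -> bool {
    let probe = "\u{1F680}"; // U+1F680 ROCKET; any supplementary character works
    store.put("__encoding_probe__", probe);
    store.get("__encoding_probe__").as_deref() == Some(probe)
}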


2017-05-16 0:43 GMT+02:00 Richard Wordingham via Unicode <
unicode@unicode.org>:

> On Mon, 15 May 2017 21:38:26 +
> David Starner via Unicode  wrote:
>
> > > and the fact is that handling surrogates (which is what proponents
> > > of UTF-8 or UCS-4 usually focus on) is no more complicated than
> > > handling combining characters, which you have to do anyway.
>
> > Not necessarily; you can legally process Unicode text without worrying
> > about combining characters, whereas you cannot process UTF-16 without
> > handling surrogates.
>
> The problem with surrogates is inadequate testing.  They're sufficiently
> rare for many users that it may be a long time before an error is
> discovered.  It's not always obvious that code is designed for UCS-2
> rather than UTF-16.
>
> Richard.
>


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Karl Williamson via Unicode

On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:

In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.

The proposal is to make ICU's spec violation conforming. I think there
is both a technical and a political reason why the proposal is a bad
idea.



Henri's claim that "The proposal is to make ICU's spec violation 
conforming" is a false statement, and hence all further commentary based 
on this false premise is irrelevant.


I believe that ICU is actually currently conforming to TUS.

The proposal reads:

"For UTF-8, recommend evaluating maximal subsequences based on the 
original structural definition of UTF-8..."


There is nothing in here that requires any implementation to be 
changed.  The word "recommend" does not mean the same as "require". 
Have you guys been so caught up in the current international political 
situation that you have lost the ability to read straight?


TUS has certain requirements for UTF-8 handling, and it has certain 
other "Best Practices" as detailed in 3.9.  The proposal involves 
changing those recommendations.  It does not involve changing any 
requirements.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
 wrote:
> I’m not sure how the discussion of “which is better” relates to the
> discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the
underlying issue. I'll clarify what I meant on that point and then
move on:

I acknowledge that UTF-16 as the internal memory representation is the
dominant design. However, UTF-8 as the internal memory representation is
*such a good design* (when legacy constraints permit) that, *despite it
not being the current dominant design*, I think the Unicode Consortium
should be fully supportive of UTF-8 as the internal memory representation
and not treat UTF-16 as the internal representation as the one true way of
doing things that gets considered when speccing stuff.

I.e. I wasn't arguing against UTF-16 as the internal memory
representation (for the purposes of this thread) but trying to
motivate why the Consortium should consider "UTF-8 internally" equally
despite it not being the dominant design.

So: When a decision could go either way from the "UTF-16 internally"
perspective, but one way clearly makes more sense from the "UTF-8
internally" perspective, the "UTF-8 internally" perspective should be
decisive in *such a case*. (I think the matter at hand is such a
case.)

At the very least a proposal should discuss the impact on the "UTF-8
internally" case, which the proposal at hand doesn't do.

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of
implementations) issue to be decided but a proposed change to a
standard that has many widely-deployed implementations. Even when
observing only "UTF-16 internally" implementations, I think it would
be appropriate for the proposal to include a review of what existing
implementations, beyond ICU, do.

Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
test with three major browsers that use UTF-16 internally and have
independent (of each other) implementations of UTF-8 decoding
(Firefox, Edge and Chrome) shows agreement on the current spec: there
is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
6 on the second, 4 on the third and 6 on the last line). Changing the
Unicode standard away from that kind of interop needs *way* better
rationale than "feels right".

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/