Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Martin J. Dürst via Unicode Tue, 30 May 2017 04:32:05 -0700

Hello Karl, others,

On 2017/05/27 06:15, Karl Williamson via Unicode wrote:

On 05/26/2017 12:22 PM, Ken Whistler wrote:
On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
The link provided about the PRI doesn't lead to the comments.
PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a numbered directory with the name "feedback.html". But the comments were collected together at the time and are accessible here:
http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121

Also there was a separately submitted comment document:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

And the minutes of the pertinent UTC meeting (UTC #116):

http://www.unicode.org/L2/L2008/08253.htm
The minutes simply capture the consensus to adopt Option #2 from PRI #121, and the relevant action items.
I now return the floor to the distinguished disputants to continue litigating history. ;-)
--Ken
The reason this discussion got started was that in December, someone came to me and said the code I support does not follow Unicode best practices, and suggested I need to change, though no ticket (yet) has been filed. I was surprised, and posted a query to this list about what the advantages of the new approach are.

Can you provide a reference to that discussion? I might have missed it in December.

There were a number of replies, but I did not see anything that seemed definitive. After a month, I created a ticket in Unicode and Markus was assigned to research it, and came up with the proposal currently being debated.

Which is to completely reverse the current recommendation in Unicode 9.0. While I agree that this might help you fend off a bug report, it would create chances for bug reports for Ruby, Python3, many if not all Web browsers,...

Looking at the PRI, it seems to me that treating an overlong as a single maximal unit is in the spirit of the wording, if not the fine print.


In standards, the "fine print" matters.

That seems to be borne out by Markus, even with his stake in ICU, supporting option #2.

Well, at http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121, I also supported option 2, with code behind it.

Looking at the comments, I don't see any discussion of the effect of this on overlong treatments. My guess is that the effect change was unintentional.

I agree that it was probably not considered explicitly. But overlongs were disallowed for security reasons, and once the definition of UTF-8 was tightened, "overlongs" essentially did not exist anymore. Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows what it means, but everybody knows they don't exist.


[Just to be sure, by the above, I don't mean that a sequence such as C0 B0 cannot appear somewhere in some input. But C0 is not UTF-8 all by itself, and there is no need to see C0 B0 as a (ghost) sequence.]
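
To make the C0 B0 case concrete, here is a small sketch using Python 3's built-in UTF-8 decoder, which this thread names as one of the implementations following the current recommendation. The helper name decode_replace and the sample inputs are only illustrative; C0 B0 and E0 80 B0 are both "overlong" spellings of U+0030.

```python
# Python 3's built-in decoder is used here only as one example of a decoder
# that follows the current recommendation; decode_replace is a made-up helper.
REPLACEMENT = "\ufffd"


def decode_replace(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for ill-formed input."""
    return data.decode("utf-8", errors="replace")


# C0 can never start a well-formed sequence, and B0 cannot stand alone,
# so each byte is replaced on its own.
two_byte = decode_replace(b"\xc0\xb0")

# E0 must be followed by A0..BF; 80 is outside that range, so E0 alone
# is the first ill-formed unit, followed by two stray trailing bytes.
three_byte = decode_replace(b"a\xe0\x80\xb0b")

print(two_byte.count(REPLACEMENT), three_byte.count(REPLACEMENT))
```

On a decoder of this kind there is no single "overlong" unit to replace: the input is simply a run of bytes, none of which begins a well-formed sequence.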

So I have code that handled overlongs in the only correct way possible when they were acceptable,

No. As long as they were acceptable, they wouldn't have been replaced by an FFFD.

and in the obvious way after they became illegal,

Why? A change was necessary from producing an actual character to producing some number of FFFDs. It may have been easier to produce just a single FFFD, but that depends on how the code was organized.
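
How the number of FFFDs falls out of the code organization can be sketched as follows. This is a hypothetical toy decoder, not the code of Perl, ICU, or any other implementation discussed here; the names decode, whole_sequence, _trail_spec and _structural_length are invented for illustration. With whole_sequence=False it replaces each maximal subpart of an ill-formed sequence (the current recommendation); with whole_sequence=True it is organized around the lead byte's nominal length and emits one FFFD for the lead byte plus any trailing bytes that follow it.

```python
def _trail_spec(lead):
    """(trailing bytes needed, allowed range of the 2nd byte) for a lead
    byte that can start well-formed UTF-8, else None."""
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if lead == 0xE0:
        return 2, 0xA0, 0xBF   # excludes 3-byte overlongs
    if lead == 0xED:
        return 2, 0x80, 0x9F   # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return 2, 0x80, 0xBF
    if lead == 0xF0:
        return 3, 0x90, 0xBF   # excludes 4-byte overlongs
    if lead == 0xF4:
        return 3, 0x80, 0x8F   # excludes values above U+10FFFF
    if 0xF1 <= lead <= 0xF3:
        return 3, 0x80, 0xBF
    return None                # C0, C1, F5..FF, or a stray trailing byte


def _structural_length(lead):
    """Length suggested by the lead byte's bit pattern alone."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 1


def decode(data: bytes, whole_sequence: bool = False) -> str:
    out, i, n = [], 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        spec = _trail_spec(lead)
        j = i + 1
        if spec is not None:
            need, lo, hi = spec
            # Extend while the bytes can still be part of a well-formed sequence.
            while j - i <= need and j < n:
                low, high = (lo, hi) if j == i + 1 else (0x80, 0xBF)
                if not low <= data[j] <= high:
                    break
                j += 1
            if j - i == need + 1:
                out.append(data[i:j].decode("utf-8"))
                i = j
                continue
        if whole_sequence:
            # Alternative organization: swallow the lead byte plus any
            # trailing bytes up to its nominal length as one unit.
            end = min(i + _structural_length(lead), n)
            j = i + 1
            while j < end and 0x80 <= data[j] <= 0xBF:
                j += 1
        out.append("\ufffd")   # one U+FFFD per unit, whichever way it was delimited
        i = j
    return "".join(out)


print(len(decode(b"\xc0\xb0")), len(decode(b"\xc0\xb0", whole_sequence=True)))
```

In this sketch the two behaviours differ only in how the end of the ill-formed unit is found, which is why neither is obviously "easier" in general: it depends on whether the decoder validates as it reads or first collects a structurally complete sequence and validates afterwards.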

and now without apparent discussion (which is very much akin to "flimsy reasons"), it suddenly was no longer "best practice".

Not 'now', but almost 9 years ago. And not "without apparent discussion", but with an explicit PRI.

And that change came "rather late in the game". That this escaped notice for years indicates that the specifics of REPLACEMENT CHAR handling don't matter all that much.


I agree. You haven't even received a ticket yet.

To cut to the chase, I think Unicode should issue a Corrigendum to the effect that it was never the intent of this change to say that treating overlongs as a single unit isn't best practice. I'm not sure this warrants a full-fledged Corrigendum, though. But I believe the text of the best practices should indicate that treating overlongs as a single unit is just as acceptable as Martin's interpretation.

I'd essentially be fine with that, under the condition that the current recommendation is maintained as a clearly identified recommendation, so that Python3, Ruby, Web standards and browsers, and so on can easily refer to it.


Regards,   Martin.

I believe this is pretty much in line with Shawn's position. Certainly, a discussion of the reasons one might choose one interpretation over another should be included in TUS. That would likely have satisfied my original query, which hence would never have been posted.