On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode
<unicode@unicode.org> wrote:
> On Wed, 31 May 2017 15:12:12 +0300
> Henri Sivonen via Unicode <unicode@unicode.org> wrote:
>> I am not claiming it's too difficult to implement. I think it
>> inappropriate to ask implementations, even from-scratch ones, to take
>> on added complexity in error handling on mere aesthetic grounds. Also,
>> I think it's inappropriate to induce implementations already written
>> according to the previous guidance to change (and risk bugs) or to
>> make the developers who followed the previous guidance with precision
>> be the ones who need to explain why they aren't following the new
>> guidance.
>
> How straightforward is the FSM for back-stepping?

This seems beside the point, since the new guidance wasn't advertised
as improving backward stepping compared to the old guidance.

(At first glance, I don't see the new guidance improving back stepping.
In fact, if the UTC meant to adopt ICU's behavior for obsolete five-
and six-byte bit patterns, AFAICT, backstepping with the ICU behavior
requires examining more bytes backward than the old guidance required.)

>> On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
>> <unicode@unicode.org> wrote:
>> > The UTF-8 conversion code that I wrote for ICU, and apparently the
>> > code that various other people have written, collects sequences
>> > starting from lead bytes, according to the original spec, and at
>> > the end looks at whether the assembled code point is too low for
>> > the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a
>> > non-trail byte is quite natural, and reading the PRI text
>> > accordingly is quite natural too.
>>
>> I don't doubt that other people have written code with the same
>> concept as ICU, but as far as non-shortest form handling goes in the
>> implementations I tested (see URL at the start of this email) ICU is
>> the lone outlier.
>
> You should have researched implementations as they were in 2007.

I don't see how the state of things in 2007 is relevant to a decision
taken in 2017. What is relevant is that by 2017, prominent
implementations had adopted the old Unicode guidance. That being the
case, it's inappropriate to change the guidance for aesthetic reasons
or to favor the Unicode Consortium-hosted implementation.

On Wed, May 31, 2017 at 8:43 PM, Shawn Steele via Unicode
<unicode@unicode.org> wrote:
> I do not understand the energy being invested in a case that shouldn't
> happen, especially in a case that is a subset of all the other bad cases that
> could happen.

I'm a browser developer.
I've explained previously on this list and in my blog post why the
browser developer / Web standard culture favors well-defined behavior
in error cases these days.

On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode
<unicode@unicode.org> wrote:
> Henri Sivonen wrote:
>
>> If anything, I hope this thread results in the establishment of a
>> requirement for proposals to come with proper research about what
>> multiple prominent implementations to about the subject matter of a
>> proposal concerning changes to text about implementation behavior.
>
> Considering that several folks have objected that the U+FFFD
> recommendation is perceived as having the weight of a requirement, I
> think adding Henri's good advice above as a "requirement" seems
> heavy-handed. Who will judge how much research qualifies as "proper"?

In the Unicode scope, it's indeed harder than in the WHATWG scope to
draw a clear line around what the prominent implementations are. The
point is that just checking ICU is not good enough. Someone making a
proposal should check the four major browser engines and a bunch of
system frameworks and standard libraries for well-known programming
languages. Which frameworks and standard libraries to check, and how
many, can't be defined precisely or objectively, and depends on the
subject matter (there are many UTF-8 decoders but e.g. fewer text
shaping engines). There will be diminishing returns to checking them.
Chances are that it's not necessary to check very many before a
pattern emerges that shows whether the existing spec language is being
implemented (don't change it) or being ignored (probably should be
changed then).

In any case, "we can't check everything or choose fairly what exactly
to check" shouldn't be a reason for it to be OK to just check ICU or to
make abstract arguments without checking any implementations at all.
Checking multiple popular implementations is better homework than just
checking ICU, even if it's up to the person making the proposal to
choose exactly which implementations to check. The committee should be
able to recognize whether the list of implementations tested looks
like a list of broadly-deployed implementations.

On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode
<unicode@unicode.org> wrote:
> * As far as I can tell, there are two (maybe three) sane approaches to this
> problem:
>         * Either a "maximal" emission of one U+FFFD for every byte that
> exists outside of a good sequence
>         * Or a "minimal" version that presumes the lead byte was counting
> trail bytes correctly even if the resulting sequence was invalid. In that
> case just use one U+FFFD.
>         * And (maybe, I haven't heard folks arguing for this one) emit one
> U+FFFD at the first garbage byte and then ignore the input until valid data
> starts showing up again. (So you could have 1 U+FFFD for a string of a
> hundred garbage bytes as long as there weren't any valid sequences within
> that group).

I think it's not useful to come up with new rules in the abstract. I'd
like to focus on the fact that the Standard expressed a preference and
the preference got implemented (in broadly-deployed well-known
software). That being the case, it's not OK to subsequently change the
preference expressed in the Standard as a matter of what "feels right"
or "sane". There wasn't a super-serious problem with the
previously-expressed preference, and that preference already got
implemented in multiple pieces of broadly-deployed software whose
developers took the Standard's expression of preference seriously.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
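
Illustration: to make the three U+FFFD emission strategies in the
quoted list from Shawn Steele concrete, here is a minimal Python
sketch. It is one reading of that list, not code from ICU or from any
implementation discussed in this thread. The names decode, valid_len,
claimed_len and the policy strings are hypothetical. As a
simplification, the obsolete five- and six-byte lead patterns are
treated as single garbage bytes. Note that none of the three policies
is the same as the Standard's previously expressed preference of one
U+FFFD per maximal subpart of an ill-formed sequence.

```python
REPLACEMENT = "\ufffd"


def valid_len(data: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at i, or 0 if none."""
    for n in (1, 2, 3, 4):
        chunk = data[i:i + n]
        if len(chunk) < n:
            break
        try:
            # Strict decoding rejects overlongs, surrogates and values > U+10FFFF.
            chunk.decode("utf-8")
            return n
        except UnicodeDecodeError:
            continue
    return 0


def is_trail(b: int) -> bool:
    return 0x80 <= b <= 0xBF


def claimed_len(b: int) -> int:
    """Sequence length a lead byte claims (assumption: 5/6-byte leads count as 1)."""
    if 0xC0 <= b <= 0xDF:
        return 2
    if 0xE0 <= b <= 0xEF:
        return 3
    if 0xF0 <= b <= 0xF7:
        return 4
    return 1


def decode(data: bytes, policy: str) -> str:
    """Decode UTF-8, replacing garbage according to one of three policies."""
    out = []
    i = 0
    in_garbage = False
    while i < len(data):
        n = valid_len(data, i)
        if n:
            out.append(data[i:i + n].decode("utf-8"))
            i += n
            in_garbage = False
        elif policy == "per_byte":
            # One U+FFFD for every byte outside a good sequence.
            out.append(REPLACEMENT)
            i += 1
        elif policy == "lead_counted":
            # Trust the lead byte's count: swallow the trail bytes it claims,
            # stopping early at a non-trail byte, and emit a single U+FFFD.
            end = i + 1
            limit = i + claimed_len(data[i])
            while end < len(data) and end < limit and is_trail(data[end]):
                end += 1
            out.append(REPLACEMENT)
            i = end
        elif policy == "collapse_run":
            # One U+FFFD for a whole run of garbage, until valid data resumes.
            if not in_garbage:
                out.append(REPLACEMENT)
            in_garbage = True
            i += 1
        else:
            raise ValueError(f"unknown policy: {policy}")
    return "".join(out)
```

For the input b"A\xc0\x80\xc0\x80B" (two overlong encodings of NUL
between "A" and "B"), the sketch yields four U+FFFD with "per_byte",
two with "lead_counted" and one with "collapse_run", so the choice of
policy is directly visible in the output.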