On Tue, May 16, 2017 at 9:36 PM, Markus Scherer wrote:
> Let me try to address some of the issues raised here.
Thank you.
> The proposal changes a recommendation, not a requirement.
This is a very bad reason in favor of the change. If anything, this
should be a reason why there is no need to ch
Another alternative for your API is to not return simple integer values, but
return (read-only) instances of a Char32 class whose "scalar" property
would normally be a valid codepoint with scalar value, or whose "string"
property will be the actual character; but with another static property
"isVali
> Faster ok, provided this does not break other uses, notably for random
> access within strings…
Either way, this is a “recommendation”. I don’t see how that can provide for
not-“breaking other uses.” If it’s internal, you can do what you will, so if
you need the 1:1 seeming parity, then yo
2017-05-16 20:50 GMT+02:00 Shawn Steele :
> But why change a recommendation just because it “feels like”. As you
> said, it’s just a recommendation, so if that really annoyed someone, they
> could do something else (eg: they could use a single FFFD).
>
>
>
> If the recommendation is truly that me
On Tue, 16 May 2017 11:36:39 -0700
Markus Scherer via Unicode wrote:
> Why do we care how we carve up an illegal sequence into subsequences?
> Only for debugging and visual inspection. Maybe some process is using
> illegal, overlong sequences to encode something special (à la Java
> string serial
But why change a recommendation just because it “feels like”. As you said,
it’s just a recommendation, so if that really annoyed someone, they could do
something else (eg: they could use a single FFFD).
If the recommendation is truly that meaningless or arbitrary, then we just get
into silly d
On 16 May 2017, at 19:36, Markus Scherer wrote:
>
> Let me try to address some of the issues raised here.
Thanks for jumping in.
The one thing I wanted to ask about was the “without ever restricting trail
bytes to less than 80..BF”. I think that could be misinterpreted; having
thought about
Let me try to address some of the issues raised here.
The proposal changes a recommendation, not a requirement. Conformance
applies to finding and interpreting valid sequences properly. This includes
not consuming parts of valid sequences when dealing with illegal ones, as
explained in the section
> On 16 May 2017, at 20:01, Philippe Verdy wrote:
>
> On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random
> sequences of 16-bit code units are not permitted. There's visibly a
> validation step that returns an error if you attempt to create files with
> invalid sequences (
2017-05-16 19:30 GMT+02:00 Shawn Steele via Unicode :
> C) The data was corrupted by some other means. Perhaps bad
> concatenations, lost blocks during read/transmission, etc. If we lost 2
> 512 byte blocks, then maybe we should have a thousand FFFDs (but how would
> we know?)
>
Thousands of U
On 5/16/2017 10:30 AM, Shawn Steele via Unicode wrote:
Would you advocate replacing
e0 80 80
with
U+FFFD U+FFFD U+FFFD (1)
ra
Regardless, it's not legal and hasn't been legal for quite some time.
Replacing a hacked embedded "null" with FFFD is going to be pretty breaking to
anything depending on that fake-null, so one or three isn't really going to
matter.
-Original Message-
From: Unicode [mailto:unicode-boun
On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random
sequences of 16-bit code units are not permitted. There's visibly a
validation step that returns an error if you attempt to create files with
invalid sequences (including other restrictions such as forbidding U+
and some ot
On Tue, 16 May 2017 17:30:01 +
Shawn Steele via Unicode wrote:
> > Would you advocate replacing
>
> > e0 80 80
>
> > with
>
> > U+FFFD U+FFFD U+FFFD (1)
>
> > rather than
>
> > U+FFFD (2)
>
> > It’s pretty clear what the intent of the encoder was
> Would you advocate replacing
> e0 80 80
> with
> U+FFFD U+FFFD U+FFFD (1)
> rather than
> U+FFFD (2)
> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t
> want to decode it as a NUL (that was the source of previ
> On 16 May 2017, at 18:38, Alastair Houghton
> wrote:
>
> On 16 May 2017, at 17:23, Hans Åberg wrote:
>>
>> HFS implements case insensitivity in a layer above the filesystem raw
>> functions. So it is perfectly possible to have files that differ by case
>> only in the same directory by usi
On 16 May 2017, at 17:23, Hans Åberg wrote:
>
> HFS implements case insensitivity in a layer above the filesystem raw
> functions. So it is perfectly possible to have files that differ by case only
> in the same directory by using low level function calls. The Tenon MachTen
> did that on Mac O
> On 16 May 2017, at 18:13, Alastair Houghton
> wrote:
>
> On 16 May 2017, at 17:07, Hans Åberg wrote:
>>
> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on
> UCS-2/UTF-16. ...
The filesystem directory is using octet sequences and does not bother
On 16 May 2017, at 17:07, Hans Åberg wrote:
>
HFS(+), NTFS and VFAT long filenames are all encoded in some variation on
UCS-2/UTF-16. ...
>>>
>>> The filesystem directory is using octet sequences and does not bother
>>> passing over an encoding, I am told. Someone could remember one
> On 16 May 2017, at 17:52, Alastair Houghton
> wrote:
>
> On 16 May 2017, at 16:44, Hans Åberg wrote:
>>
>> On 16 May 2017, at 17:30, Alastair Houghton via Unicode
>> wrote:
>>>
>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on
>>> UCS-2/UTF-16. ...
>>
>> The
On 16 May 2017, at 16:44, Hans Åberg wrote:
>
> On 16 May 2017, at 17:30, Alastair Houghton via Unicode
> wrote:
>>
>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on
>> UCS-2/UTF-16. ...
>
> The filesystem directory is using octet sequences and does not bother pass
> On 16 May 2017, at 17:30, Alastair Houghton via Unicode
> wrote:
>
> On 16 May 2017, at 14:23, Hans Åberg via Unicode wrote:
>>
>> You don't. You have a filename, which is an octet sequence of unknown
>> encoding, and want to deal with it. Therefore, valid Unicode transformations
>> of the
On 16 May 2017, at 14:23, Hans Åberg via Unicode wrote:
>
> You don't. You have a filename, which is an octet sequence of unknown
> encoding, and want to deal with it. Therefore, valid Unicode transformations
> of the filename may result in it not being reachable.
>
> It only matters th
2017-05-16 15:23 GMT+02:00 Hans Åberg :
> All current filesystems, as far as experts could recall, use octet
> sequences at the lowest level; whatever encoding is used is built in a
> layer above
>
Not NTFS (on Windows) which uses sequences of 16bit units. Same about
FAT32/exFAT within "Long File
> On 16 May 2017, at 15:00, Philippe Verdy wrote:
>
> 2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode :
>
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode
> > wrote:
> ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to
On Tue, 16 May 2017 14:44:44 +0200
Hans Åberg via Unicode wrote:
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode
> > wrote:
> ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode
> codepoint markers that indi
On Tue, 16 May 2017 20:08:52 +0900
"Martin J. Dürst via Unicode" wrote:
> I agree with others that ICU should not be considered to have a
> special status, it should be just one implementation among others.
> [The next point is a side issue, please don't spend too much time on
> it.] I find it
2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode :
>
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode
> wrote:
> ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode codepoint
> markers that indicate how UTF-8, inc
> On 15 May 2017, at 12:21, Henri Sivonen via Unicode
> wrote:
...
> I think Unicode should not adopt the proposed change.
It would be useful, for use with filesystems, to have Unicode codepoint markers
that indicate how UTF-8, including non-valid sequences, is translated into
UTF-32 in a way
2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode :
> > One additional note: the standard codifies this behaviour as a
> *recommendation*, not a requirement.
>
> This is an odd argument in favor of changing it. If the argument is
> that it's just a recommendation that you don't need to adhere t
Hello everybody,
[using this mail to in effect reply to different mails in the thread]
On 2017/05/16 17:31, Henri Sivonen via Unicode wrote:
On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote:
Under what circumstance would it matter how many U+FFFDs you see?
Maybe it doesn't, but I don
>
> The proposal actually does cover things that aren’t structurally valid,
> like your e0 e0 e0 example, which it suggests should be a single U+FFFD
> because the initial e0 denotes a three byte sequence, and your 80 80 80
> example, which it proposes should constitute three illegal subsequences
>
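As a rough illustration of what is being counted in this thread, the sketch below decodes the byte sequences mentioned here (e0 80 80, 80 80 80 and e0 e0 e0) with Python's built-in UTF-8 decoder and counts the U+FFFD characters that come out. It assumes a Python 3 decoder that follows the currently recommended practice, not the proposed one, so it only shows one side of the debate. The helper name `count_replacements` is made up for this example.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(data: bytes) -> int:
    """Decode data as UTF-8, replacing errors, and count the U+FFFDs produced."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# Byte sequences taken from the messages in this thread.
for sample in (b"\xe0\x80\x80", b"\x80\x80\x80", b"\xe0\xe0\xe0"):
    hex_form = " ".join(f"{byte:02x}" for byte in sample)
    print(hex_form, "->", count_replacements(sample), "replacement character(s)")
```

A decoder built on the proposed recommendation would, per the description quoted above, report a single U+FFFD for some of these inputs; the count is the only thing that differs, since well-formed input decodes identically either way.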
On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton
wrote:
> On 16 May 2017, at 09:31, Henri Sivonen via Unicode
> wrote:
>>
>> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
>> wrote:
>>> That would be true if the in-memory representation had any effect on what
>>> we’re talking about, bu
On 16 May 2017, at 09:31, Henri Sivonen via Unicode wrote:
>
> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
> wrote:
>> That would be true if the in-memory representation had any effect on what
>> we’re talking about, but it really doesn’t.
>
> If the internal representation is UTF-16 (
> On 16 May 2017, at 10:29, David Starner wrote:
>
> On Tue, May 16, 2017 at 1:45 AM Alastair Houghton
> wrote:
> That’s true anyway; imagine the database holds raw bytes, that just happen to
> decode to U+FFFD. There might seem to be *two* names that both contain
> U+FFFD in the same place
On Tue, May 16, 2017 at 1:45 AM Alastair Houghton <
alast...@alastairs-place.net> wrote:
> That’s true anyway; imagine the database holds raw bytes, that just happen
> to decode to U+FFFD. There might seem to be *two* names that both contain
> U+FFFD in the same place. How do you distinguish bet
> On 16 May 2017, at 09:18, David Starner wrote:
>
> On Tue, May 16, 2017 at 12:42 AM Alastair Houghton
> wrote:
>> If you’re about to mutter something about security, consider this: security
>> code *should* refuse to compare strings that contain U+FFFD (or at least
>> should never treat th
On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote:
> but I think the way he raises this point is needlessly antagonistic.
I apologize. My level of dismay at the proposal's ICU-centricity overcame me.
On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
wrote:
> That would be true if the in-m
On Tue, May 16, 2017 at 12:42 AM Alastair Houghton <
alast...@alastairs-place.net> wrote:
> If you’re about to mutter something about security, consider this:
> security code *should* refuse to compare strings that contain U+FFFD (or at
> least should never treat them as equal, even to themselves)
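The rule suggested in the quoted text can be sketched in a few lines of Python. This is only one possible reading of it, assuming the strings have already been decoded with replacement; the function name `secure_equals` is hypothetical and not taken from any library.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def secure_equals(a: str, b: str) -> bool:
    """Compare two strings, never treating a string containing U+FFFD as equal.

    A string that contains U+FFFD may stand for several different damaged
    inputs, so it is not considered equal to anything, itself included.
    """
    if REPLACEMENT in a or REPLACEMENT in b:
        return False
    return a == b
```

With this rule, two damaged names that happen to decode to the same text never match, regardless of how many U+FFFDs the decoder emitted for each.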
On Tue, 16 May 2017 10:01:03 +0300
Henri Sivonen via Unicode wrote:
> Even so, I think even changing a recommendation of "best practice"
> needs way better rationale than "feels right" or "ICU already does it"
> when a) major browsers (which operate in the most prominent
> environment of broken a
On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode <
unicode@unicode.org> wrote:
> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
> wrote:
> > I’m not sure how the discussion of “which is better” relates to the
> > discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which
On 16 May 2017, at 08:22, Asmus Freytag via Unicode wrote:
> I therefore think that Henri has a point when he's concerned about tacit
> assumptions favoring one memory representation over another, but I think the
> way he raises this point is needlessly antagonistic.
That would be true if the
On 15 May 2017, at 23:43, Richard Wordingham via Unicode
wrote:
>
> The problem with surrogates is inadequate testing. They're sufficiently
> rare for many users that it may be a long time before an error is
> discovered. It's not always obvious that code is designed for UCS-2
> rather than UT
On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen wrote:
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreeme
On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote:
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
wrote:
I’m not sure how the discussion of “which is better” relates to the
discussion of ill-formed UTF-8 at all.
Clearly, the "which is better" issue is distracting from the
under
On 15 May 2017, at 23:16, Shawn Steele via Unicode wrote:
>
> I’m not sure how the discussion of “which is better” relates to the
> discussion of ill-formed UTF-8 at all.
It doesn’t, which is a point I made in my original reply to Henri. The only
reason I answered his anti-UTF-16 rant at all
On Tue, May 16, 2017 at 6:23 AM, Karl Williamson
wrote:
> On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>>
>> The proposal is to make ICU's spe