Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()

2017-06-02 Thread Jonathan Wakely

On 05/05/17 18:05 +0100, Jonathan Wakely wrote:

As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.

i.e. because all values of int_type are also valid values of char_type
we cannot meet the requirement that:

"The member eof() shall return an implementation-defined constant
that cannot appear as a valid UTF-16 code unit."

I've reported this as a defect, suggesting that the wording above
needs to change.

One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
always returns the same value, whether it succeeds or not. On success
it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
which is the same value. I think that can be solved with the attached
change, which preserves the invariant in [char.traits.require] that
eof() returns:

"a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
all values c."

This can be true if we ensure that to_int_type never returns the eof()
value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
doing something like this.

It means that when writing u'\uFFFF' to a streambuf we write that
character successfully, but return u'\uFFFD' instead; and when reading
u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
asymmetrical, as we can write that character but not read it back.  It
might be better to refuse to write u'\uFFFF' and write it as the
replacement character instead, but I think I prefer to write the right
character when possible. It also doesn't require any extra changes.

All tests pass with this, does anybody see any problems with this
approach?





commit 8ab705e4920e933d3b0e90fd004b93d89aab8619
Author: Jonathan Wakely 
Date:   Fri May 5 16:57:07 2017 +0100

   PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
   
   	PR libstdc++/80624

* doc/xml/manual/status_cxx2011.xml: Document to_int_type behaviour.
* include/bits/char_traits.h (char_traits<char16_t>::to_int_type):
Transform eof value to U+FFFD.
* testsuite/21_strings/char_traits/requirements/char16_t/eof.cc: New.
* testsuite/27_io/basic_streambuf/sgetc/char16_t/80624.cc: New.
* testsuite/27_io/basic_streambuf/sputc/char16_t/80624.cc: New.


I've committed this now. I'll work with WG21 to resolve
https://wg21.link/lwg2959 and if a better solution is found, we can do
that instead. Until then getting some implementation and usage
experience of this solution seems valuable.




Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()

2017-05-08 Thread Jonathan Wakely via gcc-patches

On 08/05/17 12:52 +0200, Florian Weimer via libstdc++ wrote:

On 05/08/2017 12:24 PM, Jonathan Wakely wrote:

On 08/05/17 11:53 +0200, Florian Weimer via libstdc++ wrote:

On 05/05/2017 07:05 PM, Jonathan Wakely wrote:

As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.


I think the real bug is that char_traits<char16_t>::int_type is
just plain wrong.  It has to be a signed integer,


Why does it have to be signed?


Hmm.  Maybe it's not strictly required.  int_type(-1) as a distinct 
value is likely sufficient.


Agreed.

and capable of representing values in the range 0 .. 65535.
char_traits<char32_t> has a similar problem.  char_traits<wchar_t>
should be fine on glibc because WEOF is reserved, something that
is probably not the case for char32_t.


I think there are 32-bit values which are not valid UTF-32 code
points, including char32_t(-1) which we use for EOF.


I'm not sure if char32_t is restricted to UTF-32 codepoints (the 
standard does not say, I think).  But even UCS-4 is 31-bit only, so 
maybe the problem does not arise there.


It's not really clear what the encoding of char32_t is (see
http://talesofcpp.fusionfenix.com/post-10/episode-seven-one-char-to-rule-them-all
for a good analysis) but whether it's UCS-4 or UTF-32, U+FFFFFFFF is
not in the universal character set, so we can use 0xFFFFFFFF for
char_traits<char32_t>::eof().

So I think only char_traits<char16_t> has this problem.



Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()

2017-05-08 Thread Florian Weimer via gcc-patches

On 05/08/2017 12:24 PM, Jonathan Wakely wrote:

On 08/05/17 11:53 +0200, Florian Weimer via libstdc++ wrote:

On 05/05/2017 07:05 PM, Jonathan Wakely wrote:

As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.


I think the real bug is that char_traits<char16_t>::int_type is just
plain wrong.  It has to be a signed integer,


Why does it have to be signed?


Hmm.  Maybe it's not strictly required.  int_type(-1) as a distinct 
value is likely sufficient.


and capable of representing values in the range 0 .. 65535.
char_traits<char32_t> has a similar problem.  char_traits<wchar_t>
should be fine on glibc because WEOF is reserved, something that is
probably not the case for char32_t.


I think there are 32-bit values which are not valid UTF-32 code
points, including char32_t(-1) which we use for EOF.


I'm not sure if char32_t is restricted to UTF-32 codepoints (the 
standard does not say, I think).  But even UCS-4 is 31-bit only, so 
maybe the problem does not arise there.


Thanks,
Florian


Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()

2017-05-08 Thread Jonathan Wakely via gcc-patches

On 08/05/17 11:53 +0200, Florian Weimer via libstdc++ wrote:

On 05/05/2017 07:05 PM, Jonathan Wakely wrote:

As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.


I think the real bug is that char_traits<char16_t>::int_type is just
plain wrong.  It has to be a signed integer,


Why does it have to be signed?

and capable of
representing values in the range 0 .. 65535.  char_traits<char32_t>
has a similar problem.  char_traits<wchar_t> should be fine on glibc
because WEOF is reserved, something that is probably not the case for
char32_t.


I think there are 32-bit values which are not valid UTF-32 code
points, including char32_t(-1) which we use for EOF.



Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()

2017-05-08 Thread Jonathan Wakely via gcc-patches

On 08/05/17 09:52 +0200, Stephan Bergmann via libstdc++ wrote:

On 05/05/2017 07:05 PM, Jonathan Wakely wrote:

As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.

i.e. because all values of int_type are also valid values of char_type
we cannot meet the requirement that:

"The member eof() shall return an implementation-defined constant
that cannot appear as a valid UTF-16 code unit."

I've reported this as a defect, suggesting that the wording above
needs to change.

One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
always returns the same value, whether it succeeds or not. On success
it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
which is the same value. I think that can be solved with the attached
change, which preserves the invariant in [char.traits.require] that
eof() returns:

"a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
all values c."

This can be true if we ensure that to_int_type never returns the eof()
value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
doing something like this.

It means that when writing u'\uFFFF' to a streambuf we write that
character successfully, but return u'\uFFFD' instead; and when reading
u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
asymmetrical, as we can write that character but not read it back.  It
might be better to refuse to write u'\uFFFF' and write it as the
replacement character instead, but I think I prefer to write the right
character when possible. It also doesn't require any extra changes.

All tests pass with this, does anybody see any problems with this
approach?


Sounds scary to me.  As an application programmer, I'd expect to be
able to use char16_t based streams to read and write arbitrary
sequences of Unicode code points (encoded as sequences of UTF-16 code 
units).  (Think of an application temporarily storing internal strings 
to a disk file.)


Also, I'd be surprised to find this asymmetric behavior only for
U+FFFF and not for other noncharacters, and only for char16_t and not
for char32_t.


To me, the definition of char16_t's int_type and eof() sounds like a 
bug that needs fixing, not working around?


Fixing that would require changing the standard and breaking the ABI
of all existing implementations. I've opened a defect report against
that standard, but a change that requires an ABI break isn't likely to
be popular.

Changing the semantics of to_int_type for U+FFFF is far less likely to
affect any ABIs (it's a constexpr function so it's possible somebody
is using the value of to_int_type(char_type(-1)) as a template
argument, but it seems unlikely). It's a much smaller change, "allowed"
by http://www.unicode.org/faq/private_use.html#nonchar10 and it only
affects a noncharacter that is not intended for interchange anyway.

I'm not claiming it's ideal, but it fixes a bug today.



Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()

2017-05-08 Thread Florian Weimer via gcc-patches

On 05/05/2017 07:05 PM, Jonathan Wakely wrote:

As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.


I think the real bug is that char_traits<char16_t>::int_type is just
plain wrong.  It has to be a signed integer, and capable of representing
values in the range 0 .. 65535.  char_traits<char32_t> has a similar
problem.  char_traits<wchar_t> should be fine on glibc because WEOF is
reserved, something that is probably not the case for char32_t.


Thanks,
Florian


Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()

2017-05-08 Thread Stephan Bergmann via gcc-patches

On 05/05/2017 07:05 PM, Jonathan Wakely wrote:

As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.

i.e. because all values of int_type are also valid values of char_type
we cannot meet the requirement that:

"The member eof() shall return an implementation-defined constant
that cannot appear as a valid UTF-16 code unit."

I've reported this as a defect, suggesting that the wording above
needs to change.

One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
always returns the same value, whether it succeeds or not. On success
it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
which is the same value. I think that can be solved with the attached
change, which preserves the invariant in [char.traits.require] that
eof() returns:

"a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
all values c."

This can be true if we ensure that to_int_type never returns the eof()
value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
doing something like this.

It means that when writing u'\uFFFF' to a streambuf we write that
character successfully, but return u'\uFFFD' instead; and when reading
u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
asymmetrical, as we can write that character but not read it back.  It
might be better to refuse to write u'\uFFFF' and write it as the
replacement character instead, but I think I prefer to write the right
character when possible. It also doesn't require any extra changes.

All tests pass with this, does anybody see any problems with this
approach?


Sounds scary to me.  As an application programmer, I'd expect to be able
to use char16_t based streams to read and write arbitrary sequences of
Unicode code points (encoded as sequences of UTF-16 code units).  (Think 
of an application temporarily storing internal strings to a disk file.)


Also, I'd be surprised to find this asymmetric behavior only for U+FFFF
and not for other noncharacters, and only for char16_t and not for char32_t.


To me, the definition of char16_t's int_type and eof() sounds like a bug 
that needs fixing, not working around?


[PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()

2017-05-05 Thread Jonathan Wakely

As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.

i.e. because all values of int_type are also valid values of char_type
we cannot meet the requirement that:

"The member eof() shall return an implementation-defined constant
that cannot appear as a valid UTF-16 code unit."

I've reported this as a defect, suggesting that the wording above
needs to change.

One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
always returns the same value, whether it succeeds or not. On success
it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
which is the same value. I think that can be solved with the attached
change, which preserves the invariant in [char.traits.require] that
eof() returns:

"a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
all values c."

This can be true if we ensure that to_int_type never returns the eof()
value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
doing something like this.

It means that when writing u'\uFFFF' to a streambuf we write that
character successfully, but return u'\uFFFD' instead; and when reading
u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
asymmetrical, as we can write that character but not read it back.  It
might be better to refuse to write u'\uFFFF' and write it as the
replacement character instead, but I think I prefer to write the right
character when possible. It also doesn't require any extra changes.

All tests pass with this, does anybody see any problems with this
approach?


commit 8ab705e4920e933d3b0e90fd004b93d89aab8619
Author: Jonathan Wakely 
Date:   Fri May 5 16:57:07 2017 +0100

PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()

	PR libstdc++/80624
	* doc/xml/manual/status_cxx2011.xml: Document to_int_type behaviour.
	* include/bits/char_traits.h (char_traits<char16_t>::to_int_type):
	Transform eof value to U+FFFD.
	* testsuite/21_strings/char_traits/requirements/char16_t/eof.cc: New.
	* testsuite/27_io/basic_streambuf/sgetc/char16_t/80624.cc: New.
	* testsuite/27_io/basic_streambuf/sputc/char16_t/80624.cc: New.

diff --git a/libstdc++-v3/doc/xml/manual/status_cxx2011.xml b/libstdc++-v3/doc/xml/manual/status_cxx2011.xml
index 705f2ee..0fa4bc0 100644
--- a/libstdc++-v3/doc/xml/manual/status_cxx2011.xml
+++ b/libstdc++-v3/doc/xml/manual/status_cxx2011.xml
@@ -2630,6 +2630,10 @@ particular release.
   u32streampos are both synonyms for
   fpos<mbstate_t>.
   The function eof returns int_type(-1).
+  char_traits<char16_t>::to_int_type will
+  transform the "noncharacter" U+FFFF to U+FFFD (REPLACEMENT CHARACTER).
+  This is done to ensure that to_int_type never
+  returns the same value as eof, which is U+FFFF.

 

diff --git a/libstdc++-v3/include/bits/char_traits.h b/libstdc++-v3/include/bits/char_traits.h
index 75db5b8..f19120b 100644
--- a/libstdc++-v3/include/bits/char_traits.h
+++ b/libstdc++-v3/include/bits/char_traits.h
@@ -507,7 +507,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
   static constexpr int_type
   to_int_type(const char_type& __c) noexcept
-  { return int_type(__c); }
+  { return __c == eof() ? int_type(0xfffd) : int_type(__c); }
 
   static constexpr bool
   eq_int_type(const int_type& __c1, const int_type& __c2) noexcept
diff --git a/libstdc++-v3/testsuite/21_strings/char_traits/requirements/char16_t/eof.cc b/libstdc++-v3/testsuite/21_strings/char_traits/requirements/char16_t/eof.cc
new file mode 100644
index 000..05def7f
--- /dev/null
+++ b/libstdc++-v3/testsuite/21_strings/char_traits/requirements/char16_t/eof.cc
@@ -0,0 +1,31 @@
+// Copyright (C) 2017 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// { dg-do compile { target c++11 } }
+
+#include <string>
+
+
+constexpr bool not_equal_to_eof(char16_t c)
+{
+  using T = std::char_traits<char16_t>;
+  return T::eq_int_type(T::eof(), T::to_int_type(c)) == false;
+}
+
+// Last two code points of the BMP are noncharacters:
+static_assert(not_equal_to_eof(u'\uFFFE'), "U+FFFE compares unequal to eof");
+static_assert(not_equal_to_eof(u'\uFFFF'), "U+FFFF compares unequal to eof");