On Mon, Jan 15, 2018 at 11:15:01AM -0600, Eric Blake wrote:
> On 01/15/2018 11:02 AM, Daniel P. Berrange wrote:
> > Python2 did not validate locale correctness when reading input data, so
> > would happily read UTF-8 data in non-UTF-8 locales. Python3 is strict so
> > if you try to read UTF-8 data in the C locale, it will raise an error
> > for any UTF-8 bytes that aren't representable in 7-bit ascii encoding.
>
> Urgh, that sounds like a Python bug. The C locale is defined by POSIX to
> be 8-bit clean (ie. a superset of ascii with 256 characters, not strict
> ascii with only 128 characters and 128 bytes that form encoding errors).
> But that doesn't change the fact that we have to work around python's
> braindead misinterpretation of reality.
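
As a minimal sketch of the failure mode described in the quoted commit message (this is not code from the patch): on the affected Python 3 versions, text-mode open() under the C locale ends up using the ASCII codec, so any byte >= 0x80 fails to decode. The sample character and the helper name try_decode are made up for illustration.

```python
# Illustrative only: decode the same UTF-8 bytes with two codecs.
# 'ascii' stands in for what text-mode open() is assumed to pick under
# LANG=C on the affected Python 3 versions.
data = "\u00e9".encode("utf-8")  # b'\xc3\xa9', a two-byte UTF-8 sequence


def try_decode(raw, encoding):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


ascii_result = try_decode(data, "ascii")  # fails on the 0xc3 lead byte
utf8_result = try_decode(data, "utf-8")   # decodes cleanly
print(repr(ascii_result))
print(repr(utf8_result))
```

The strict ASCII decode is the one that fails, which is why the choice of locale (and therefore of default codec) matters for the schema files.
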

FYI there is some background on this behaviour here:

  https://www.python.org/dev/peps/pep-0538/

NB that doc says the new C-is-UTF-8 assumption is for Python 3.7 or later,
but Fedora backported it to F27's Python 3.6 :-) The failure can be seen
on Fedora with 3.0 -> 3.5 only. (BTW you can install many Python 3.x
versions concurrently on Fedora, which is handy for testing)

> > e.g.
> >
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54:
> > ordinal not in range(128)
> > Traceback (most recent call last):
> >   File "/tmp/qemu-test/src/scripts/qapi-commands.py", line 317, in <module>
> >     schema = QAPISchema(input_file)
> >   File "/tmp/qemu-test/src/scripts/qapi.py", line 1468, in __init__
> >     parser = QAPISchemaParser(open(fname, 'r'))
> >   File "/tmp/qemu-test/src/scripts/qapi.py", line 301, in __init__
> >     previously_included)
> >   File "/tmp/qemu-test/src/scripts/qapi.py", line 348, in _include
> >     exprs_include = QAPISchemaParser(fobj, previously_included, info)
> >   File "/tmp/qemu-test/src/scripts/qapi.py", line 271, in __init__
> >     self.src = fp.read()
> >   File "/usr/lib64/python3.5/encodings/ascii.py", line 26, in decode
> >     return codecs.ascii_decode(input, self.errors)[0]
> >
> > Many distros support a new C.UTF-8 locale that is like the C locale,
> > but with UTF-8 instead of 7-bit ASCII. That is not entirely portable
> > though, so this patch instead forces the en_US.UTF-8 locale, which
> > is pretty similar but more widely available.
> >
> > We set LANG, rather than only LC_CTYPE, since generated source ought
> > to be independant of all of the user's locale settings.
>
> s/independant/independent/
>
> LANG is the lowest-priority setting - if the user has explicitly set
> LC_CTYPE or LC_ALL, their settings override what is in LANG.
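
To make the precedence point concrete, here is a small sketch of how POSIX resolves the LC_CTYPE category from the environment: a non-empty LC_ALL wins, then a non-empty LC_CTYPE, then LANG. The function name effective_ctype is hypothetical, and the final fallback to "C" is an assumption (POSIX leaves the fully-unset case implementation-defined).

```python
def effective_ctype(env):
    """Resolve the LC_CTYPE locale name from an environment mapping.

    Follows the POSIX lookup order LC_ALL > LC_CTYPE > LANG, treating
    empty values as unset. The "C" fallback is an assumption.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


# LANG alone is honoured...
only_lang = effective_ctype({"LANG": "en_US.UTF-8"})
# ...but a user-exported LC_ALL or LC_CTYPE takes priority over it.
with_lc_all = effective_ctype({"LANG": "en_US.UTF-8", "LC_ALL": "C"})
with_lc_ctype = effective_ctype({"LANG": "en_US.UTF-8", "LC_CTYPE": "C"})
# An empty LC_ALL counts as unset, so an explicit LC_CTYPE then decides.
pinned = effective_ctype({"LC_ALL": "", "LANG": "C", "LC_CTYPE": "en_US.UTF-8"})
print(only_lang, with_lc_all, with_lc_ctype, pinned)
```

So a build that prefixes only LANG=en_US.UTF-8 still sees the C locale for character handling whenever the user has exported LC_ALL=C or LC_CTYPE=C.
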
>
> >
> > This patch only forces UTF-8 for QAPI scripts, since that is the one
> > showing the immediate error under Python3 with C locale, but potentially
> > we ought to force this for all python scripts used in the build process.
> >
> > Signed-off-by: Daniel P. Berrange <berra...@redhat.com>
> > ---
> >  Makefile | 22 ++++++++++++----------
> >  1 file changed, 12 insertions(+), 10 deletions(-)
> >
> > diff --git a/Makefile b/Makefile
> > index d86ecd2dd4..fde91cc42d 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -17,6 +17,8 @@ ifneq ($(wildcard config-host.mak),)
> >  all:
> >  include config-host.mak
> >
> > +PYTHON_UTF8 = LANG=en_US.UTF-8 $(PYTHON)
>
> I'm worried that this is not reproducible in the face of a user that
> explicitly sets different locale env-vars with higher priority than LANG.

You might remember a similar issue affecting libvirt-glib/libosinfo when
glib-mkenums was rewritten to use Python instead of Perl. For that I ended
up doing

   LC_ALL= LANG=C LC_CTYPE=en_US.UTF-8

> > +
> >  git-submodule-update:
> >
> >  .PHONY: git-submodule-update
> > @@ -471,17 +473,17 @@ qapi-py = $(SRC_PATH)/scripts/qapi.py
> > $(SRC_PATH)/scripts/ordereddict.py
> >
> >  qga/qapi-generated/qga-qapi-types.c qga/qapi-generated/qga-qapi-types.h :\
> >  $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-types.py
> > $(qapi-py)
> > -	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
> > +	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-types.py \
>
> But once we agree on the right override to stuff into PYTHON_UTF8, the
> rest of the patch converting invocations to PYTHON_UTF8 makes sense.

Any thoughts on whether we should apply this more widely to our build
to make its output predictable regardless of the user's locale?

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|