On 3/18/2019 6:41 PM, Raymond Hettinger wrote:
We're having a super interesting discussion on
https://bugs.python.org/issue34160 . It is now marked as a release blocker and
warrants a broader discussion.
Our problem is that at least two distinct and important users have written
tests that depend on exact byte-by-byte comparisons of the final serialization.
So any changes to the XML modules will break those tests (not the applications
themselves, just the test cases that assume the output will be forever,
byte-by-byte identical).
In theory, the tests are incorrectly designed and should not treat the module
output as a canonical normal form. In practice, doing an equality test on the
output is the simplest, most obvious approach, and likely is being done in
other packages we don't know about yet.
With pickle, json, and __repr__, the usual way to write a test is to verify a
roundtrip: assert pickle.loads(pickle.dumps(data)) == data. With XML, the
problem is that the DOM doesn't have an equality operator. The user is left
with either testing specific fragments with element.find(xpath) or with using a
standards compliant canonicalization package (not available from us). Neither
option is pleasant.
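To make the contrast concrete, here is a minimal sketch using only the standard library. The sample data and the <order> document are made up for illustration. It shows the usual roundtrip test for pickle and json, why the same idea does not work for ElementTree elements, and what the fragment-by-fragment alternative looks like.

```python
import json
import pickle
import xml.etree.ElementTree as ET

# Illustrative sample data (an assumption, not taken from any real test suite).
data = {"id": 7, "tags": ["a", "b"]}

# Roundtrip tests: compare the decoded value, never the serialized bytes.
assert pickle.loads(pickle.dumps(data)) == data
assert json.loads(json.dumps(data)) == data

# Elements define no value-based equality, so two separately parsed
# copies of the same document fall back to identity comparison.
xml_text = '<order id="7"><item sku="x1">widget</item></order>'
first = ET.fromstring(xml_text)
second = ET.fromstring(xml_text)
elements_compare_equal = (first == second)

# The fragment-by-fragment alternative: assert on specific nodes.
item = first.find("./item[@sku='x1']")
assert item is not None and item.text == "widget"
assert first.get("id") == "7"
```

The fragment assertions work, but every property the test cares about has to be spelled out by hand, which is why the byte-comparison shortcut is so tempting.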
The code in the current 3.8 alpha differs from 3.7 in that it removes attribute
sorting and instead preserves the order the user specified when creating an
element. As far as I can tell, there is no objection to this as a feature.
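As a rough illustration of the change, the sketch below builds an element with attributes given out of alphabetical order and then sorts them explicitly before serializing. The helper name sort_attributes is hypothetical, not part of the xml package; it only relies on Element.attrib being an ordinary dict.

```python
import xml.etree.ElementTree as ET

def sort_attributes(root):
    """Reorder every element's attributes alphabetically, in place.

    Hypothetical helper for illustration; not a standard library API.
    """
    for el in root.iter():
        items = sorted(el.attrib.items())
        el.attrib.clear()
        el.attrib.update(items)

root = ET.Element("a", z="1", b="2")

# 3.7 sorts attributes when serializing; the 3.8 alpha keeps the order
# in which they were specified, so this value differs between the two.
as_written = ET.tostring(root)

# Sorting explicitly gives the same attribute order on either version.
sort_attributes(root)
as_sorted = ET.tostring(root)
```

A test that needs the old ordering can opt in this way without the library having to guarantee it.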
The problem is what to do about the existing tests in third-party code, what
guarantees we want to make going forward, and what we recommend as a best
practice for testing XML generation.
Things we can do:
1) Revert to the 3.7 behavior. This, of course, makes all the tests pass :-)
The downside is that it perpetuates the practice of bytewise equality tests
and locks in all implementation quirks forever. I don't know of anyone
advocating this option, but it is the simplest thing to do.
If it comes down to doing *something* to unblock the release ...
1b) Revert to 3.7 *and* document that byte equality with current output
is *not* guaranteed.
2) Go into every XML module and add attribute sorting options to each function
that generates XML. This gives users a way to make their tests pass for now.
There are several downsides. a) It grows the API in a way that is inconsistent
with all the other XML packages I've seen. b) We'll have to test, maintain, and
document the API forever -- the API is already large and time consuming to
teach. c) It perpetuates the notion that bytewise equality tests are the right
thing to do, so we'll have this problem again if we substitute in another code
generator or alter any of the other implementation quirks (e.g. how CDATA
sections are serialized).
3) Add a standards compliant canonicalization tool (see
https://en.wikipedia.org/wiki/Canonical_XML ). This is likely to be the
right-way-to-do-it but takes time and energy.
4) Fix the tests in the third-party modules to be more focused on their actual
test objectives, the semantics of the generated XML rather than the exact
serialization. This option would seem like the right-thing-to-do but it isn't
trivial because the entire premise of the existing test is invalid. For every
case, we'll actually have to think through what the test objective really is.
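For options 3 and 4, a test can compare what the XML means instead of how it happens to be written. The following is a minimal sketch, not a complete solution: elements_equal is a hypothetical helper, and its choice to ignore leading and trailing whitespace in text is an assumption that will not suit every format. The last lines assume Python 3.8 or later, where xml.etree.ElementTree.canonicalize() is available.

```python
import xml.etree.ElementTree as ET

def elements_equal(a, b):
    """Compare two elements by tag, attributes, text and children.

    Hypothetical helper for illustration. Attribute order is ignored
    because attrib is compared as a dict; surrounding whitespace in
    text and tail is ignored by assumption.
    """
    def norm(s):
        return (s or "").strip()

    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and norm(a.text) == norm(b.text)
        and norm(a.tail) == norm(b.tail)
        and len(a) == len(b)
        and all(elements_equal(x, y) for x, y in zip(a, b))
    )

expected = ET.fromstring('<a b="2" z="1"><c>hi</c></a>')
reordered = ET.fromstring('<a z="1" b="2">\n  <c>hi</c>\n</a>')
changed = ET.fromstring('<a z="1" b="3"><c>hi</c></a>')

same = elements_equal(expected, reordered)     # differs only in layout
different = elements_equal(expected, changed)  # attribute value differs

# Canonical XML puts attributes in a defined order, so documents that
# differ only in attribute order canonicalize to the same string.
canon_same = (
    ET.canonicalize('<a b="2" z="1"><c>hi</c></a>')
    == ET.canonicalize('<a z="1" b="2"><c>hi</c></a>')
)
```

Either approach keeps the test independent of attribute order and of other serializer details, at the cost of deciding up front which differences matter.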
Of these, option 2 is my least preferred. Ideally, we don't guarantee bytewise
identical output across releases, and ideally we don't grow a new API that
perpetuates the issue. That said, I'm not wedded to any of these options and
just want us to do what is best for the users in the long run.
The point of 1b would be to give us time to do that if more is needed.
Regardless of the option chosen, we should make explicit whether or not the Python
standard library modules guarantee cross-release bytewise identical output for
XML. That is really the core issue here. Had we had an explicit notice one way
or the other, there wouldn't be an issue now.
I have not read the XML docs, but based on this, the issue discussion,
and what I think our general guarantee policy has been, I would assume
that there is no such guarantee. (I am thinking about things like garbage
collection, stable sorting, and set/dict iteration order.)
--
Terry Jan Reedy