Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

Terry Reedy Mon, 18 Mar 2019 21:48:18 -0700

On 3/18/2019 6:41 PM, Raymond Hettinger wrote:

We're having a super interesting discussion on 
https://bugs.python.org/issue34160 .  It is now marked as a release blocker and 
warrants a broader discussion.


Our problem is that at least two distinct and important users have written 
tests that depend on exact byte-by-byte comparisons of the final serialization. 
 So any changes to the XML modules will break those tests (not the applications 
themselves, just the test cases that assume the output will be forever, 
byte-by-byte identical).

In theory, the tests are incorrectly designed and should not treat the module 
output as a canonical normal form.  In practice, doing an equality test on the 
output is the simplest, most obvious approach, and likely is being done in 
other packages we don't know about yet.

With pickle, json, and __repr__, the usual way to write a test is to verify a 
roundtrip:  assert pickle.loads(pickle.dumps(data)) == data.  With XML, the 
problem is that the DOM doesn't have an equality operator.  The user is left 
with either testing specific fragments with element.find(xpath) or with using a 
standards compliant canonicalization package (not available from us). Neither 
option is pleasant.
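A minimal sketch of what a roundtrip-style comparison could look like for ElementTree, since the elements themselves offer no equality operator. The helper name elements_equal is hypothetical (it is not part of the standard library), and ignoring surrounding whitespace in text and tail is an assumption that will not suit every document:

```python
import xml.etree.ElementTree as ET


def elements_equal(a, b):
    """Hypothetical helper: compare two elements by content, not by bytes.

    Attributes are compared as dicts, so their order is ignored.
    Assumption: leading/trailing whitespace in text and tail is insignificant.
    """
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if (a.tail or "").strip() != (b.tail or "").strip():
        return False
    if len(a) != len(b):
        return False
    # Children are compared pairwise and in document order.
    return all(elements_equal(x, y) for x, y in zip(a, b))


left = ET.fromstring('<item id="1" name="x"><child>text</child></item>')
right = ET.fromstring('<item name="x" id="1"><child>text</child></item>')
changed = ET.fromstring('<item id="2" name="x"><child>text</child></item>')

print(elements_equal(left, right))
print(elements_equal(left, changed))
```

A test built this way checks the parsed structure, so it does not depend on how any particular release orders attributes when writing them out.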

The code in the current 3.8 alpha differs from 3.7 in that it removes attribute 
sorting and instead preserves the order the user specified when creating an 
element.  As far as I can tell, there is no objection to this as a feature.  
The problem is what to do about the existing tests in third-party code, what 
guarantees we want to make going forward, and what do we recommend as a best 
practice for testing XML generation.

Things we can do:

1) Revert back to the 3.7 behavior. This, of course, makes all the tests pass :-)
The downside is that it perpetuates the practice of bytewise equality tests 
and locks in all implementation quirks forever.  I don't know of anyone 
advocating this option, but it is the simplest thing to do.


If it comes down to doing *something* to unblock the release ...

1b) Revert to 3.7 *and* document that byte equality with current output is
*not* guaranteed.

2) Go into every XML module and add attribute sorting options to each function 
that generates XML.  This gives users a way to make their tests pass for now. 
There are several downsides. a) It grows the API in a way that is inconsistent 
with all the other XML packages I've seen. b) We'll have to test, maintain, and 
document the API forever -- the API is already large and time consuming to 
teach. c) It perpetuates the notion that bytewise equality tests are the right 
thing to do, so we'll have this problem again if we substitute in another code 
generator or alter any of the other implementation quirks (e.g. how CDATA 
sections are serialized).

3) Add a standards compliant canonicalization tool (see 
https://en.wikipedia.org/wiki/Canonical_XML ).  This is likely to be the 
right-way-to-do-it but takes time and energy.

4) Fix the tests in the third-party modules to be more focused on their actual 
test objectives, the semantics of the generated XML rather than the exact 
serialization.  This option would seem like the right-thing-to-do but it isn't 
trivial because the entire premise of the existing test is invalid.  For every 
case, we'll actually have to think through what the test objective really is.
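As a rough illustration of how option 3 would change such tests, the sketch below compares canonical forms rather than raw output. It assumes Python 3.8 or later, where xml.etree.ElementTree provides canonicalize(); the helper name same_canonical_form and the sample documents are hypothetical:

```python
from xml.etree.ElementTree import canonicalize  # assumes Python 3.8+


def same_canonical_form(xml_a, xml_b):
    """Hypothetical helper: True if both documents canonicalize identically."""
    return canonicalize(xml_a) == canonicalize(xml_b)


# Same content, different attribute order in the serialized text.
first = '<item id="1" name="x"><child>text</child></item>'
second = '<item name="x" id="1"><child>text</child></item>'

print(first == second)
print(same_canonical_form(first, second))
```

The comparison is still made on strings, but on strings whose form is fixed by the canonicalization rules rather than by implementation quirks of a given release.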

Of these, option 2 is my least preferred.  Ideally, we don't guarantee bytewise 
identical output across releases, and ideally we don't grow a new API that 
perpetuates the issue. That said, I'm not wedded to any of these options and 
just want us to do what is best for the users in the long run.


The point of 1b would be to give us time to do that if more is needed.

Regardless of option chosen, we should make explicit whether or not the Python 
standard library modules guarantee cross-release bytewise identical output for 
XML. That is really the core issue here.  Had we had an explicit notice one way 
or the other, there wouldn't be an issue now.

I have not read the XML docs but based on this and the issue discussion
and what I think our general guarantee policy has been, I would consider
that there is not one. (I am thinking about things like garbage
collection, stable sorting, and set/dict iteration order.)


--
Terry Jan Reedy
