[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

Stefan Behnel Sat, 27 Apr 2019 11:42:34 -0700


Stefan Behnel <stefan...@behnel.de> added the comment:


This is a tricky decision. lxml, for example, validates user input, but that's 
because it has to process it anyway and does it along the way directly on input 
(and very efficiently in C code). ET, on the other hand, is rather lenient 
about what it allows users to do and doesn't apply much processing to user 
input. It even allows invalid trees during processing and only expects the tree 
to be serialisable when requested to serialise it.
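
To illustrate the leniency described above, here is a minimal sketch using only the standard library. The element name "item" and the sample values are made up for the example; the behaviour shown is what I would expect from current CPython 3 releases, not a guarantee of the API.

```python
import xml.etree.ElementTree as ET

# ET accepts a NUL character in text without complaint ...
elem = ET.Element("item")          # "item" is an arbitrary example tag
elem.text = "bad\x00value"

# ... and writes it out verbatim, so the result is not well-formed XML.
data = ET.tostring(elem)
print(data)

# A tree may also hold values that cannot be serialised at all. The
# assignment succeeds; the error only shows up when serialising.
broken = ET.Element("item")
broken.text = 123                  # not a string, still accepted here
try:
    ET.tostring(broken)
except TypeError as exc:
    print("serialisation failed:", exc)
```

The first case is the one this issue is about: nothing fails, but a conforming parser will reject the output.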

I think that's a fair behaviour, because most user input will be ok and 
shouldn't need to suffer the performance penalty of validating all input. 
Null-characters are a very rare thing to find in text, for example, and I think 
it's reasonable to let users handle the few cases by themselves where they can 
occur.

Note that simply replacing invalid characters with the replacement character is 
not a good solution, at least not in the general case, since it silently 
corrupts data. It's probably a better solution for users to make their code 
scream out loudly when it has to deal with data that it cannot serialise in the 
end, and to do that early on input (where it's easy to debug) rather than late 
on serialisation where it might be difficult to understand how the data became 
what it is. Trying to serialise a null-character seems to be only a symptom of a more 
important problem somewhere else in the processing pipeline.
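
As a rough sketch of what "screaming early on input" could look like in user code: the helper below rejects any character outside the XML 1.0 Char production instead of replacing it. The function name require_xml_text and its source parameter are hypothetical and not part of ET or any library.

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def require_xml_text(value, source="input"):
    """Return value unchanged, or raise ValueError naming the bad character.

    Hypothetical helper: 'source' is only used to make the message useful.
    """
    match = _INVALID_XML_CHARS.search(value)
    if match is not None:
        raise ValueError(
            f"{source}: character {match.group()!r} at offset "
            f"{match.start()} is not allowed in XML 1.0"
        )
    return value
```

Calling this at the point where data enters the program (for example, require_xml_text(row["name"], source="customers.csv"), with made-up names) puts the failure next to the data that caused it, rather than at serialisation time.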

In the end, users who *really* care about correct output should run some kind 
of schema validation over it *after* serialisation, as that would detect not 
only data issues but also structural and logical issues (such as a missing or 
empty attribute), specifically for their target data format. In some cases, it 
might even detect random data corruption due to old non-ECC RAM in the server 
machine. :)
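
The standard library has no schema validator, so real schema validation needs an external tool. As a much weaker stand-in, the sketch below re-parses the serialised bytes and applies one hand-written structural rule. The function name serialise_checked, the "item" tag and the required "id" attribute are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def serialise_checked(root, required_attrs=("id",)):
    """Serialise root, then check the bytes that were actually produced.

    Hypothetical helper: a well-formedness check plus one example rule,
    not a substitute for validating against a schema.
    """
    data = ET.tostring(root, encoding="utf-8")

    # Re-parsing catches output that is not well-formed, e.g. a NUL byte.
    try:
        reparsed = ET.fromstring(data)
    except ET.ParseError as exc:
        raise ValueError(f"output is not well-formed XML: {exc}") from exc

    # Example structural rule: every <item> needs non-empty attributes.
    for elem in reparsed.iter("item"):
        for name in required_attrs:
            if not elem.get(name):
                raise ValueError(
                    f"<item> is missing a non-empty {name!r} attribute"
                )
    return data
```

Checking the output rather than the tree means the check sees exactly what a consumer of the document would see.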

So, if someone finds a way to augment the text escaping procedure with a bit of 
character validation without making it slower (especially for the extremely 
common very short strings), then I think we can reconsider this as an 
enhancement. Until then, and seeing that no-one has come up with a patch in the 
last 10 years, I'll close this as "won't fix".

----------
dependencies:  -Document Object Model API - validation
nosy: +scoder
resolution:  -> wont fix
stage:  -> resolved
status: open -> closed
versions: +Python 3.8 -Python 3.4, Python 3.5

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue5166>
_______________________________________
