[issue18850] xml.etree.ElementTree accepts control chars.

2013-09-02 Thread Eli Bendersky

Eli Bendersky added the comment:

As Serhiy points out, this is a duplicate of #5166

--
superseder:  -> ElementTree and minidom don't prevent creation of not 
well-formed XML

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18850] xml.etree.ElementTree accepts control chars.

2013-09-02 Thread Eli Bendersky

Changes by Eli Bendersky :


--
resolution:  -> duplicate
stage: needs patch -> committed/rejected
status: open -> closed




[issue18850] xml.etree.ElementTree accepts control chars.

2013-09-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Isn't this a duplicate of issue5166?

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-09-01 Thread Eli Bendersky

Eli Bendersky added the comment:

Can this be transformed into a new issue that succinctly summarizes what the 
new requested feature is, and why it's useful?

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-29 Thread Stefan Behnel

Stefan Behnel added the comment:

> As a piece of advice that I hope you do not take as an insult, saying
> "in section {section} the spec says {argument}"
> is much more constructive than
> "read the spec on that", "{extremely_obvious_link}",
> at least to people not familiar with the spec and asking for the source of
> your arguments (msg196360). Can shorten threads, too.

No harm done. The reason why I just posted the spec URL is that it's actually 
the entire spec that backs the argument. XML is (essentially) specified as a 
mapping from a sequence of bytes to a hierarchical structure (and back again). 
That's why there is an XML declaration header that names the encoding, for 
example. It wouldn't be needed if XML was defined as Unicode data.
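
A minimal sketch of the bytes-to-tree mapping described above, assuming Python 3 and the standard `xml.etree.ElementTree` module. The element name and sample text are made up for illustration. The same in-memory tree serialises to different byte sequences depending on the encoding, and the declaration written for the non-default encoding is what lets a parser map the bytes back to the same text.

```python
import xml.etree.ElementTree as ET

root = ET.Element("greeting")   # hypothetical element name
root.text = "caf\u00e9"         # in memory: Unicode text

# Serialising maps the tree to bytes in a chosen encoding.
latin1 = ET.tostring(root, encoding="iso-8859-1")
utf8 = ET.tostring(root, encoding="utf-8")

# Same tree, two different byte sequences.
print(latin1)
print(utf8)

# Parsing maps either byte sequence back to the same Unicode text.
assert ET.fromstring(latin1).text == "caf\u00e9"
assert ET.fromstring(utf8).text == "caf\u00e9"
```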

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-29 Thread Michele Orrù

Michele Orrù added the comment:

> Is that your actual use case? That you *want* to store binary data in XML, 
> instead of getting it properly rejected as non well-formed content?
No, Stefan.

What I was saying in my last message was just "you're right, the user shall 
always use repr() when printing an xml tree" (msg196313) because "xml does 
*not* guarantee to have only printable chars by itself" (msg196368, msg196379).

As a piece of advice that I hope you do not take as an insult, saying
"in section {section} the spec says {argument}" 
is much more constructive than 
"read the spec on that", "{extremely_obvious_link}",
at least to people not familiar with the spec and asking for the source of your 
arguments (msg196360). Can shorten threads, too.

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Stefan Behnel

Stefan Behnel added the comment:

Is that your actual use case? That you *want* to store binary data in XML, 
instead of getting it properly rejected as non well-formed content?

Then I suggest going the canonical route of passing it through base64 first, or 
any of the other binary-to-characters encodings.
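
The base64 route suggested above can be sketched as follows. This is only an illustration under assumptions: the element name `payload` and the sample bytes are invented, and Python 3 is assumed.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary binary data, including bytes that are not legal XML characters.
blob = b"\x00\x01\x1b]0;title\x07\xff"

# Encode to plain ASCII text before it goes anywhere near the tree.
elem = ET.Element("payload")    # hypothetical element name
elem.text = base64.b64encode(blob).decode("ascii")
serialised = ET.tostring(elem)

# The document is well-formed, so it parses, and the blob decodes intact.
restored = base64.b64decode(ET.fromstring(serialised).text)
assert restored == blob
```

The serialised document then contains only base64 alphabet characters in its text, so no control character ever reaches the output.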

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Antoine Pitrou

Changes by Antoine Pitrou :


--
nosy: +christian.heimes




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Michele Orrù

Michele Orrù added the comment:

Just pointed out by a friend - 
I suppose this is insanely used to put binary blobs inside xml until "only the 
CDEnd string is recognized as markup".
That's what I needed. 

Amen.

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Michele Orrù

Michele Orrù added the comment:

> I said that (serialised) XML is defined as a sequence of bytes. 
> Read the spec on that.
And I'm saying that's inexact. I have expectations that control chars are 
escaped in the serialized xml, because the spec I'm reading says so, and 
because the documentation does not warn me about this.

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Stefan Behnel

Stefan Behnel added the comment:

We are talking about two different things here.

I said that (serialised) XML is defined as a sequence of bytes. Read the spec 
on that.

What you are talking about is the Infoset, or the parsed/generated in-memory 
XML tree. That's obviously not bytes, it's defined based on Unicode. Parsing 
and serialising does the mapping here.

The "attack" that you presented is based on serialised XML, thus on a sequence 
of bytes. What I am saying is that this "attack" can be done by any kind of 
binary data, so it's not XML specific, thus not a problem with ElementTree.

The fact that ElementTree allows you to generate non well-formed 'XML' 
containing control characters when you tell it to do so is unfortunate, but 
it's neither a security risk (you already had the non well-formed content in 
your hands *before* you passed it into ElementTree), nor clearly a bug, because 
the user specifically requested the serialisation of an in-memory tree that 
contained these control characters.

But, once again, it would be nice if ElementTree rejected this input in one way 
or another, and that's a feature request.

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Michele Orrù

Michele Orrù added the comment:

It does not seem to me to be just a "byte string" where you can put binary data.
Hence, I expect the XML tree to escape or reject those.
Hence, this is not an enhancement but a bug, unless we just want to document it.

(not going to change the metadata, otherwise we'll end up changing on each 
message).

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Michele Orrù

Michele Orrù added the comment:

"""
Document authors are encouraged to avoid "compatibility characters", as defined 
in section 2.3 of [Unicode]. The characters defined in the following ranges are 
also discouraged. They are either control characters or permanently undefined 
Unicode characters:
"""

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Michele Orrù

Michele Orrù added the comment:

>>> XML is *defined* as a stream of bytes.
>> Can you *paste* the *source* proving what you are arguing, please?
> http://www.w3.org/TR/REC-xml/
"""
The first two suggestions are directly derived from the rules given for 
identifiers in Standard Annex #31 (UAX #31) of the Unicode Standard, version 
5.0 [Unicode], and exclude all control characters, enclosing nonspacing marks, 
non-decimal numbers, private-use characters, punctuation characters (with the 
noted exceptions), symbol characters, unassigned codepoints, and white space 
characters. The other suggestions are mostly derived from Appendix B in 
previous editions of this specification.
"""

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Stefan Behnel

Stefan Behnel added the comment:

>> XML is *defined* as a stream of bytes.
> Can you *paste* the *source* proving what you are arguing, please?

http://www.w3.org/TR/REC-xml/


> python3 works with ElementTree(bytes(unicode))

What does this sentence mean?

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Michele Orrù

Michele Orrù added the comment:

> XML is *defined* as a stream of bytes.
Can you *paste* the *source* proving what you are arguing, please?  

> Regarding the API side in ElementTree, Py2 accepts byte strings and Py3 
> requires Unicode strings.
"accepts"? python3 works with ElementTree(bytes(unicode))

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Michele Orrù

Michele Orrù added the comment:

> Incidentally I read today 
> http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html
>  mentioning ^A being used. 
> Maybe that would stop working?
I don't see any problem in any xml output. Indeed:
"You can't put a nasty non-printing ASCII character in XML, so they were 
switched to something else. That is my working theory for this anyway."

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Stefan Behnel

Stefan Behnel added the comment:

> I think the point here is clarifying whether XML expects text or just a byte 
> string. In case that's a stream of bytes, I agree with you, it is more of a 
> "behaviour" problem.

XML is *defined* as a stream of bytes.

Regarding the API side in ElementTree, Py2 accepts byte strings and Py3 
requires Unicode strings. Py2 will not change in that regard, and I can't see 
this being a serious enough issue to change the ET-API there, so IMHO we can 
ignore Py2.x completely for this issue. (changing ticket accordingly)

However, Py3 will happily write out control characters if they appear in the 
Unicode text string, so the issue is the same there. A fix for Py3 would be to 
add an input validation step, preferably at serialisation time.

--
type: security -> enhancement
versions:  -Python 2.6, Python 2.7, Python 3.2, Python 3.3




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-28 Thread Michele Orrù

Michele Orrù added the comment:

> The parser *is* rejecting control characters. It's an XML parser. See the 
> example in the link you posted.
Ehrm, my apologies.

> That's not an XML specific issue. You are printing a byte string here, so
> repr() would be the right thing to use (and is actually being used
> automatically in Py3), instead of plain printing. The fact that you are
> wrapping the content in XML doesn't matter.
[citation needed] 
After a quick scan in the documentation I did not see anything mentioning this. 
Instead, I see many cases in which escape chars and binary-to-text encodings 
are mentioned.

> What I meant was: at what step of the process from creating an XML tree in
> memory to serialisation is it a problem that the tree contains control
> characters? Because once the data is serialised, it will just be rejected on
> input by any XML parser, and handling bytes data is a thing on its own (e.g.
> you could serialise to UTF16 and the result would contain null bytes - too bad).
Hmm, I think the problem lies in the expectation of having 
fromstring(tostring(tree)) = tree
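
The round-trip expectation above can be checked with a small helper. This is a sketch, assuming Python 3; `round_trips` is a hypothetical name, not part of ElementTree. It catches `ValueError` as well as `ParseError` only so that it would also cope with a serialiser that chose to reject the text, which is an assumption about possible behaviour and not something the thread confirms.

```python
import xml.etree.ElementTree as ET


def round_trips(text):
    """Return True if an element holding `text` survives tostring/fromstring."""
    elem = ET.Element("message")    # hypothetical element name
    elem.text = text
    try:
        parsed = ET.fromstring(ET.tostring(elem))
    except (ET.ParseError, ValueError):
        # Either the parser rejected the serialised bytes, or (assumption)
        # a stricter serialiser refused to write them in the first place.
        return False
    return parsed.text == text


print(round_trips("plain text"))        # True
print(round_trips("bell \x07 inside"))  # False
```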

> Unless there is a more dangerous way to exploit this that is actually due to
> XML being used, I'd suggest changing the type from "security" back to
> "behaviour".
> Or maybe even to "enhancement". The behaviour that it writes out what you
> give it isn't exactly wrong, it's just inconvenient that you have to take
> care yourself that you pass it well-formed XML content.
I think the point here is clarifying whether XML expects text or just a byte
string. In case that's a stream of bytes, I agree with you, it is more of a
"behaviour" problem.

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Martin Mokrejs

Martin Mokrejs added the comment:

Incidentally I read today 
http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html
mentioning ^A being used. Maybe that would stop working?

--
nosy: +mmokrejs




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Stefan Behnel

Stefan Behnel added the comment:

Or maybe even to "enhancement". The behaviour that it writes out what you give 
it isn't exactly wrong, it's just inconvenient that you have to take care 
yourself that you pass it well-formed XML content.

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Stefan Behnel

Stefan Behnel added the comment:

> The parser is *not* rejecting control chars.

The parser *is* rejecting control characters. It's an XML parser. See the 
example in the link you posted.
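
The parser side of this claim is easy to check. A minimal sketch, assuming Python 3; `parses` is a hypothetical helper and the sample documents are invented. A tab is a legal XML 1.0 character, while U+0001 is rejected both as a literal character and as a character reference.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if ElementTree's parser accepts `document`."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


print(parses("<m>tab\tis fine</m>"))  # True
print(parses("<m>\x01</m>"))          # False
print(parses("<m>&#1;</m>"))          # False
```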


> assume you have a script that simply stores each message it receives (from 
> stdin, from a tcp stream, whatever) inside an xml tree like 
> '{message1}{message2}',
> and prints the tree on SIGINT.

That's not an XML specific issue. You are printing a byte string here, so 
repr() would be the right thing to use (and is actually being used 
automatically in Py3), instead of plain printing. The fact that you are 
wrapping the content in XML doesn't matter.
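
The repr() point can be illustrated without any XML library at all. A sketch, assuming Python 3; the byte string below is an invented example of serialised output carrying a terminal title escape sequence.

```python
# Invented example of serialised output that carries an escape sequence.
serialised = b"<message>\x1b]0;new title\x07</message>"

# repr() renders non-printable bytes as visible backslash escapes, so
# nothing in the displayed string is interpreted by the terminal.
shown = repr(serialised)
print(shown)

assert "\x1b" not in shown
assert "\x07" not in shown
```

Writing the raw bytes to a terminal would instead hand the escape sequence to the terminal emulator, which is the behaviour the report describes.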


>> What part of the create-to-serialise process exactly is a problem here?
> ElementTree.tostring().

What I meant was: at what step of the process from creating an XML tree in 
memory to serialisation is it a problem that the tree contains control 
characters? Because once the data is serialised, it will just be rejected on 
input by any XML parser, and handling bytes data is a thing on its own (e.g. 
you could serialise to UTF16 and the result would contain null bytes - too bad).

It may just be a bad example that you chose here, but I really can't see this 
being a security problem. You are mishandling arbitrary untrusted binary data, 
that's all. Control characters are most likely not the only problem that you 
should guard against.

Unless there is a more dangerous way to exploit this that is actually due to 
XML being used, I'd suggest changing the type from "security" back to 
"behaviour".

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Michele Orrù

Michele Orrù added the comment:

> Michele, could you elaborate how you would exploit this issue as a security 
> risk?
Sure. What I meant in my message is: assume you have a script that simply 
stores each message it receives (from stdin, from a tcp stream, whatever) 
inside an xml tree like 
'{message1}{message2}',
and prints the tree on SIGINT.

What I would expect is the xml document not to allow control chars, as 
"restricted and discouraged", and consistent with lxml. 
What instead happens is that the control chars are not handled, and thus 
anybody can send control chars in my terminal. Changing the terminal title is a 
trivial example of those. 

For sure an echo server may have the same issue, but the premises are 
different, because I expect to print just a byte stream.
Mentioning this fact in the documentation may be a possible solution, but I 
believe more that keeping consistency with lxml is the right way.

> I mean, I can easily create a (non-)XML-document with control characters 
> manually, and the parser would reject it.
False? The parser is *not* rejecting control chars.
 
> What part of the create-to-serialise process exactly is a problem here?
ElementTree.tostring().

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Stefan Behnel

Stefan Behnel added the comment:

Michele, could you elaborate how you would exploit this issue as a security 
risk?

I mean, I can easily create a (non-)XML-document with control characters 
manually, and the parser would reject it.

What part of the create-to-serialise process exactly is a problem here?

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

See also issue7727. Almost any other XML generation code 
(xml.sax.saxutils.XMLGenerator, xml.dom.minidom.Element.writexml(), etc., but not 
plistlib.PlistWriter) has the same problem.

The problem with filtering control characters is that it will significantly 
slow down XML generation.

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread R. David Murray

R. David Murray added the comment:

In that case, the fix needs to be applied to 3.2 and 2.6 as well.  Or at least 
considered for application.  It could be that this will break working (though 
dangerous) programs.  I'll leave it to folks more knowledgeable in this 
particular area than I to decide where the severity/backward compatibility line 
lies in this case.

--
stage:  -> needs patch
type: behavior -> security
versions: +Python 2.6, Python 3.2




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Michele Orrù

Michele Orrù added the comment:

I suppose it is, David, if in 2 minutes flat I can change your terminal name.

--
Added file: http://bugs.python.org/file31484/inject.py




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread R. David Murray

R. David Murray added the comment:

Unless it is a security issue, this seems like the kind of fix that shouldn't 
be applied to maintenance releases.

--
nosy: +r.david.murray




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Stefan Behnel

Stefan Behnel added the comment:

Go for it. That's usually the fastest way to get things done.

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Michele Orrù

Michele Orrù added the comment:

Do you mind if I try to provide a patch and unit test myself in the next few days?

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Stefan Behnel

Stefan Behnel added the comment:

This is a bit tricky in ET because it generally allows you to stick anything 
into the Element properties (and that's a feature). So catching this at tree 
building time (as lxml.etree does) isn't really possible.

However, at least catching it in the serialiser should be possible and would 
help. Regexes for well-formed tag names and text could do the job.
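
One way the serialiser-side check could look is sketched below. This is not the patch discussed in the thread: `checked_tostring` is a hypothetical wrapper, the regex is the complement of the XML 1.0 `Char` production, and tag-name validation is left out for brevity. Python 3 is assumed.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def checked_tostring(root, **kwargs):
    """Hypothetical wrapper: validate text, tails and attribute values,
    then delegate to ET.tostring()."""
    for elem in root.iter():
        values = [elem.text, elem.tail, *elem.attrib.values()]
        for value in values:
            if isinstance(value, str):
                match = _ILLEGAL_CHAR.search(value)
                if match:
                    raise ValueError(
                        "illegal XML character %r in <%s>"
                        % (match.group(), elem.tag)
                    )
    return ET.tostring(root, **kwargs)
```

Because the check runs only when serialising, the tree itself can still hold arbitrary values in its Element properties, which matches the constraint described above. The extra pass over every string is also where the generation slowdown raised elsewhere in the thread would come from.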

--




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
nosy: +eli.bendersky, scoder, serhiy.storchaka
versions: +Python 2.7, Python 3.3, Python 3.4




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Michele Orrù

Changes by Michele Orrù :


--
components: +Library (Lib), XML
type:  -> behavior




[issue18850] xml.etree.ElementTree accepts control chars.

2013-08-27 Thread Michele Orrù

New submission from Michele Orrù:

Got from irc; 
 python bug in xml.etree.ElementTree, from version 2.7 to 3.2 
http://www.reddit.com/r/Python/comments/1l6cta/python_bug_in_xmletreeelementtree/

I think we should keep consistency with lxml and forbid control chars in 
advance. 


Attaching the test file proving the issue, slightly modified to work on 2.7 and 
3.x

--
files: test.py
messages: 196271
nosy: maker
priority: normal
severity: normal
status: open
title: xml.etree.ElementTree accepts control chars.
Added file: http://bugs.python.org/file31482/test.py
