Re: Python 3 - xml - crlf handling problem

Stefan Behnel Fri, 02 Dec 2011 03:27:22 -0800

durumdara, 02.12.2011 09:13:

So: maybe I don't understand things well, but I thought that the parser
drops the "nondata" CRLF-s + other characters (rather than preserving them).


Well, it does that, at least on my side (which is not under Windows):

===================
original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>
'''

from xml.dom.minidom import parse

def main():
    f = open('test.0.xml', 'wb')
    f.write(original.strip().replace('\n', '\r\n').encode('utf8'))
    f.close()

    xo = parse('test.0.xml')
    de = xo.documentElement
    print(repr(de.childNodes[0].nodeValue))
    print(repr(de.childNodes[1].childNodes[0].nodeValue))

if __name__ == '__main__':
    main()
===================

This prints '\n    ' and '\n        AnyText\n    ' on my side, so the whitespace normalisation in the parser properly did its work.
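
The same check can be done without touching the file system. This is only a minimal sketch, not part of the test script above: the in-memory byte string `data` and the names `first_text` and `inner_text` are made up for illustration. It feeds CRLF line endings to the parser and looks at what ends up in the text nodes.

```python
from xml.dom.minidom import parseString

# Illustrative input: the same document shape as above, with CRLF line endings.
data = (b'<doc a="1">\r\n'
        b'    <element a="1">\r\n'
        b'        AnyText\r\n'
        b'    </element>\r\n'
        b'</doc>')

doc = parseString(data)
root = doc.documentElement

# Whitespace between <doc> and <element>, then the text inside <element>.
first_text = root.childNodes[0].nodeValue
inner_text = root.childNodes[1].childNodes[0].nodeValue

print(repr(first_text))
print(repr(inner_text))
print('\r' in first_text or '\r' in inner_text)
```

The text nodes still contain the line breaks and the indentation, but only as '\n'. The parser does not drop whitespace-only text. It only converts the line endings.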

Then it doesn't matter whether I read the XML from a file or create it
from code, because both should generate the SAME RESULT.
But Python doesn't do that.
If I make the XML from code, the output is without plus characters.

What do you mean by "plus characters"? It's not the "+" character that you are referring to, right? Do you mean additional characters? Such as the additional '\r'?

But Python preserves the parsed CRLF characters somewhere, and they are
also flushed into the result.

Example:

original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
     <element a="1">
         AnyText
     </element>
</doc>
'''

If I parse this and write it with toxml, the CRLF-s remain in the
output, but if I create this document line by line, there is no CRLF,
and toxml writes "one-lined" xml.

This also means that if I call toprettyxml to prettify the xml,
the file size grows.

If there is a multi-step processing queue, where two Python programs
communicate through xml files, the size can grow every time.

Py1 - reads Py2's file, processes it, and writes a result file
Py2 - reads Py1's result file, processes it, and passes it back to Py1

This can grow the file with each call, because the "pretty" CRLF-s are
not normalized out of the document.

original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
     <element a="1">
         AnyText
     </element>
</doc>
'''

import sys
from xml.dom.minidom import parse

def main():
     f = open('test.0.xml','w')
     f.write(original.strip())
     f.close()

     for i in range(1, 10 + 1):
         xo = parse('test.%d.xml' % (i - 1))
         de = xo.documentElement
         de.setAttribute('c', str(i))
         t = de.getElementsByTagName('element')[0]
         tn = t.childNodes[0]
         print (dir(t))
         print (tn)
         print (tn.nodeValue)
         tn.nodeValue = str(i) + '\t' + '\n'
         #s = xo.toxml()
         s = xo.toprettyxml()
         f = open('test.%d.xml' % i,'w')
         f.write(s)
         f.close()

     sys.exit()

And: because Python does not convert CRLF to &013; I cannot
distinguish between the "prettied source's CRLF" (loaded from a template
file), "my own pretty's CRLF" (from my own toprettyxml call), and a CRLF
really contained in the data (for example a memo field's value).

My case is that the processor application (to which I pass the XML
from Python) is sensitive to "plus CRLF"-s in text nodes, so I must do
something about these "plus" items to avoid errors in the external program.

I get these templates and input files in prettied format (with
CRLF-s), but I must "eat" them to make an XML that is one-lined if
possible.

I hope you understand my problem with it.

Still not quite, but never mind. May or may not be a problem in minidom or your code. For example, you shouldn't open the output file in text mode but in binary mode (i.e. "wb") because you are writing bytes into it.
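
If you want to stay with minidom, one possible approach is to remove the formatting whitespace after parsing and before serialising. This is a rough sketch, not something minidom provides: the helper names `strip_whitespace_nodes` and `round_trip` are made up for illustration. It also assumes that leading and trailing whitespace in text content is not significant to the consuming program, which may not hold for something like a memo field.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def strip_whitespace_nodes(node):
    """Hypothetical helper: drop whitespace-only text nodes, trim the others."""
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            if child.data.strip():
                # Assumption: surrounding whitespace is not part of the data.
                child.data = child.data.strip()
            else:
                node.removeChild(child)
                child.unlink()
        else:
            strip_whitespace_nodes(child)


def round_trip(xml_bytes):
    """Parse, remove formatting whitespace, and pretty-print back to bytes."""
    doc = parseString(xml_bytes)
    strip_whitespace_nodes(doc)
    # Passing an encoding makes toprettyxml() return bytes.
    return doc.toprettyxml(indent='  ', encoding='utf-8')


source = (b'<doc a="1">\r\n'
          b'  <element a="1">\r\n'
          b'    AnyText\r\n'
          b'  </element>\r\n'
          b'</doc>')

first = round_trip(source)
second = round_trip(first)
print(len(first), len(second))
```

Because the old indentation is thrown away before new indentation is added, repeated round trips should give the same output instead of a growing one. Since `round_trip` returns bytes, the result would be written with `open(path, 'wb')`, so no newline translation happens on the way to disk.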

Here's what I tried with ElementTree, and it seems to do what your code above wants. The indent() function is taken from Fredrik's element lib page:


http://effbot.org/zone/element-lib.htm

========================
original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>
'''

def indent(elem, level=0):
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

def main():
    f = open('test.0.xml','w', encoding='utf8')
    f.write(original.strip())
    f.close()

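    # cElementTree was removed in Python 3.9; ElementTree uses the C
    # accelerator automatically when it is available.
    from xml.etree.ElementTree import parse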

    for i in range(10):
        tree = parse('test.%d.xml' % i)
        root = tree.getroot()
        root.set('c', str(i+1))
        t = root.find('.//element')
        t.text = '%d\t\n' % (i+1)
        indent(root)
        tree.write('test.%d.xml' % (i+1), encoding='utf8')

if __name__ == '__main__':
    main()
========================
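
On Python 3.9 or later, the standard library has its own `xml.etree.ElementTree.indent()`, so the hand-written function should not be needed there. The following is a small in-memory sketch of the same loop under that version assumption. The `sizes` list is only there for illustration, to show that the output does not grow between rounds.

```python
import xml.etree.ElementTree as ET

# Illustrative input with CRLF line endings, parsed from memory.
root = ET.fromstring(
    b'<doc a="1">\r\n'
    b'    <element a="1">\r\n'
    b'        AnyText\r\n'
    b'    </element>\r\n'
    b'</doc>'
)

sizes = []
for i in range(3):
    root.set('c', str(i + 1))
    root.find('.//element').text = '%d\t\n' % (i + 1)
    # Requires Python 3.9+. Rewrites whitespace-only text/tail in place.
    ET.indent(root, space='  ')
    data = ET.tostring(root, encoding='utf-8')
    sizes.append(len(data))
    # Feed the serialised result back in, like the file round trip above.
    root = ET.fromstring(data)

print(sizes)
```

Each round replaces the previous indentation rather than adding to it, so the serialised size stays constant as long as the data itself has the same length.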

Hope that helps,

Stefan

