Re: toprettyxml messes up with whitespaces

2007-10-09 Thread Jorgen Bodde
Dear list,

Thanks for the suggestions and clarification. After playing with XML
for a while I noticed that whitespace can indeed be more important than I
thought. I came to the following conclusions:

1. Removing whitespace was done by my code, not by xml.dom.minidom,
so I regret saying that it removed whitespace automatically.
2. toprettyxml() should however be smarter about outputting the XML. If
it adds whitespace for the sake of formatting, it should check how
much whitespace is already there. Consecutive read / modify /
write actions should not cause an explosive growth of whitespace.
When I use toprettyxml() I am obviously not interested in whitespace
in front of the text in the nodes, or else I would have output it
differently.
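
A minimal sketch of one way to keep repeated read / modify / write
cycles stable: drop the whitespace-only text nodes from the DOM before
calling toprettyxml(). The helper names drop_whitespace_only_text and
load_normalised and the sample document are made up for illustration.
The sketch assumes that whitespace-only text between elements carries
no meaning in your application, which the XML parser itself cannot know.

```python
from xml.dom import Node, minidom


def drop_whitespace_only_text(node):
    """Recursively remove text nodes that contain nothing but whitespace."""
    # Iterate over a copy, because removeChild() mutates childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            drop_whitespace_only_text(child)


def load_normalised(xml_text):
    """Parse xml_text and strip the whitespace-only text nodes (hypothetical helper)."""
    doc = minidom.parseString(xml_text)
    drop_whitespace_only_text(doc)
    return doc


# Hypothetical sample document, loosely based on the one in this thread.
source = (
    '<config>'
    '<var name="APPNAME" status="undefined">My First App</var>'
    '<var name="OTHER" status="ok">Something</var>'
    '</config>'
)

first = load_normalised(source).toprettyxml(indent="  ")
second = load_normalised(first).toprettyxml(indent="  ")
third = load_normalised(second).toprettyxml(indent="  ")

print(first)
print(first == second == third)
```

On recent Python 3 releases the second and third cycles should produce
the same text as the first; text that is not whitespace-only, such as
the content of var, is left untouched by the helper.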

Thanks all for the feedback,
- Jorgen
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: toprettyxml messes up with whitespaces

2007-10-03 Thread Jorgen Bodde
Hi there,

Thank you for confirming this. I did manage a workaround: when
reading back the XML file, I strip it of its whitespace before I
parse it. Then when writing it back no excessive whitespace is
appended. My best guess is that toprettyxml is not intelligently
handling whitespace that is already there, and bluntly appends more
whitespace to it, making it grow exponentially.

This is the snippet:

xmlstr = ''
with open(filename, 'rt') as f:
    for line in f:
        # drop indentation and the line ending
        s = line.strip(' \t\n')
        if s:
            xmlstr += s + ' '  # space needed for spanning text nodes

And then I simply use parseString instead of parse. But honestly, I
think it is a bug, because the XML standard also says that whitespaces
before normal text should be ignored, and I do not see it back as text
when I read the node, so why preserve it and mess up the formatting in
the end?

Regards,
- Jorgen




On 10/2/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> On Oct 2, 11:43 am, Jorgen Bodde [EMAIL PROTECTED] wrote:
> > Hi all,
> >
> > I parse an XML file, replace a node with a new one (like updating
> > cache) and write it back. Every write, new spaces are added. For
> > example, first read - update - write cycle:
> >
> > <var name="APPNAME" status="undefined">
> >  My First App
> > </var>
> >
> > Second cycle:
> >
> > <var name="APPNAME" status="undefined">
> >  My First App
> > </var>
> >
> > Third cycle:
> >
> > <var name="APPNAME" status="undefined">
> > My First App
> > </var>
> >
> > And this goes on. The node is one that is not touched in the XML, it
> > is simply written back after reading. I have the same with void spaces
> > in between the nodes, I managed to compensate that by stripping the
> > lines.
> >
> > I would like to use toprettyxml to make it user editable and viewable.
> > But this is really weird. How can I circumvent this behaviour?
> >
> > regards,
> > - Jorgen
>
> I had similar problems and ended up switching to the lxml package to
> solve the issue. I think you can do it with ElementTree too. Maybe
> somebody with more experience with the xml / minidom modules will show
> up soon.
>
> Mike



Re: toprettyxml messes up with whitespaces

2007-10-03 Thread Paul Boddie
On 3 Oct, 11:30, Jorgen Bodde [EMAIL PROTECTED] wrote:

> Thank you for confirming this. I did manage a workaround: when
> reading back the XML file, I strip it of its whitespace before I
> parse it. Then when writing it back no excessive whitespace is
> appended. My best guess is that toprettyxml is not intelligently
> handling whitespace that is already there, and bluntly appends more
> whitespace to it, making it grow exponentially.

This seems like a reasonable explanation without having looked at the
source code myself.

[...]

> And then I simply use parseString instead of parse. But honestly, I
> think it is a bug, because the XML standard also says that whitespaces
> before normal text should be ignored, and I do not see it back as text
> when I read the node, so why preserve it and mess up the formatting in
> the end?

Which part of the standard is this? Here's the XML 1.0 specification's
section on whitespace:

http://www.w3.org/TR/2006/REC-xml-20060816/#sec-white-space

It seems to me that applications (and the libraries which serve them)
can choose what to do unless xml:space is set to preserve. It does
seem odd that the toprettyxml method chooses to respect existing
whitespace whilst also disrupting it by adding more, however.

Paul



Re: toprettyxml messes up with whitespaces

2007-10-03 Thread Jorgen Bodde
Hi Paul,

> This seems like a reasonable explanation without having looked at the
> source code myself.

It's by thorough investigation ;-)

> Which part of the standard is this? Here's the XML 1.0 specification's
> section on whitespace:
>
> http://www.w3.org/TR/2006/REC-xml-20060816/#sec-white-space

Well, section 2.10, if I quote:

<quote>
Such white space is typically not intended for inclusion in the
delivered version of the document. On the other hand, "significant"
white space that should be preserved in the delivered version is
common, for example in poetry and source code.
</quote>

I interpret "significant" whitespaces as the ones between the words,
if whitespaces occur at the beginning of a line due to an indent like

<value>
 This is indented text
</value>

We can assume that the spaces in front of it are not significant
whitespaces. Because when I read the text node in python and it is not
included, I see no reason why it should be preserved. And if it is
preserved in the xml DOM, toprettyxml should first investigate how
many whitespaces are already there before adding more to indent the
text.

Also this happens. First the nodes are properly shown:

<value>
<a> ... </a>
</value>
<value>
<a> ... </a>
</value>

When writing back, this sometimes happens (mind the blank lines):

<value>
<a> ... </a>
</value>

<value>
<a> ... </a>
</value>

And the next time, the space between the nodes is expanded again:

<value>
<a> ... </a>
</value>


<value>
<a> ... </a>
</value>

(etc) .. so when reading, modifying, writing XML files, the empty
blank lines will grow exponentially.

> It seems to me that applications (and the libraries which serve them)
> can choose what to do unless xml:space is set to preserve. It does
> seem odd that the toprettyxml method chooses to respect existing
> whitespace whilst also disrupting it by adding more, however.

I would think (simplistic I'm sure) that if spaces are that important,
you can always use a CDATA tag which should treat the text inside as
raw data without any formatting and whitespace changes.

Should I file this as a bug to be solved? I have my workaround now,
but I read online that more people seem to have run into this.

Regards,
- Jorgen


Re: toprettyxml messes up with whitespaces

2007-10-03 Thread Jim
On Oct 3, 6:18 am, Jorgen Bodde [EMAIL PROTECTED] wrote:
> Should I file this as a bug to be solved? I have my workaround now,
> but I read online that more people seem to have run into this.

Perhaps it is not a bug in that it does not violate the standard.  But
I know that I have been annoyed by it any number of times.  I think it
is fair to say that it violates the principle of least surprise.

IMHO <action><p>Then a shot rang out.\nHe shouted.</p></action>
should be pretty-printed as

<action>
  <p>Then a shot rang out.
He shouted.</p>
</action>

That is, I perceive that the right behavior is to not add white
space to the textual data.
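
For comparison, a small sketch of what minidom does with this input on
recent Python 3 releases, where (as far as I know) an element whose
only child is a text node is written without added whitespace. The
2007-era behaviour discussed in this thread was different, so treat
this as an observation about newer versions and not as a description of
the module at the time.

```python
from xml.dom import minidom

# The "\n" here is a real newline inside the text of <p>.
source = "<action><p>Then a shot rang out.\nHe shouted.</p></action>"
doc = minidom.parseString(source)

# Serialise only the root element so the XML declaration is left out.
pretty = doc.documentElement.toprettyxml(indent="  ")
print(pretty)
```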

No doubt this is a matter of taste and of intended audience (and maybe
there are complications that I don't see).  But let me urge you to
send the maintainers something.

Jim Hefferon



Re: toprettyxml messes up with whitespaces

2007-10-03 Thread Marc 'BlackJack' Rintsch
On Wed, 03 Oct 2007 12:18:45 +0200, Jorgen Bodde wrote:

> > Which part of the standard is this? Here's the XML 1.0 specification's
> > section on whitespace:
> >
> > http://www.w3.org/TR/2006/REC-xml-20060816/#sec-white-space
>
> Well, section 2.10, if I quote:
>
> <quote>
> Such white space is typically not intended for inclusion in the
> delivered version of the document. On the other hand, "significant"
> white space that should be preserved in the delivered version is
> common, for example in poetry and source code.
> </quote>
>
> I interpret "significant" whitespaces as the ones between the words,
> if whitespaces occur at the beginning of a line due to an indent like

Significant whitespace is all whitespace in nodes that may contain text. 
You need a DTD or schema to decide this, that's why all pretty printing
without a DTD or schema is broken IMHO.  Because you then simply don't
know if it is safe to strip or add whitespace.

> <value>
>  This is indented text
> </value>
>
> We can assume that the spaces in front of it are not significant
> whitespaces.

I can't.  You are just guessing.

> Because when I read the text node in python and it is not
> included, I see no reason why it should be preserved.

But it should be included.
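
A quick way to check this is to parse the example and look at the text
node directly. In this sketch (the sample string is an assumption for
illustration) minidom hands back the newline and the indentation as
part of the node's data; removing them is a separate, explicit step
taken by the application.

```python
from xml.dom import minidom

doc = minidom.parseString("<value>\n    This is indented text\n</value>")
text_node = doc.documentElement.firstChild

# repr() makes the leading newline and spaces visible.
print(repr(text_node.data))

# Stripping is the application's decision, not the parser's.
stripped = text_node.data.strip()
print(repr(stripped))
```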

Ciao,
Marc 'BlackJack' Rintsch


Re: toprettyxml messes up with whitespaces

2007-10-03 Thread Martin v. Löwis
> <quote>
> Such white space is typically not intended for inclusion in the
> delivered version of the document. On the other hand, "significant"
> white space that should be preserved in the delivered version is
> common, for example in poetry and source code.
> </quote>
>
> I interpret "significant" whitespaces as the ones between the words,

This interpretation is incorrect. It's not really possible to tell what
whitespace is significant from looking just at the document; the
classification into significant and insignificant is up to the
application, not the XML processor.

There is also the concept of "ignorable white space" in SAX (and other
APIs); by this, white space in element content is meant. This is
supported by the XML recommendation with the sentence
"A validating XML processor MUST also inform the application which of
these characters constitute white space appearing in element content."
(You can only know whether it's in element content if you validate.)

> We can assume that the spaces in front of it are not significant
> whitespaces.

No, we cannot. Maybe your application can assume that; the XML
processor cannot. In fact, the XML recommendation FORBIDS the XML
processor from stripping white space.

> (etc) .. so when reading, modifying, writing XML files, the empty
> blank lines will grow exponentially.

Not sure why you keep saying that the growth is exponential; I believe
it's linear (with the number of read-write-cycles), not exponential.
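
The growth rate can be measured with a short loop. This sketch (the
sample document and the helper name pretty_roundtrip are made up for
illustration) feeds the output of toprettyxml() back into the parser
several times and records the length after each cycle. On recent
Python 3 releases each whitespace-only text node should pick up the
same number of extra characters per cycle, so the differences between
consecutive lengths come out constant, which is linear growth.

```python
from xml.dom import minidom


def pretty_roundtrip(xml_text):
    """Parse xml_text and serialise it again with toprettyxml()."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")


# Hypothetical starting document without any whitespace between elements.
text = "<root><a>x</a><b>y</b></root>"

lengths = []
for cycle in range(5):
    text = pretty_roundtrip(text)
    lengths.append(len(text))

# Difference in length between each cycle and the one before it.
growth = [after - before for before, after in zip(lengths, lengths[1:])]

print(lengths)
print(growth)
```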

> I would think (simplistic I'm sure) that if spaces are that important,
> you can always use a CDATA tag which should treat the text inside as
> raw data without any formatting and whitespace changes.

That is definitely simplistic. CDATA has no significance on formatting.

> Should I file this as a bug to be solved? I have my workaround now,
> but I read online that more people seem to have run into this.

Feel free to come up with a patch. It is questionable whether a bug
report will help; there is a good chance that it stays open for several
years.

Regards,
Martin


toprettyxml messes up with whitespaces

2007-10-02 Thread Jorgen Bodde
Hi all,

I parse an XML file, replace a node with a new one (like updating
cache) and write it back. Every write, new spaces are added. For
example, first read - update - write cycle:

<var name="APPNAME" status="undefined">
 My First App   
</var>

Second cycle:

<var name="APPNAME" status="undefined">
 My First App   

</var>

Third cycle:

<var name="APPNAME" status="undefined">
   My First App 

</var>


And this goes on. The node is one that is not touched in the XML, it
is simply written back after reading. I have the same with void spaces
in between the nodes, I managed to compensate that by stripping the
lines.

I would like to use toprettyxml to make it user editable and viewable.
But this is really weird. How can I circumvent this behaviour?

regards,
- Jorgen


Re: toprettyxml messes up with whitespaces

2007-10-02 Thread kyosohma
On Oct 2, 11:43 am, Jorgen Bodde [EMAIL PROTECTED] wrote:
> Hi all,
>
> I parse an XML file, replace a node with a new one (like updating
> cache) and write it back. Every write, new spaces are added. For
> example, first read - update - write cycle:
>
> <var name="APPNAME" status="undefined">
>  My First App
> </var>
>
> Second cycle:
>
> <var name="APPNAME" status="undefined">
>  My First App
> </var>
>
> Third cycle:
>
> <var name="APPNAME" status="undefined">
> My First App
> </var>
>
> And this goes on. The node is one that is not touched in the XML, it
> is simply written back after reading. I have the same with void spaces
> in between the nodes, I managed to compensate that by stripping the
> lines.
>
> I would like to use toprettyxml to make it user editable and viewable.
> But this is really weird. How can I circumvent this behaviour?
>
> regards,
> - Jorgen

I had similar problems and ended up switching to the lxml package to
solve the issue. I think you can do it with ElementTree too. Maybe
somebody with more experience with the xml / minidom modules will show
up soon.
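
As a rough sketch of the ElementTree route: xml.etree.ElementTree.indent(),
available in Python 3.9 and later (so not an option when this thread
was written), rewrites whitespace-only text between elements instead of
adding to it. The helper name reindent and the sample document are
assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def reindent(xml_text):
    """Parse xml_text, re-indent it in place and return it as a string."""
    root = ET.fromstring(xml_text)
    ET.indent(root, space="  ")  # requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample document, loosely based on the one in this thread.
source = (
    '<config>'
    '<var name="APPNAME" status="undefined">My First App</var>'
    '</config>'
)

once = reindent(source)
twice = reindent(once)

print(once)
print(once == twice)
```

Because indent() replaces the existing whitespace-only text and tails,
running the cycle again should give the same output, and the text
inside var is not touched.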

Mike
