Round-tripping again

Alex Rosen Wed, 18 Sep 2002 15:30:28 -0700

A few weeks ago I e-mailed this list, asking about adding round-tripping
support to Xerces - i.e. the ability to output the exact same XML file as
was read in, or at least very close to it. In other words, preserving more
of the non-infoset information that normally gets dropped.


I spent some time working on this, and have a prototype done, which uses
Augmentations to pass in more information about the "raw text" of the
original document than Xerces normally gives. An example is the amount of
whitespace between attributes. Saving this extra information (and using it
on output) means that if the user puts each attribute on its own line, that
will be preserved on output, instead of collapsing them back onto one line.
These sorts of modifications are semantically equivalent, but it really
annoys users when you reformat their document out from under them.
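
As a rough illustration of the idea (not the prototype's actual code), the
sketch below keeps the raw text around each attribute and replays it on
output. The names RawAttr and RoundTripTagWriter are hypothetical and are not
Xerces or dom4j APIs; in the prototype this kind of information would travel
from the parser via Augmentations rather than being built by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the non-infoset details of one attribute.
final class RawAttr {
    final String leadingSpace; // whitespace before the attribute name
    final String name;
    final String equalsText;   // "=" plus any whitespace around it
    final char quote;          // ' or " as used in the source
    final String rawValue;     // value exactly as written, entities unexpanded

    RawAttr(String leadingSpace, String name, String equalsText,
            char quote, String rawValue) {
        this.leadingSpace = leadingSpace;
        this.name = name;
        this.equalsText = equalsText;
        this.quote = quote;
        this.rawValue = rawValue;
    }
}

// Hypothetical writer that replays the saved raw text instead of reformatting.
final class RoundTripTagWriter {
    static String writeEmptyElement(String name, List<RawAttr> attrs,
                                    String trailingSpace) {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (RawAttr a : attrs) {
            sb.append(a.leadingSpace).append(a.name).append(a.equalsText)
              .append(a.quote).append(a.rawValue).append(a.quote);
        }
        return sb.append(trailingSpace).append("/>").toString();
    }

    public static void main(String[] args) {
        // Two attributes, each on its own line, as a user might have typed them.
        List<RawAttr> attrs = new ArrayList<>();
        attrs.add(new RawAttr("\n      ", "b", "     = ", '\'', " \" "));
        attrs.add(new RawAttr("\n      ", "c", "     = ", '"', " ' "));
        System.out.println(writeEmptyElement("child", attrs, "\n  "));
    }
}
```

A writer without the saved text has to pick one quote character and one
spacing convention, which is how the attributes end up collapsed onto one
line.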

The particular project that needs this is a dom4j project, so I also created
a special dom4j reader that takes this extra information that's given by the
parser and stores it in each dom4j node it creates, and a writer that uses
this saved information to write out a more accurate version of the output
document. (This could easily be extended to DOM and JDOM.) I've attached an
example. Sample.xml is the source file, rt-output.xml is the output using
the new round-trip-enabled Xerces/dom4j code, and the other two are the
output using standard Xerces/dom4j (in both standard and pretty-printing
modes). Not everything is identical, but it's much, much better.

I think it would be nice if this feature were added to Xerces. It fulfills a
significant need, and I don't think it adds any overhead when it's turned
off, and probably only minimal overhead when it's turned on. It currently
doesn't cover many of the less-used areas of XML (notations, etc.), but I
think it does a very good job of covering the common cases.

There also happened to be a similar thread going on at the same time as my
original post, which I'd like to respond to:

http://marc.theaimsgroup.com/?l=xerces-j-dev&m=103029884901546&w=2

> I can understand the cases in which people would like to
> be able to do this but I also realize what it would take
> to implement it. ;)

I don't think the implementation is too bad. It's not trivial, but it's not
unreasonably complex either.

> The "limited usefulness" that I was referring to was the
> fact that reporting character offsets only works if the
> parsed source is already a character stream. If it's
> anything else (say a byte stream in UTF8 or Shift_JIS)
> then the application can't map those offsets back to the
> source without re-reading the file.

But there's *always* a character stream (Reader). Xerces creates one if it's
not handed one. The easy way is to have Xerces send the actual text along to
the user. (The other way is to have the user override createReader() to get
his hands on the relevant character stream, which turns out to be a little
ugly, but works fine.) Thus it's always applicable, even when you hand
Xerces an InputStream. And I think it would be useful to a significant
number of users.
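
A minimal sketch of the second approach, assuming the application can wrap
whatever Reader ends up being used. RecordingReader is a hypothetical helper
built on java.io only, not a Xerces class, and the hook for installing it
(createReader() in my case) is not shown. It records the decoded characters,
so character offsets can be mapped back to text even when the source was a
byte stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: remembers every character read through it.
// skip(), mark() and reset() are not handled in this sketch.
final class RecordingReader extends FilterReader {
    private final StringBuilder seen = new StringBuilder();

    RecordingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c >= 0) seen.append((char) c);
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) seen.append(buf, off, n);
        return n;
    }

    // Returns the source text between two character offsets.
    String rawText(int start, int end) {
        return seen.substring(start, end);
    }

    // Decodes UTF-8 bytes through a RecordingReader, standing in for a parser.
    static RecordingReader readAll(byte[] utf8Bytes) {
        RecordingReader r = new RecordingReader(new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8));
        char[] buf = new char[64];
        try {
            while (r.read(buf, 0, buf.length) != -1) {
                // a real parser would consume buf here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return r;
    }

    public static void main(String[] args) {
        // The accented letter takes two bytes in UTF-8, so the end tag starts
        // at character offset 4 but at byte offset 5.
        byte[] bytes = "<a>\u00F4</a>".getBytes(StandardCharsets.UTF_8);
        RecordingReader r = readAll(bytes);
        System.out.println(r.rawText(4, 8));
    }
}
```

Because the offsets index into the recorded characters rather than the
original bytes, the encoding of the input stops mattering once the text has
been decoded.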

So is there any chance of this modification making it into Xerces? I'd be
happy to send a patch once it's cleaned up a bit.

Thanks,
Alex

<?xml   version="1.0"   encoding="ISO-8859-1"   standalone="no"?>
<!DOCTYPE 
      parent 
      SYSTEM 'test.dtd'
[
  <!--comment-->  <!--comment-->
  <!ELEMENT A (B)*>
]>
<parent  >
  <!--comment-->  <!--comment-->
  <child 
      a     = "a b [&lt;] [&apos;] [&#10;] [&#x00F4;]"
      b     = ' " '
      c     = " ' "
  />
  <ns2:child2
      xmlns:ns1 = 'ns1'
      xmlns:ns2 = "ns2"
      a         = "a b [&lt;] [&apos;] [&#10;] [&#x00F4;]"
      ns1:b     = ' " '
      ns2:c     = " ' "
  />
  <text> a b [&lt;] [&apos;] [&#10;] [&#x00F4;] </text>
  <x/>
  <x2></x2>
  <x3 />
  <x4 ></x4      >
</parent      >

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<!DOCTYPE 
      parent 
      SYSTEM 'test.dtd'
[
<!--comment-->
<!--comment-->
<!ELEMENT A (B)*>
]>
<parent  >
  <!--comment-->  <!--comment-->
  <child 
      a     = "a b [&lt;] [&apos;] [&#10;] [&#x00F4;]"
      b     = ' " '
      c     = " ' "
  />
  <ns2:child2
      xmlns:ns1 = 'ns1'
      xmlns:ns2 = "ns2"
      a         = "a b [&lt;] [&apos;] [&#10;] [&#x00F4;]"
      ns1:b     = ' " '
      ns2:c     = " ' "
  />
  <text> a b [&lt;] [&apos;] [&#10;] [&#x00F4;] </text>
  <x/>
  <x2></x2>
  <x3 />
  <x4 ></x4      >
</parent      >

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE parent SYSTEM "test.dtd">

<parent>
  <!--comment-->
  <!--comment-->

  <child a="a b [&lt;] [&apos;] [
] [ô]" b=" &quot; " c=" &apos; " x="yes"/>
  <ns2:child2 xmlns:ns2="ns2" xmlns:ns1="ns1" a="a b [&lt;] [&apos;] [
] [ô]" ns1:b=" &quot; " ns2:c=" &apos; "></ns2:child2>
  <text>a b [</text>
  <x/>
  <x2/>
  <x3/>
  <x4/>
</parent>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE parent SYSTEM "test.dtd"><parent><!--comment--><!--comment--><child a="a b [&lt;] [&apos;] [
] [ô]" b=" &quot; " c=" &apos; " x="yes"/><ns2:child2 xmlns:ns2="ns2" xmlns:ns1="ns1" a="a b [&lt;] [&apos;] [
] [ô]" ns1:b=" &quot; " ns2:c=" &apos; "></ns2:child2><text> a b [&lt;] ['] [
] [ô] </text><x/><x2/><x3/><x4/></parent>
