Re: [RFC] Roundtripping namespaced xml documents for data.xml

Herwig Hochleitner Mon, 26 May 2014 17:06:08 -0700

2014-05-26 22:46 GMT+02:00 Paul Gearon <gea...@gmail.com>:

> Hi Herwig,
>
> First, I have to start with an apology,
>


Hi Paul,

It's alright. I have to admit that I'm relieved you sent that in
error.

> My point of view is that processing real-world XML rarely needs the fully
> resolved URIs.
>

Can we agree that an application doing "namespace aware" processing
actually _does_ care about the URI? Of course, as long as no additional
xmlns attrs are introduced, it might compare by prefix, but it's still the
target URI of the prefix that counts.
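
To make that concrete, here is a minimal sketch in Java using the StAX
parser. It is purely illustrative and not code from data.xml; the class
PrefixVsUri and its method names are invented for this example.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative sketch only; PrefixVsUri and its methods are invented names.
final class PrefixVsUri {

    // Parses just far enough to return the resolved name of the root element.
    static QName rootName(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                r.nextTag();        // move to the root START_ELEMENT
                return r.getName(); // namespace URI, local part and prefix
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Comparison on the serialization artifact (the prefix).
    static boolean samePrefix(String a, String b) {
        return rootName(a).getPrefix().equals(rootName(b).getPrefix());
    }

    // Comparison on the resolved name; QName.equals ignores the prefix.
    static boolean sameResolvedName(String a, String b) {
        return rootName(a).equals(rootName(b));
    }
}
```

Given <h:body xmlns:h="http://www.w3.org/1999/xhtml"/> and
<body xmlns="http://www.w3.org/1999/xhtml"/>, samePrefix reports a
mismatch while sameResolvedName reports a match, which is the behaviour a
namespace-aware application wants.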


>  Instead, most applications only care about the element names as they
> appear in the document. Also, keywords have the nice property of being
> interned, which matters when parsing 20GB XML files.
>

It's worthwhile to optimize for large files, but correctness considerations
must always come first. We don't want to encourage "mostly correct" xml
processing.

> It's possible to intern a QName implementation, but they will still use
> more memory per element.
>

I don't buy that argument. Have you looked at the Java types and done the
math? Also, if it's really significant, we can always use a custom deftype.

> The counter argument is that the standard requires support for handling
> fully resolved QNames, so these need to be dealt with. However, I don't
> believe that the use-cases for dealing with fully resolved data comes up as
> often. Also, when it does come up, my experience has been that it is
> usually in smaller documents (<1MB), where efficiency of comparison is not
> as paramount.
>
> The issue here may be more of dealing with the representation tier vs. the
> model tier. I will address that question more at the bottom of this email.
>

In my view, if we were to make data.xml namespace aware, it needs to
actually implement the standard. Users can always stay in representation
tier if they don't like the overhead that comes with resolving. And yes, I
think the distinction between representation and model tier is critical.

> OK, I agree. My difference has been that I don't think that the entire URI
> need exist in the element tag, but rather allow it to be built from the
> element tag + the namespace context. (That would be the
> representation-to-model tier mapping. I mention this, along with namespace
> contexts at the end)
>

As I hinted in my reply, I actually started off in the same direction: Keep
everything in representation tier and just give the user tools to resolve
the prefixes properly. You can actually follow the process in the dev
thread of how I got convinced that for default xml handling it's better to
add a model tier.

> This is why I have advocated attaching the namespace context as metadata.
>

My current implementation of representation tier actually has that. It's
still no substitute for working with fully resolved data. Just think of
what people might do to a parsed xml tree with clojure's core functions.

>> What about the QName {http://www.w3.org/1999/xhtml}body? Notice that
>> :http://www.w3.org/1999/xhtml/body would be read like
>> (keyword "http:" "/www.w3.org/1999/xhtml/body"). Another point that's
>> already been made on the dev thread.
>>
>
> Not sure what you're trying to get at with this example.
>

I might have misunderstood: I thought you were arguing for cgrand's
and chouser's original approach of putting the namespace URI into the
keyword namespace.
All I'm trying to say is that keywords are inappropriate for storing
_resolved QNames_. They are, however, appropriate for storing prefixed
names in representation tier.

> The syntax {http://www.w3.org/1999/xhtml}body is a universal name in
> Clark's notation, and it's used for describing a resolved QName.  As Clark
> points out, it's not valid to use a universal name in XML: you only use it
> in the data model.
>

Yes, I was talking about the data model and how to encode it in clojure
data structures. I never suggested using Clark's notation in actual
documents.


> In this case, the QName would presumably be either just "body" with the
> default namespace, or xhtml:body with an in-context namespace of
> xmlns:xhtml="http://www.w3.org/1999/xhtml".
>

I have to admit that I had gotten this part of the terminology wrong in my
mental model. I thought that QName always referred to a universal name, the
way Java's QName implementation does. I just learned that in the standard
"qualified name" refers to a possibly prefixed tag or attr name within the
serialization.
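
For reference, a small Java sketch of what javax.xml.namespace.QName
actually models. The helper class UniversalNames and its method names are
made up for illustration; only QName itself is a real API here.

```java
import javax.xml.namespace.QName;

// Illustrative sketch only; UniversalNames is an invented helper class.
final class UniversalNames {

    // Renders a name in Clark's notation. QName.toString() yields
    // "{namespaceURI}localPart" when the namespace URI is non-empty.
    static String toClark(String uri, String local, String prefix) {
        return new QName(uri, local, prefix).toString();
    }

    // Parses Clark's notation; the resulting QName carries no prefix.
    static QName fromClark(String clark) {
        return QName.valueOf(clark);
    }

    // True when a prefixed construction and a Clark string denote the same
    // universal name. QName.equals looks at URI and local part only.
    static boolean sameUniversalName(String clark, String uri,
                                     String local, String prefix) {
        return fromClark(clark).equals(new QName(uri, local, prefix));
    }
}
```

So a QName built with prefix "xhtml" and one parsed from
"{http://www.w3.org/1999/xhtml}body" compare equal: the Java type models
the universal name, and the prefix is carried along only as a hint for
serialization.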

> Consequently, the keyword to be constructed should look like either
> (keyword "body") or (keyword "xhtml" "body"). Somewhere nearby there will
> be metadata of {:xhtml "http://www.w3.org/1999/xhtml"}
>

OK, that's representation tier. We still need to recognize the fact that
the metadata just might not be there, even if we always generate it in the
parser.

> As an aside, I was curious, so I tried both full URIs and universal names
> in a couple of XML validators, and was surprised to see that they
> validated. I've no idea why. The W3C validators reject them (as they are
> supposed to).
>

Are you talking about Clark syntax in xml documents?

> I agree that the QName format is a serialization artifact. The data model
> is really all URIs.
>
> That said, I've used URIs as the data model, and they get in the way. They
> are not especially useful to work with, slower, and take more memory (those
> last two are because they aren't usually interned, they're validated, and
> they have numerous internal fields).
>

Java's universal name implementation (javax.xml.namespace.QName) seems to
store the URI as a string, so this concern doesn't apply.


>  My admittedly limited experience is that elements are almost never
> accessed as URIs, but by name.
>

Well, we want to make it easy to do correct xml processing, don't we? I
don't want to work with actual URIs in my code. What I want to do is write
down the URI as a constant and use it to compare against when navigating
the parsed xml.


> The fact that data.xml hasn't supported namespaces before now is an
> example of how little people use URIs from XML.
>

That's a logical fallacy. People are much more likely to dismiss an
implementation for a missing capability than to try to improve it, which
creates a massive selection bias. And still, here we are, hammering out
the design for an improvement.

> When processing large amounts of data, I'd much rather deal with a stream
> of keyword-based elements that provide enough context to build the URI than
> with the full URI for every element that I have to convert to keywords.
> I'll get to this when I talk about the tiers below.
>

OK, then representation tier is for you.

Actually, my plan is to allow omitting XmlNamespace context from
representation tier, so that you can do massive processing even more
efficiently.

>> Which is the reason we need to lift elements out of their context as soon
>> as possible. We don't want an element to change its namespace, just because
>> we transplant it into another xml fragment. Chouser went to great lengths
>> about this point, before he realized that this was exactly my goal as well.
>>
>
> Well, I *am* talking about attaching the context to the elements, so the
> data is there.
>

Again, the context can get lost when people rebuild elements, or create
completely new ones.
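
A small Java sketch of why a prefixed name is only meaningful together
with its context. Everything here is hypothetical (the class
PrefixResolution, the plain prefix-to-URI map standing in for the context
metadata, and the urn:example:... URIs); it is not how either
implementation stores things.

```java
import java.util.Map;
import javax.xml.namespace.QName;

// Illustrative sketch only; PrefixResolution is an invented helper and the
// map is a stand-in for whatever context metadata an element carries.
final class PrefixResolution {

    // Resolves an element name of the form "prefix:local" or "local"
    // against a prefix -> URI map. The "" key is the default namespace.
    // (Attribute names follow different defaulting rules; not covered here.)
    static QName resolve(String prefixedName, Map<String, String> context) {
        int colon = prefixedName.indexOf(':');
        String prefix = colon < 0 ? "" : prefixedName.substring(0, colon);
        String local = prefixedName.substring(colon + 1);
        String uri = context.get(prefix);
        if (uri == null) {
            if (prefix.isEmpty()) {
                return new QName(local);   // no default namespace in scope
            }
            // The context was lost or never attached: nothing to resolve with.
            throw new IllegalArgumentException("unbound prefix: " + prefix);
        }
        return new QName(uri, local, prefix);
    }
}
```

The same string "a:item" resolves to two different names under
{"a" "urn:example:one"} and {"a" "urn:example:two"}, and cannot be
resolved at all once the map is gone. A fully resolved name has neither
problem.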

>> My plan is to only store the metadata. The set of namespaces is implicitly
>> given by QNames contained within the fragment and early introduction of
>> necessary xmlns declarations can be achieved by diffing the metadata. See my
>> design document.
>>
>> Note: I'm talking about the new representation here. The current one will
>> continue to work unchanged.
>>
>
> Do you mean the "current data.xml" that doesn't support namespaces, or the
> "current one" meaning the code that you have in your github repo?
>

Neither. I mean the current implementation + a minimal roundtripping patch.
This will be called representation tier (modulo maintaining XmlNamespace
env)

> I don't like mutation myself. I did it for speed of implementation at the
> time (and laziness as it borrowed from a SAX parser). I'm not suggesting my
> code as a complete alternative (e.g. there are still missing parts), but to
> present a different approach.
>

> The reason for that approach was that it was the fastest way (for me) to
> create a stack that runs in parallel to the parser or emitter that contains
> the context that the parser/emitter is not showing you.
>
> Typically, I would pass the current value of the stack into the code that
> handles the current element, and the return value from the handler would be
> a tuple of both the returned data and the new stack value. I'll do that in
> the next day or so if it'll get it looked at.
>
> While I'm at it, I should break it up into namespaces based on
> functionality, as you have.
>

I suspect that our approaches are not that different. One of the first
possibilities I came up with, was also to hack it in with an atom. I
decided to strive for production quality right away, so I reworked the
parser to thread the namespace context down the stack by arguments.
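
As a rough sketch of what threading the context by arguments means, here
is the idea in Java on top of StAX. This is not the data.xml parser (which
is Clojure); the class ContextThreading and its methods are invented, and
the context is simplified to a prefix-to-URI map.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative sketch only; all names in this class are invented.
final class ContextThreading {

    // The reader must be positioned on a START_ELEMENT. The parent's context
    // arrives as an argument and is never mutated; each element extends its
    // own copy, so no shared mutable stack is needed.
    static void walk(XMLStreamReader r, Map<String, String> parentCtx,
                     List<String> out) throws XMLStreamException {
        Map<String, String> ctx = new HashMap<>(parentCtx);
        for (int i = 0; i < r.getNamespaceCount(); i++) {
            String prefix = r.getNamespacePrefix(i);   // null for xmlns="..."
            String uri = r.getNamespaceURI(i);
            ctx.put(prefix == null ? "" : prefix, uri == null ? "" : uri);
        }
        out.add(r.getLocalName() + " " + new TreeMap<>(ctx));
        while (r.next() != XMLStreamConstants.END_ELEMENT) {
            if (r.getEventType() == XMLStreamConstants.START_ELEMENT) {
                walk(r, ctx, out);   // children see ctx, siblings do not leak
            }
        }
    }

    // Returns one line per element: local name plus the context in scope.
    static List<String> describe(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(xml));
            List<String> out = new ArrayList<>();
            r.nextTag();
            walk(r, Collections.<String, String>emptyMap(), out);
            r.close();
            return out;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A declaration made on one child is visible to that child's descendants
only; a following sibling gets the parent's context again because it is
handed the parent's map, not a mutated one.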

> I guess I've been uncomfortable with the 2-tier model, which has influenced
> my discussion to date. While the model tier allows for elements like the
> deep-equals operation, I haven't run into any uses which would require it
> (there are probably others in the Clojure community who have). So my bias
> has been to treat that tier as a transformation built from the
> representational tier on an as-needed basis.
>

My use case is parsing and generating WebDAV and I'd much rather work with
model tier than just looking at tag names and assuming that the prefixes
are set up correctly. Yes, that might exclude clients that generate
invalid WebDAV. Yes, I think that's a feature.

Building up from representation tier on the fly can be done, but it won't
be as efficient as parsing to model tier directly, because of additional
allocation. This also applies to your model.

> Also, my initial reading of the proposal was that the representational tier
> was as a stepping point to getting to the real data to be returned, which
> was to be the model tier. Perhaps I was mistaken, and no such emphasis is
> implied.
>

I would say so. I actually invested a lot of energy to ensure that we
retain a proper representation tier. I think that's critical for those
poor-soul enterprise devs who actually need to generate invalid WebDAV in
order to talk to some broken system. Again, see the dev thread.

> One reason for believing that the representation tier is simply a
> transitional step is because it treats namespace declarations for elements
> as simple attributes.
>

My thinking was: If we have a representation tier, it should store what's
actually written in the document. I will reevaluate this based on my
mistake on QName terminology (thanks for pointing it out).


> They may be represented syntactically in this way, but the semantics are
> already established by the time this tier is called
>

Conceptually yes. After all, well-formedness requires the appropriate
xmlns: declarations to be present. However, that doesn't help when a user
lifts a fragment (or even just a tag name), expecting everything to work.

> (you're still expecting to use the Java StAX parser, right?).
>

Not necessarily. I expect to use it for parsing and emitting the
angle-braced serialization of xml we all know and love(?). Representation
tier is another target that will be served by lazy tree-walkers. The
purpose of creating those is that we can define new transformations
without going through StAX.

> Also, since I'm happy with keywords, if I don't want to go to the effort
> of handling everything in the model tier, then that will put a lot of work
> on me to deal with namespaces that I ought not to be worried about.
>

Don't worry, representation tier will also add context metadata. Also,
XmlNamespaceImpl would make the task of maintaining it yourself rather
trivial.

> The data model I was proposing (and is generated with my version of the
> parser... regardless of how poorly I implemented the stack) would map to
> the representational tier, providing all the semantics of the model tier
> without the full resolution. That is: keywords for tags; a namespace map
> for the declarations made on the element; a namespace map in metadata for
> the namespace context inherited from the parent element.
>

Yes, the more I read, the more I get the impression that you are
implementing a variant of representation tier.

I see how you would carry over the distinction between regular attributes
and xmlns attributes from StAX, however, I think that for representation
tier, it's appropriate to stick to the data format with the most
straightforward mapping to actual bytes, namely the way it's currently done
modulo correct roundtripping.

> I'm stuck on a project until I have a namespace-aware data.xml, so I might
> as well try it as well. In my case I'll need to fix up the implementation
> of the namespace stack, and write a transformation function that converts
> the tree to a fully resolved one, and back. You may still hate it, but at
> least it will show what I'm talking about. :)
>

Fun! That was my motivation as well (the WebDAV project). Also, the function
for converting to and from fully resolved sounds exactly like what I have
in mind with the lazy tree walkers.

Well, nothing like alternative implementations to encourage refining ideas.
I hope that we can join forces after we finish exploring ideas and before
we push for a new data.xml release.

Regarding your concerns about memory efficiency and interning:
The namespace URI should automatically get interned by StAX, since it
wouldn't make any sense to clone the string every time you call
getNamespaceURI.
Not interning tag names and QName instances will slightly increase GC
pressure, but it shouldn't add to memory usage significantly, since you
would most certainly process a 20GB document lazily, no? It might still
make sense to intern them ... we need benchmarks.

best regards

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en