I had to do something like this a while back, for some work on a UML model
serialised as XMI. As suggested here, I imported the XML tree into neo4j and
then produced successive domain abstractions of the XML representation, using
a set of graph transformers, until I got to something that represented the
content of the UML model. I was trying to retain the linkage between all the
views, so I had to keep the various layers in the graph.

From the XML import point of view, I used a DOM walk with hasParent and
hasPrevious relationships to represent the serialisation order. If you are
going to round trip the XML representation you need a strategy for handling
text nodes gracefully. I had to introduce a special property recording the
type of the XML node being represented. I also had a property on each neo4j
graph node which held the name of the XML element. To avoid naming conflicts
I had to introduce a system of namespacing neo4j property names, and in fact
namespace representation in general was a bit of an issue.
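
To make the import step concrete, here is a minimal sketch in Python of a DOM
walk of this kind. It is not my original code: the Graph class is a tiny
in-memory stand-in rather than the neo4j API, and the "xml:" and "attr:"
property prefixes are made-up names that only illustrate the namespacing idea.

```python
from xml.dom import minidom, Node


class Graph:
    """Tiny in-memory stand-in for a graph store (hypothetical, not neo4j)."""

    def __init__(self):
        self.nodes = []  # list of property dicts; the list index is the node id
        self.rels = []   # (from_id, relationship_type, to_id) triples

    def add_node(self, props):
        self.nodes.append(props)
        return len(self.nodes) - 1

    def relate(self, src, rel_type, dst):
        self.rels.append((src, rel_type, dst))


def import_dom(dom_node, graph, parent_id=None, previous_id=None):
    """Walk the DOM depth first, recording parent and sibling order."""
    if dom_node.nodeType == Node.ELEMENT_NODE:
        # "xml:" properties describe the XML node itself; "attr:" properties
        # hold its attributes, so the two can never clash.
        props = {"xml:type": "element", "xml:name": dom_node.tagName}
        for i in range(dom_node.attributes.length):
            attr = dom_node.attributes.item(i)
            props["attr:" + attr.name] = attr.value
    elif dom_node.nodeType == Node.TEXT_NODE:
        # Text nodes become graph nodes too, which keeps mixed content intact.
        props = {"xml:type": "text", "xml:value": dom_node.data}
    else:
        return None  # comments, processing instructions etc. are skipped here

    node_id = graph.add_node(props)
    if parent_id is not None:
        graph.relate(node_id, "hasParent", parent_id)
    if previous_id is not None:
        graph.relate(node_id, "hasPrevious", previous_id)

    prev_child = None
    for child in dom_node.childNodes:
        child_id = import_dom(child, graph, node_id, prev_child)
        if child_id is not None:
            prev_child = child_id
    return node_id


doc = minidom.parseString('<blog id="1"><title>Hi</title> and <b>more</b></blog>')
graph = Graph()
root_id = import_dom(doc.documentElement, graph)
```

Following hasPrevious back from the last child of a node gives the original
child order, which is what a round trip needs.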

Based on my experience of doing it this way I have a couple of observations.
One is that to do the abstraction you end up wanting to implement the DOM
specification on top of your graph representation, so you can access neo4j
nodes with salient features directly. In particular, I had to develop my own
graph path query to do local path-based querying within the graph. The other
is that in most situations there was some XML I really wanted to remain
collapsed within the graph, represented just as a text attribute on a graph
node (e.g. HTML content in an XML structure). So the ability to selectively
collapse and expand nodes was useful. I was looking at DocumentTraversal and
NodeFilter in the DOM Level 2 Traversal spec and thought they might have been
the best way to do this.
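
As a rough illustration of the collapse/expand idea, here is a Python sketch.
It uses nested dicts rather than a real graph, and the COLLAPSE set and the
"content" key are assumptions made up for the example; it does not use the
DOM 2 traversal interfaces.

```python
from xml.dom import minidom, Node

COLLAPSE = {"description"}  # hypothetical element names to keep as raw markup


def to_graph_node(element, collapse=COLLAPSE):
    """Return a nested dict for `element`. Collapsed elements keep their
    inner markup as a single string instead of child nodes."""
    node = {"name": element.tagName, "children": []}
    if element.tagName in collapse:
        node["content"] = "".join(c.toxml() for c in element.childNodes)
        return node
    for child in element.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            node["children"].append(to_graph_node(child, collapse))
        elif child.nodeType == Node.TEXT_NODE and child.data.strip():
            node["children"].append({"name": "#text", "content": child.data})
    return node


def expand(node):
    """Re-parse a collapsed node's stored markup into child nodes on demand."""
    wrapped = minidom.parseString("<wrapper>" + node["content"] + "</wrapper>")
    return to_graph_node(wrapped.documentElement, collapse=set())["children"]


doc = minidom.parseString(
    "<item><title>Post</title>"
    "<description><p>Some <b>html</b> here</p></description></item>"
)
tree = to_graph_node(doc.documentElement)
description = tree["children"][1]   # stored collapsed, as one text property
expanded = expand(description)      # expanded only when somebody asks
```

The stored string stays valid markup, so a round trip can write it straight
back out without ever expanding it.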

I don't really understand it, but something I came across later in my travels
was the W3C's GRDDL, which appears to be a spec for binding XML transforms
that produce RDF. There seem to be other, more generic ways of doing the XML
to RDF conversion too; the results could be loaded into a dense triple store
in Neo4j. Maybe that is one simple approach which avoids re-implementing
namespace support.

Craig mentions relationships by identity; there are also relationships by
XPath to contend with. I did lookup by identity using an IndexService in the
graph as a postprocessing step, but it is more or less the same approach. I
didn't implement XPath references. I'm happy to dig out the code if anyone
wants a look, but be warned it's a bit of a mess.
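
In the meantime, here is a small Python sketch of lookup by identity done as a
postprocessing pass, using Craig's blog and tag example. A plain dict stands
in for the index, and the sample data and the "TAGGED" relationship name are
invented for illustration.

```python
import xml.etree.ElementTree as ET

XML = """
<data>
  <tags><tag id="32">politics</tag><tag id="33">sport</tag></tags>
  <authors><author name="rob">
    <blog title="First"><tag-ref id="32"/></blog>
  </author></authors>
</data>
"""

root = ET.fromstring(XML)

# Pass 1: index every element that declares an identity.
index = {tag.get("id"): tag for tag in root.iter("tag")}

# Pass 2: resolve each reference into an explicit relationship.
relationships = []  # (blog title, relationship type, tag text)
for blog in root.iter("blog"):
    for ref in blog.findall("tag-ref"):
        target = index.get(ref.get("id"))
        if target is not None:  # unresolved references are left alone
            relationships.append((blog.get("title"), "TAGGED", target.text))
```

Because the index is built before any reference is resolved, this works
regardless of whether tags appear before or after blogs in the file.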

Rob.

On Tue, Dec 1, 2009 at 6:37 PM, Craig Taverner <cr...@amanzi.com> wrote:

> I can elaborate on my view of option one, parsing directly into neo4j with
> no in-memory representation (other than neo4j caches).
>
> Firstly I avoid the use of the word DOM here because that seems to imply the
> use of an in-memory representation of the XML, which is unnecessary if the
> graph structure is going to match the original XML.
>
> I would use a SAX parser, which with the default configuration would build a
> tree structure in Neo4j exactly matching the XML. Each tag would be a node,
> all attributes become properties and all sub-tags become nodes related to
> the encapsulating node with a 'CHILD' relationship. Then, as Peter says, the
> graph database can be manipulated later.
>
> However, it should not be too hard to create an extension framework allowing
> specific customization of the parser to achieve a few deviations from the
> 1-1 mapping of XML to graph:
>
>   - Filters for ignoring XML structure
>   - Mappers for converting attributes to child nodes, or contained tags to
>   properties
>   - Rules for cross linking nodes, so the structure is no longer a tree but
>   a closed graph.
>
> This last one is especially important for domain data that XML does not
> support nicely. For example, consider the case of a dataset describing
> blogs. There might be a tree structure of authors like:
> <authors><author><blog>... In addition there might be a separate tree of
> categories or tags: <tags><tag id="32">politics</tag></tags>. In XML the
> cross linking is achieved through sub-tags to the <blog>, as in <tag-ref
> id="32"/>. Our default SAX parser would simply create a node linked to the
> blog with the id as an attribute. It has no idea that a better approach is
> to make a relationship back to the original tag in the tags tree.
>
> We can add that structure later, but I think it best to add it during the
> parse (as long as tags are defined before blogs). To make this work, we do
> need an in memory hash of the tag ids, but that is not a big cost compared
> to the total XML data size, so we still get the low memory, and high
> performance advantages of this approach.
>
> On Tue, Dec 1, 2009 at 6:13 PM, Peter Neubauer <
> peter.neuba...@neotechnology.com> wrote:
>
> >  Hi folks,
> > I have come across the problem of importing data from XML files a lot
> > lately. There are 2 approaches that emerged after discussions with
> > Craig Taverner:
> >
> > 1. Write a generic utility: take the XML DOM tree and directly put it
> > as-is into Neo4j, then later write code to connect the interesting
> > nodes to your domain, maybe discard the DOM tree after that.
> >
> > 2. Write a specific domain parser: Filter out interesting info with
> > e.g. XPath, then create only the relevant information as a graph in
> > Neo4j.
> >
> > Now, 1) sounds like a generic import utility that would be very handy.
> > OTOH, I wonder how common the problem is and what you guys would
> > prefer, 1) or 2)?
> >
> > --
> >
> > Cheers,
> >
> > /peter neubauer
> >
> > COO and Sales, Neo Technology
> >
> > GTalk:      neubauer.peter
> > Skype       peter.neubauer
> > Phone       +46 704 106975
> > LinkedIn   http://www.linkedin.com/in/neubauer
> > Twitter      http://twitter.com/peterneubauer
> >
> > http://www.neo4j.org                - Relationships count.
> > http://www.linkedprocess.org   - Distributed computing on LinkedData scale
> > _______________________________________________
> > Neo mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
> >
