I had to do something like this a while back, for some work on a UML model serialised as XMI. As suggested here, I imported the XML tree into neo4j and then produced successive domain abstractions of the XML representation, using a set of graph transformers, until I got to something that represented the content of the UML model. I was trying to retain the linkage between all the views, so I had to keep the various layers in the graph.
From the XML import point of view, I used a DOM walk with hasParent and hasPrevious relationships to represent the serialisation order. If you are going to round-trip the XML representation you need a strategy for handling text nodes gracefully. I had to introduce a special property recording the type of the XML node being represented. I also had a property on each neo4j graph node which held the name of the XML element. To avoid naming conflicts I had to introduce a system of namespacing neo4j property names, and in fact namespace representation in general was a bit of an issue.

Based on my experience of doing it this way, I have a couple of observations.

One is that to do the abstraction you end up wanting to implement the DOM specification on top of your graph representation, so you can access neo4j nodes with salient features directly. In particular, I had to develop a graph path query to support local path-based querying within the graph.

The other is that in most situations there was a set of XML I really wanted to remain collapsed within the graph, represented just as a text attribute on a graph node (e.g. HTML content inside an XML structure). So the ability to selectively collapse and expand nodes was useful. I was looking at the DocumentTraversal and NodeFilter interfaces in the DOM Level 2 Traversal spec and thought they might be the best way to do this.

I don't really understand it, but something I came across later in my travels was W3C's GRDDL, which appears to be a spec for binding XML transforms to documents to create RDF. There seem to be other, more generic ways of doing the XML to RDF conversion as well; the results could be loaded into a dense triple store in Neo4j. Maybe that is one simple approach which avoids re-implementing namespace support.

Craig mentions relationships by identity; there are also relationships by XPath to contend with.
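For illustration only, here is a minimal Python sketch of the kind of import described above. It is not the original code, and it does not use the Neo4j API: the `Graph` class is a hypothetical in-memory stand-in, and the property prefixes (`xml:`, `attr:`), the `collapse` parameter and the choice to drop whitespace-only text nodes are all assumptions made for the example. Only the hasParent and hasPrevious relationship names come from the description.

```python
from xml.dom import minidom, Node


class Graph:
    """Hypothetical in-memory stand-in for a graph store (not the Neo4j API)."""

    def __init__(self):
        self.nodes = []  # each node is a dict of properties; its id is its index
        self.rels = []   # (from_id, relationship_type, to_id) triples

    def add_node(self, props):
        self.nodes.append(props)
        return len(self.nodes) - 1

    def relate(self, src, rel_type, dst):
        self.rels.append((src, rel_type, dst))


def import_dom(dom_node, graph, parent_id=None, collapse=()):
    """Walk the DOM depth-first, mirroring element and text nodes in the graph."""
    if dom_node.nodeType == Node.TEXT_NODE:
        if not dom_node.data.strip():
            return None  # assumption: whitespace-only text is dropped
        node_id = graph.add_node({"xml:type": "text", "xml:value": dom_node.data})
    elif dom_node.nodeType == Node.ELEMENT_NODE:
        props = {"xml:type": "element", "xml:name": dom_node.tagName}
        # Prefix attribute names so they cannot clash with bookkeeping properties.
        for name, value in dom_node.attributes.items():
            props["attr:" + name] = value
        if dom_node.tagName in collapse:
            # Keep the subtree as serialised text instead of expanding it.
            props["xml:collapsed"] = "".join(c.toxml() for c in dom_node.childNodes)
        node_id = graph.add_node(props)
    else:
        return None  # comments, processing instructions etc. are ignored here

    if parent_id is not None:
        graph.relate(node_id, "hasParent", parent_id)

    if dom_node.nodeType == Node.ELEMENT_NODE and dom_node.tagName not in collapse:
        previous = None
        for child in dom_node.childNodes:
            child_id = import_dom(child, graph, node_id, collapse)
            if child_id is None:
                continue
            if previous is not None:
                # Sibling order is what lets the serialisation be reproduced.
                graph.relate(child_id, "hasPrevious", previous)
            previous = child_id
    return node_id


doc = minidom.parseString(
    '<post id="1"><title>Hi</title><body><p>Some <b>html</b></p></body></post>'
)
graph = Graph()
root_id = import_dom(doc.documentElement, graph, collapse={"body"})
```

In this example the `body` element is kept as a single graph node whose `xml:collapsed` property holds its inner markup, while `title` is expanded into an element node plus a text node. A real round trip would also have to preserve whitespace-only text, comments and namespaces, which this sketch leaves out.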
I did lookup by identity using an index service in the graph as a post-processing step, but it is more or less the same approach. I didn't implement XPath references.

I'm happy to dig out the code if anyone wants a look, but be warned it's a bit of a mess.

Rob.

On Tue, Dec 1, 2009 at 6:37 PM, Craig Taverner <cr...@amanzi.com> wrote:

> I can elaborate on my view of option one, parsing directly into neo4j with
> no in-memory representation (other than neo4j caches).
>
> Firstly I avoid the use of the word DOM here because that seems to imply
> the use of an in-memory representation of the XML, which is unnecessary if
> the graph structure is going to match the original XML.
>
> I would use a SAX parser, which with the default configuration would build
> a tree structure in Neo4j exactly matching the XML. Each tag would be a
> node, all attributes become properties and all sub-tags become nodes
> related to the encapsulating node with a 'CHILD' relationship. Then, as
> Peter says, the graph database can be manipulated later.
>
> However, it should not be too hard to create an extension framework
> allowing specific customization of the parser to achieve a few deviations
> from the 1-1 mapping of XML to graph:
>
> - Filters for ignoring XML structure
> - Mappers for converting attributes to child nodes, or contained tags to
>   properties
> - Rules for cross linking nodes, so the structure is no longer a tree but
>   a closed graph.
>
> This last one is especially important for domain data that XML does not
> support nicely. For example, consider the case of a dataset describing
> blogs. There might be a tree structure of authors like:
> <authors><author><blog>... In addition there might be a separate tree of
> categories or tags: <tags><tag id="32">politics</tag></tags>. In XML the
> cross linking is achieved through sub-tags to the <blog>, as in
> <tag-ref id="32"/>. Our default SAX parser would simply create a node
> linked to the blog with the id as an attribute.
> It has no idea that a better approach is to make a relationship back to
> the original tag in the tags tree.
>
> We can add that structure later, but I think it best to add it during the
> parse (as long as tags are defined before blogs). To make this work, we do
> need an in-memory hash of the tag ids, but that is not a big cost compared
> to the total XML data size, so we still get the low-memory and
> high-performance advantages of this approach.
>
> On Tue, Dec 1, 2009 at 6:13 PM, Peter Neubauer
> <peter.neuba...@neotechnology.com> wrote:
>
> > Hi folks,
> > I have come across the problem of importing data from XML files a lot
> > lately. There are 2 approaches that emerged after discussions with
> > Craig Taverner:
> >
> > 1. Write a generic utility: take the XML DOM tree and directly put it
> > as-is into Neo4j, then later write code to connect the interesting
> > nodes to your domain, maybe discard the DOM tree after that.
> >
> > 2. Write a specific domain parser: Filter out interesting info with
> > e.g. XPath, then create only the relevant information as a graph in
> > Neo4j.
> >
> > Now, 1) sounds like a generic import utility that would be very handy.
> > OTOH, I wonder how common the problem is and what you guys would
> > prefer, 1) or 2)?
> >
> > --
> >
> > Cheers,
> >
> > /peter neubauer
> >
> > COO and Sales, Neo Technology
> >
> > GTalk: neubauer.peter
> > Skype peter.neubauer
> > Phone +46 704 106975
> > LinkedIn http://www.linkedin.com/in/neubauer
> > Twitter http://twitter.com/peterneubauer
> >
> > http://www.neo4j.org - Relationships count.
> > http://www.linkedprocess.org - Distributed computing on LinkedData scale
> > _______________________________________________
> > Neo mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user