I can elaborate on my view of option one: parsing directly into Neo4j with no in-memory representation (other than the Neo4j caches).
First, I avoid the word DOM here because it seems to imply an in-memory representation of the XML, which is unnecessary if the graph structure is going to match the original XML. I would use a SAX parser, which with the default configuration would build a tree structure in Neo4j exactly matching the XML. Each tag would become a node, all attributes would become properties, and all sub-tags would become nodes related to the enclosing node with a 'CHILD' relationship. Then, as Peter says, the graph database can be manipulated later.

However, it should not be too hard to create an extension framework allowing specific customization of the parser to achieve a few deviations from the 1-1 mapping of XML to graph:

- Filters for ignoring XML structure
- Mappers for converting attributes to child nodes, or contained tags to properties
- Rules for cross linking nodes, so the structure is no longer a tree but a closed graph

This last one is especially important for domain data that XML does not support nicely. For example, consider a dataset describing blogs. There might be a tree structure of authors like <authors><author><blog>... In addition there might be a separate tree of categories or tags: <tags><tag id="32">politics</tag></tags>. In XML the cross linking is achieved through sub-tags of the <blog>, as in <tag-ref id="32"/>. Our default SAX parser would simply create a node linked to the blog with the id as a property. It has no idea that a better approach is to make a relationship back to the original tag in the tags tree.

We can add that structure later, but I think it is best to add it during the parse (as long as tags are defined before blogs). To make this work, we do need an in-memory hash of the tag ids, but that is a small cost compared to the total XML data size, so we still get the low-memory, high-performance advantages of this approach.
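To make the idea concrete, here is a minimal sketch of the default mapping plus one cross-linking rule. It uses Python's standard xml.sax module, and a hypothetical Node class stands in for Neo4j nodes; a real import would create nodes and relationships in the database instead. The 'TAGGED' relationship name, the 'text' property for element text, and the <data> wrapper element are assumptions made for illustration only.

```python
import xml.sax


class Node:
    """Hypothetical stand-in for a Neo4j node: properties plus typed relationships."""

    def __init__(self, name, properties):
        self.name = name
        self.properties = dict(properties)
        self.relationships = []  # (relationship type, target node) pairs

    def relate(self, rel_type, target):
        self.relationships.append((rel_type, target))

    def related(self, rel_type):
        return [target for kind, target in self.relationships if kind == rel_type]


class GraphBuilder(xml.sax.ContentHandler):
    """Builds the graph while parsing; no XML tree is held in memory."""

    def __init__(self, id_elements=("tag",), ref_elements=None):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, innermost last
        self.ids = {}    # in-memory hash: id value -> node
        self.id_elements = set(id_elements)
        # Assumed rule format: referencing element name -> relationship type.
        self.ref_elements = ref_elements or {"tag-ref": "TAGGED"}

    def startElement(self, name, attrs):
        props = dict(attrs.items())
        parent = self.stack[-1] if self.stack else None
        target = self.ids.get(props.get("id")) if name in self.ref_elements else None
        if parent is not None and target is not None:
            # Cross-link rule: relate the parent to the node defined earlier
            # instead of creating a new child node for the reference.
            parent.relate(self.ref_elements[name], target)
            self.stack.append(parent)  # keeps endElement's pop balanced
            return
        # Default mapping: tag -> node, attributes -> properties, CHILD relationship.
        # Unresolved references (id not seen yet) also fall through to here.
        node = Node(name, props)
        if name in self.id_elements and "id" in props:
            self.ids[props["id"]] = node
        if parent is None:
            self.root = node
        else:
            parent.relate("CHILD", node)
        self.stack.append(node)

    def characters(self, content):
        # Whitespace-only chunks are ignored; other text is stripped and
        # accumulated in the assumed 'text' property of the open element.
        text = content.strip()
        if text and self.stack:
            node = self.stack[-1]
            node.properties["text"] = node.properties.get("text", "") + text

    def endElement(self, name):
        self.stack.pop()


def parse_to_graph(xml_text):
    handler = GraphBuilder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.root


# Tags are defined before blogs, so each reference can be resolved during the parse.
SAMPLE = """<data>
  <tags><tag id="32">politics</tag></tags>
  <authors><author><blog><tag-ref id="32"/></blog></author></authors>
</data>"""

root = parse_to_graph(SAMPLE)
```

After parsing SAMPLE, the blog node has no 'CHILD' node for the tag-ref; it has a 'TAGGED' relationship pointing at the existing tag node in the tags tree. The only extra memory used is the ids dictionary, which grows with the number of tags rather than with the size of the document.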
On Tue, Dec 1, 2009 at 6:13 PM, Peter Neubauer <peter.neuba...@neotechnology.com> wrote:

> Hi folks,
> I have come over the problem of importing data from XML files a lot
> lately. There are 2 approaches that emerged after discussions with
> Craig Taverner:
>
> 1. Write a generic utility: take the XML DOM tree and directly put it
> as-is into Neo4j, then later write code to connect the interesting
> nodes to your domain, maybe discard the DOM tree after that.
>
> 2. Write a specific domain parser: Filter out interesting info with
> e.g. XPath, then create only the relevant information as a graph in
> Neo4j.
>
> Now, 1) sounds like a generic import utility that would be very handy.
> OTOH, I wonder how common the problem is and what you guys would
> prefer, 1) or 2)?
>
> --
>
> Cheers,
>
> /peter neubauer
>
> COO and Sales, Neo Technology
>
> GTalk: neubauer.peter
> Skype peter.neubauer
> Phone +46 704 106975
> LinkedIn http://www.linkedin.com/in/neubauer
> Twitter http://twitter.com/peterneubauer
>
> http://www.neo4j.org - Relationships count.
> http://www.linkedprocess.org - Distributed computing on LinkedData scale
> _______________________________________________
> Neo mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user