I can elaborate on my view of option one: parsing directly into Neo4j with
no in-memory representation (other than the Neo4j caches).

First, I avoid the word DOM here because it implies an in-memory
representation of the XML, which is unnecessary if the graph structure is
going to match the original XML.

I would use a SAX parser which, in its default configuration, would build a
tree structure in Neo4j exactly matching the XML. Each tag would become a
node, each attribute a property of that node, and each sub-tag a node
related to the enclosing node by a 'CHILD' relationship. Then, as Peter
says, the graph database can be manipulated later.
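
A minimal sketch of that default mapping, written in Python with xml.sax and
plain lists standing in for Neo4j nodes and relationships. The class name,
the '_tag' property and the integer node ids are assumptions made for
illustration, not part of any existing importer, and element text is ignored
here for brevity.

```python
import xml.sax


class GraphBuilder(xml.sax.ContentHandler):
    """Mirror the XML tree as nodes and 'CHILD' relationships."""

    def __init__(self):
        super().__init__()
        self.nodes = []          # stand-in for Neo4j nodes: one property dict each
        self.relationships = []  # (parent_id, 'CHILD', child_id) triples
        self._stack = []         # node ids of the currently open elements

    def startElement(self, name, attrs):
        node_id = len(self.nodes)
        props = dict(attrs.items())   # every XML attribute becomes a property
        props["_tag"] = name          # hypothetical property holding the tag name
        self.nodes.append(props)
        if self._stack:
            # Relate the new node to the enclosing element's node.
            self.relationships.append((self._stack[-1], "CHILD", node_id))
        self._stack.append(node_id)

    def endElement(self, name):
        self._stack.pop()


def parse_to_graph(xml_text):
    """Stream the document through SAX; no DOM is ever built."""
    handler = GraphBuilder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler
```

Only the stack of open elements is held in memory during the parse, so the
footprint depends on nesting depth rather than on document size. In a real
importer the two lists would be replaced by calls that create nodes and
relationships in the database.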

However, it should not be too hard to create an extension framework allowing
specific customization of the parser to achieve a few deviations from the
1-1 mapping of XML to graph:

   - Filters for ignoring XML structure
   - Mappers for converting attributes to child nodes, or contained tags to
   properties
   - Rules for cross linking nodes, so the structure is no longer a tree but
   a closed graph.
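
One possible shape for the first two kinds of extension, again with plain
Python lists standing in for Neo4j. The 'ignore' and 'as_property'
parameters are hypothetical names for a filter and a mapper; the mapper
shown only handles leaf elements that contain text.

```python
import xml.sax


class ConfigurableBuilder(xml.sax.ContentHandler):
    """Default tree mapping plus a filter and a tag-to-property mapper."""

    def __init__(self, ignore=(), as_property=()):
        super().__init__()
        self.ignore = set(ignore)            # tags whose whole subtree is dropped
        self.as_property = set(as_property)  # leaf tags folded into the parent node
        self.nodes = []
        self.relationships = []
        self._stack = []       # node ids of open elements that became nodes
        self._skip_depth = 0   # > 0 while inside an ignored subtree
        self._text = None      # collects text while inside an as_property tag

    def startElement(self, name, attrs):
        if self._skip_depth or name in self.ignore:
            self._skip_depth += 1
            return
        if name in self.as_property and self._stack:
            self._text = []    # start collecting text instead of creating a node
            return
        node_id = len(self.nodes)
        self.nodes.append({"_tag": name, **dict(attrs.items())})
        if self._stack:
            self.relationships.append((self._stack[-1], "CHILD", node_id))
        self._stack.append(node_id)

    def characters(self, content):
        if self._text is not None and not self._skip_depth:
            self._text.append(content)

    def endElement(self, name):
        if self._skip_depth:
            self._skip_depth -= 1
            return
        if self._text is not None and name in self.as_property:
            # Store the collected text as a property on the enclosing node.
            self.nodes[self._stack[-1]][name] = "".join(self._text).strip()
            self._text = None
            return
        self._stack.pop()
```

For example, ConfigurableBuilder(ignore={"meta"}, as_property={"title"})
would drop every <meta> subtree and turn <title>Hello</title> inside a
<blog> into a 'title' property on the blog node.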

This last one is especially important for domain data that XML does not
support nicely. For example, consider the case of a dataset describing
blogs. There might be a tree structure of authors like:
<authors><author><blog>... In addition there might be a separate tree of
categories or tags: <tags><tag id="32">politics</tag></tags>. In XML the
cross linking is achieved through sub-tags of the <blog>, as in <tag-ref
id="32"/>. Our default SAX parser would simply create a node linked to the
blog, with the id as a property. It has no idea that a better approach is
to make a relationship back to the original tag in the tags tree.

We can add that structure later, but I think it best to add it during the
parse (as long as tags are defined before blogs). To make this work, we do
need an in-memory hash of the tag ids, but that is a small cost compared
to the total XML data size, so we still get the low-memory and
high-performance advantages of this approach.
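
A rough sketch of that cross-linking rule for the blog example, with plain
Python lists standing in for Neo4j. The 'TAGGED' relationship name and the
class name are assumptions for illustration. A <tag-ref> whose id has not
been seen yet falls back to the default behaviour and becomes an ordinary
child node.

```python
import xml.sax


class CrossLinkingBuilder(xml.sax.ContentHandler):
    """Default tree mapping, except <tag-ref> becomes a link to its <tag>."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self.relationships = []
        self._stack = []       # node ids of the currently open elements
        self._tag_nodes = {}   # in-memory hash: XML id of a <tag> -> node id

    def startElement(self, name, attrs):
        if name == "tag-ref" and self._stack:
            target = self._tag_nodes.get(attrs.get("id"))
            if target is not None:
                # Link the enclosing node straight to the original tag node.
                self.relationships.append((self._stack[-1], "TAGGED", target))
                # Re-push the parent so the matching endElement stays balanced.
                self._stack.append(self._stack[-1])
                return
        node_id = len(self.nodes)
        self.nodes.append({"_tag": name, **dict(attrs.items())})
        if self._stack:
            self.relationships.append((self._stack[-1], "CHILD", node_id))
        if name == "tag" and "id" in attrs:
            self._tag_nodes[attrs["id"]] = node_id   # remember for later refs
        self._stack.append(node_id)

    def endElement(self, name):
        self._stack.pop()
```

The hash holds one entry per <tag>, so its size grows with the number of
tags rather than with the number of blogs that reference them.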

On Tue, Dec 1, 2009 at 6:13 PM, Peter Neubauer <
peter.neuba...@neotechnology.com> wrote:

> Hi folks,
> I have come across the problem of importing data from XML files a lot
> lately. There are 2 approaches that emerged after discussions with
> Craig Taverner:
>
> 1. Write a generic utility: take the XML DOM tree and directly put it
> as-is into Neo4j, then later write code to connect the interesting
> nodes to your domain, maybe discard the DOM tree after that.
>
> 2. Write a specific domain parser: Filter out interesting info with
> e.g. XPath, then create only the relevant information as a graph in
> Neo4j.
>
> Now, 1) sounds like a generic import utility that would be very handy.
> OTOH, I wonder how common the problem is and what you guys would
> prefer, 1) or 2)?
>
> --
>
> Cheers,
>
> /peter neubauer
>
> COO and Sales, Neo Technology
>
> GTalk:      neubauer.peter
> Skype       peter.neubauer
> Phone       +46 704 106975
> LinkedIn   http://www.linkedin.com/in/neubauer
> Twitter      http://twitter.com/peterneubauer
>
> http://www.neo4j.org                - Relationships count.
> http://www.linkedprocess.org   - Distributed computing on LinkedData scale
> _______________________________________________
> Neo mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>