Re: Graph database support w/ NiFi

Mike Thomsen Fri, 05 Oct 2018 15:16:13 -0700

Uwe and Matt,

Now that we're dipping our toes into Neo4J and Cypher, any thoughts on this?


https://github.com/opencypher/cypher-for-gremlin

I'm wondering if we shouldn't work with mans2singh to take the Neo4J work
and push it further into having a client API that can let us inject a
service that uses that or one that uses Neo4J's drivers.
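
To make the idea concrete, here is a rough sketch of what such an injectable client API could look like. It is only an illustration: GraphClient, RecordingClient and GraphQueryProcessor are hypothetical names, not existing NiFi classes, and the stand-in backend merely records queries instead of talking to Neo4J's drivers or cypher-for-gremlin.

```python
from typing import Any, Dict, List, Protocol, Tuple


class GraphClient(Protocol):
    """Hypothetical client API: run a query, get rows back as dictionaries."""

    def execute(self, query: str, parameters: Dict[str, Any]) -> List[Dict[str, Any]]:
        ...


class RecordingClient:
    """Stand-in backend; a real one would wrap a driver or a Gremlin translator."""

    def __init__(self, backend: str) -> None:
        self.backend = backend
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def execute(self, query: str, parameters: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Remember the call and echo it back so the caller can inspect it.
        self.calls.append((query, dict(parameters)))
        return [{"backend": self.backend, "query": query}]


class GraphQueryProcessor:
    """Hypothetical processor that depends only on the GraphClient interface."""

    def __init__(self, client: GraphClient) -> None:
        self.client = client

    def on_trigger(self, query: str, attributes: Dict[str, Any]) -> List[Dict[str, Any]]:
        return self.client.execute(query, attributes)


if __name__ == "__main__":
    cypher = "MATCH (t:Token {name: $name}) RETURN t"
    # The same processor code runs against either injected service.
    for backend in ("neo4j-driver", "cypher-for-gremlin"):
        processor = GraphQueryProcessor(RecordingClient(backend))
        print(processor.on_trigger(cypher, {"name": "Mary"}))
```

The point of the sketch is that the processor never names a concrete backend, so swapping the injected service would not require touching processor code.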

Mike

On Mon, May 14, 2018 at 7:13 AM Otto Fowler <ottobackwa...@gmail.com> wrote:

> The wiki discussion should list these and other points of concern and
> should document the extent to which
> they are to be addressed.
>
>
> On May 12, 2018 at 12:37:59, u...@moosheimer.com (u...@moosheimer.com)
> wrote:
>
> Matt,
>
> You have some interesting ideas that I really like.
> GraphReaders and GraphWriters would be interesting. When I started
> writing a graph processor with my idea, the concept was not yet
> implemented in NiFi.
> I don't find GraphML and GraphSON all that exciting, because they
> contain e.g. the vertex/edge IDs and, to my knowledge, serve as
> import and export formats (correct me if I'm wrong).
>
> A ConvertRecordToGraph processor is a good approach; the only question
> is which formats we can convert from.
>
> I also think that, to make a graph processor reasonably general, we
> would have to provide a query as input that returns the vertex from
> which the graph should be extended.
> Maybe, as you suggest, a Gremlin query or a small Gremlin script.
>
> If a vertex is found, a new edge and a new vertex are added.
> That raises the question of how we pass the individual attributes, as
> well as the labels, to the edge and vertex. Possibly with NiFi
> attributes?
>
> The complexity gives me some headaches.
> A small example:
> Imagine we have a set from a CSV file.
> The columns are the set ID followed by Token1, Token2, Token3, ...
> ID, Token1, Token2, Token3, Token4, Token5
> 123, Mary, had, a, little, lamp
>
> I want to create a vertex with ID 123 (if it does not exist yet). Then,
> for each token, I want to check whether a vertex exists in the graph
> database (search for a vertex with label "Token" and attribute
> "name"="Mary"). If the vertex does not exist, it has to be created.
> Since I want to store e.g. Wikipedia in my graph, I want to avoid the
> supernode problem for the token vertices, so I create a few distribution
> vertices for each token vertex. If there is a vertex for Token1 (Mary),
> I don't want to draw the edge from this vertex to my vertex with the
> ID 123, but from one of its distribution vertices.
> If the vertex for the token does not exist, the distribution vertices
> also have to be created ... and so on...
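
The token example above can be sketched with a toy in-memory graph. This is only an illustration under stated assumptions: TinyGraph, ingest_row, the edge labels, the fan-out of 3 and the CRC-based choice of distribution vertex are all made up here, and a real flow would issue the equivalent lookups and inserts against a graph database.

```python
import zlib
from typing import List, Set, Tuple

FANOUT = 3  # assumed number of distribution vertices per token

VertexKey = Tuple[str, str]  # (label, name)


class TinyGraph:
    """Toy in-memory graph; vertices and edges are kept in sets."""

    def __init__(self) -> None:
        self.vertices: Set[VertexKey] = set()
        self.edges: Set[Tuple[VertexKey, str, VertexKey]] = set()

    def get_or_create(self, label: str, name: str) -> Tuple[VertexKey, bool]:
        """Return the vertex key and whether it had to be created."""
        key = (label, name)
        created = key not in self.vertices
        self.vertices.add(key)
        return key, created

    def add_edge(self, source: VertexKey, label: str, target: VertexKey) -> None:
        self.edges.add((source, label, target))


def ingest_row(graph: TinyGraph, row: List[str]) -> None:
    """row[0] is the set ID, the remaining fields are tokens."""
    set_vertex, _ = graph.get_or_create("Set", row[0])
    for token in row[1:]:
        token_vertex, created = graph.get_or_create("Token", token)
        if created:
            # A new token vertex gets its distribution vertices right away.
            for i in range(FANOUT):
                dist, _ = graph.get_or_create("Distribution", f"{token}#{i}")
                graph.add_edge(token_vertex, "distributes", dist)
        # Pick one distribution vertex deterministically from the set ID,
        # so the token vertex itself never accumulates edges to every set.
        slot = zlib.crc32(row[0].encode("utf-8")) % FANOUT
        graph.add_edge(("Distribution", f"{token}#{slot}"), "occurs_in", set_vertex)


if __name__ == "__main__":
    g = TinyGraph()
    ingest_row(g, ["123", "Mary", "had", "a", "little", "lamp"])
    print(len(g.vertices), len(g.edges))
```

Even this toy version needs a lookup, a conditional create and a fan-out rule per token, which hints at how much configuration a universal processor would have to expose.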
>
> Even with this very simple example, a universal processor already seems
> difficult.
>
> In any case I think the idea to implement a graph processor in NiFi is a
> good one.
> The more we work on it, the more good ideas we will get, and maybe it is
> just me who can't see the forest for the trees.
>
> One question about Titan. To my knowledge, Titan has been dead for a
> year and a half and Janusgraph is the successor?
> Titan has unofficially become Datastax Enterprise Graph?!
> Supporting Titan could become difficult because, to my knowledge, Titan
> does not support anything after TinkerPop 3 and is no longer maintained.
>
> I like your idea of a wiki page for more ideas. Otherwise it is easy to
> get lost in the many mails.
>
> Regards,
> Kay-Uwe
>
> On 12.05.2018 at 16:52, Matt Burgess wrote:
> > All,
> >
> > As Joe implied, I'm very happy that we are discussing graph tech in
> > relation to NiFi! NiFi and Graph theory/tech/analytics are passions of
> > mine. Mike, the examples you list are great, I would add Titan (and
> > its fork Janusgraph as Kay-Uwe mentioned) and Azure CosmosDB (these
> > and others are at [1]). I think there are at least four aspects to
> > this:
> >
> > 1) Graph query/traversal: This deals with getting data out of a graph
> > database and into flow file(s) for further processing. Here I agree
> > with Kay-Uwe that we should consider Apache Tinkerpop as the main
> > library for graph query/traversal, for a few reasons. The first as
> > Kay-Uwe said is that there are many adapters for Tinkerpop (TP) to
> > connect to various databases, from Mike's list I believe ArangoDB is
> > the only one that does not yet have a TP adapter. The second is
> > informed by the first, TP is a standard interface and graph traversal
> > engine with a common DSL in Gremlin. A third is that Gremlin is a
> > Groovy-based DSL, and Groovy syntax is fairly close to Java 8+ syntax
> > and you can call Groovy/Gremlin from Java and vice versa. A fourth is
> > that Tinkerpop is an Apache TLP with a very active and vibrant
> > community, so we will be able to reap the benefits of all the graph
> > goodness they develop moving forward. I think a QueryGraph processor
> > could be appropriate, perhaps with a GraphDBConnectionPool controller
> > service or something of the like. Apache DBCP can't do the pooling for
> > us, but we could implement something similar to that for pooling TP
> > connections.
> >
> > 2) Graph ingest: This one IMO is the long pole in the tent. Gremlin is
> > a graph traversal language, and although its API has addVertex() and
> > addEdge() methods and such, it seems like an inefficient solution,
> > akin to using individual INSERTs in an RDBMS rather than a
> > PreparedStatement or a bulk load. Keeping the analogy, bulk loading in
> > RDBMSs is usually specific to that DB, and the same goes for graphs.
> > The Titan-based ones have Titan-Hadoop (formerly Faunus), Neo4j has
> > external tools (not sure if there's a Java API or not) and Cypher,
> > OrientDB has an ETL pipeline system, etc. If we have a standard Graph
> > concept, we could have controller services / writers that are
> > system-specific (see aspect #4).
> >
> > 3) Arbitrary data -> Graph: Converting non-graph data into a graph
> > almost always takes domain knowledge, which NiFi itself won't have and
> > will thus have to be provided by the user. We'd need to make it as
> > simple as possible but also as powerful and flexible as possible in
> > order to get the most value. We can investigate how each of the
> > systems in aspect #2 approaches this, and perhaps come up with a good
> > user experience around it.
> >
> > 4) Organization and implementation: I think we should make sure to
> > keep the capabilities very loosely coupled in terms of which
> > modules/NARs/JARs provide which capabilities, to allow for maximum
> > flexibility and ease of future development. I would prefer an
> > API/libraries module akin to nifi-hadoop-libraries-nar, which would
> > only include Apache Tinkerpop and any dependencies needed to do "pure"
> > graph stuff, so probably no TP adapters except tinkergraph (and/or its
> > faster fork from ShiftLeft [2]). The reason I say that is so NiFi
> > components (and even the framework!) could use graphs in a lightweight
> > manner, without lots of heavy and possibly unnecessary dependencies.
> > Imagine being able to query your own flows using Gremlin or Cypher! I
> > also envision an API much like the Record API in NiFi but for graphs,
> > so we'd have GraphReaders and GraphWriters perhaps, they could convert
> > from GraphML to GraphSON or Kryo for example, or in conjunction with a
> > ConvertRecordToGraph processor, could be used to support the
> > capability in aspect #3 above. I'd also be looking at bringing in
> > Gremlin to the scripting processors, or having a Gremlin based
> > scripting bundle as NiFi's graph capabilities mature.
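
As a small illustration of aspect #3 and the ConvertRecordToGraph idea, the sketch below turns a flat record into vertices and edges using a user-supplied mapping. It is a sketch only: the mapping layout, the field names and record_to_graph are hypothetical and not an existing NiFi or TinkerPop API.

```python
from typing import Any, Dict, List, Tuple

Vertex = Tuple[str, Any]            # (label, key value)
Edge = Tuple[Vertex, str, Vertex]   # (source, edge label, target)

# Hypothetical mapping: this is the domain knowledge the user would supply.
MAPPING: Dict[str, List[Dict[str, str]]] = {
    "vertices": [
        {"label": "Person", "key": "name"},
        {"label": "City", "key": "city"},
    ],
    "edges": [
        {"label": "lives_in", "from": "Person", "to": "City"},
    ],
}


def record_to_graph(
    record: Dict[str, Any], mapping: Dict[str, List[Dict[str, str]]]
) -> Tuple[List[Vertex], List[Edge]]:
    """Build one vertex per vertex spec and one edge per edge spec."""
    vertices: Dict[str, Vertex] = {}
    for spec in mapping["vertices"]:
        vertices[spec["label"]] = (spec["label"], record[spec["key"]])
    edges: List[Edge] = [
        (vertices[spec["from"]], spec["label"], vertices[spec["to"]])
        for spec in mapping["edges"]
    ]
    return list(vertices.values()), edges


if __name__ == "__main__":
    print(record_to_graph({"name": "Ada", "city": "London"}, MAPPING))
```

The conversion code stays generic, and everything domain-specific lives in the mapping, which is roughly the split a record-to-graph processor would need.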
> >
> > You might be able to tell I'm excited about this discussion ;) Should
> > we get a Wiki page going for ideas, and/or keep it going here, or
> > something else? I'm all ears for thoughts, questions, and ideas
> > (especially the ones that might seem crazy!)
> >
> > Regards,
> > Matt
> >
> > [1] http://tinkerpop.apache.org/providers.html
> > [2] https://github.com/ShiftLeftSecurity/tinkergraph-gremlin
> >
> > On Sat, May 12, 2018 at 8:02 AM, u...@moosheimer.com <u...@moosheimer.com> wrote:
> >> Hi Mike,
> >>
> >> graph database support is not quite as easy as it seems.
> >> Unlike relational databases, graphs have not only defined vertices and
> >> edges (labeled vertices and edges), they are directed or not and might
> >> have attributes at the nodes and edges, too.
> >>
> >> This makes it a bit confusing for a general interface.
> >>
> >> In general, a graph database should always be accessed via TinkerPop 3
> >> (or higher), since every professional graph database supports TinkerPop.
> >> TinkerPop is for graph databases what jdbc is for relational databases.
> >>
> >> I tried to create a general NiFi processor for graph databases myself
> >> and then quit.
> >> Unlike relational databases, graph databases usually have many
> >> dependencies.
> >>
> >> You do not simply create a data set but search for a particular vertex
> >> (which may still have certain edges) and create further edges and
> >> vertices at that.
> >> And the search for the correct node is usually context-related.
> >>
> >> This makes it difficult to do something general for all requirements.
> >>
> >> In any case I am looking forward to your concept and how you want to
> >> solve it.
> >> It's definitely a good idea but hard to solve.
> >>
> >> Btw.: You forgot the most important graph database - Janusgraph.
> >>
> >> Best regards
> >> Kay-Uwe Moosheimer
> >>
> >>> On 12.05.2018 at 13:01, Mike Thomsen <mikerthom...@gmail.com> wrote:
> >>>
> >>> I was wondering if anyone on the dev list had given much thought to
> >>> graph database support in NiFi. There are a lot of graph databases out
> >>> there, and many of them seem to be half-baked or barely supported.
> >>> Narrowing it down, it looks like the best candidates for a no fuss,
> >>> decent sized graph that we could build up with NiFi processors would be
> >>> OrientDB, Neo4J and ArangoDB. The first two are particularly attractive
> >>> because they offer JDBC drivers, which opens the potential to making
> >>> them even part of the standard JDBC-based processors.
> >>>
> >>> Anyone have any opinions or insights on this issue? I might have to do
> >>> OrientDB anyway, but if someone has a good feel for the market and can
> >>> make recommendations that would be appreciated.
> >>>
> >>> Thanks,
> >>>
> >>> Mike
>
