Uwe and Matt,

Now that we're dipping our toes into Neo4J and Cypher, any thoughts on this?

https://github.com/opencypher/cypher-for-gremlin

I'm wondering if we shouldn't work with mans2singh to take the Neo4J work
and push it further into a client API that lets us inject either a service
that uses that project or one that uses Neo4J's drivers.

Mike

On Mon, May 14, 2018 at 7:13 AM Otto Fowler <ottobackwa...@gmail.com> wrote:

> The wiki discussion should list these and other points of concern and
> should document the extent to which they are to be addressed.
>
>
> On May 12, 2018 at 12:37:59, u...@moosheimer.com (u...@moosheimer.com)
> wrote:
>
> Matt,
>
> You have some interesting ideas that I really like.
> GraphReaders and GraphWriters would be interesting. When I started
> writing a graph processor with my idea, the concept was not yet
> implemented in NiFi.
> I don't find GraphML and GraphSON that exciting because they contain
> e.g. the Vertex/Edge IDs and, to my knowledge, serve as import and
> export formats (correct me if I'm wrong).
>
> A ConvertRecordToGraph processor is a good approach; the only question
> is which formats we can convert from.
>
> I also think that to make a graph processor reasonably general, we
> would have to provide a query as input that finds the correct vertex
> from which the graph should be extended.
> Maybe like your suggestion with a Gremlin query or a small Gremlin
> script.
>
> If a vertex is found, a new edge and a new vertex are added.
> The question is how we pass the individual attributes, as well as the
> labels, to the new edge and vertex. Possibly with NiFi attributes?
>
> I have some headaches about the complexity.
> A small example:
> Imagine we have a set from a CSV file.
> The columns are Set ID, Token1, Token2, Token3...
> ID, Token1,Token2,Token3,Token4,Token5
> 123, Mary, had, a, little, lamp
>
> I want to create a vertex with ID 123 (if it does not exist). Then I
> want to check for each token whether a vertex exists in the graph
> database (search for a vertex with label "Token" and attribute
> "name"="Mary").
> If the vertex does not exist, it has to be created.
> Since I want to save e.g. Wikipedia to my graph, I want to avoid the
> supernode problem for the token vertices. I create a few distribution
> vertices for each vertex that belongs to a token. If there is a vertex
> for Token1 (Mary), then I don't want to make the edge from this vertex
> to my vertex with the ID 123, but from one of the distribution vertices.
> If the vertex for the token does not exist, the distribution vertices
> also have to be created ... and so on...
>
> Even with this very simple example, it seems to become difficult with a
> universal processor.
>
> In any case, I think the idea of implementing a graph processor in NiFi
> is a good one.
> The more we work on it, the more good ideas we get, and maybe it's only
> me who can't see the forest for the trees.
>
> One question about Titan. To my knowledge, Titan has been dead for a
> year and a half and Janusgraph is the successor?
> Titan has unofficially become Datastax Enterprise Graph?!
> Supporting Titan could become difficult because, to my knowledge, Titan
> does not support anything after TinkerPop 3 and is no longer maintained.
>
> I like your idea of a wiki page for more ideas. Otherwise one gets lost
> in the many mails.
>
> Regards,
> Kay-Uwe
>
> On 12.05.2018 at 16:52, Matt Burgess wrote:
> > All,
> >
> > As Joe implied, I'm very happy that we are discussing graph tech in
> > relation to NiFi! NiFi and Graph theory/tech/analytics are passions of
> > mine. Mike, the examples you list are great, I would add Titan (and
> > its fork Janusgraph as Kay-Uwe mentioned) and Azure CosmosDB (these
> > and others are at [1]). I think there are at least four aspects to
> > this:
> >
> > 1) Graph query/traversal: This deals with getting data out of a graph
> > database and into flow file(s) for further processing. Here I agree
> > with Kay-Uwe that we should consider Apache Tinkerpop as the main
> > library for graph query/traversal, for a few reasons.
> > The first, as Kay-Uwe said, is that there are many adapters for
> > Tinkerpop (TP) to connect to various databases; from Mike's list I
> > believe ArangoDB is the only one that does not yet have a TP adapter.
> > The second is informed by the first: TP is a standard interface and
> > graph traversal engine with a common DSL in Gremlin. A third is that
> > Gremlin is a Groovy-based DSL, and Groovy syntax is fairly close to
> > Java 8+ syntax and you can call Groovy/Gremlin from Java and vice
> > versa. A fourth is that Tinkerpop is an Apache TLP with a very active
> > and vibrant community, so we will be able to reap the benefits of all
> > the graph goodness they develop moving forward. I think a QueryGraph
> > processor could be appropriate, perhaps with a GraphDBConnectionPool
> > controller service or something of the like. Apache DBCP can't do the
> > pooling for us, but we could implement something similar to that for
> > pooling TP connections.
> >
> > 2) Graph ingest: This one IMO is the long pole in the tent. Gremlin is
> > a graph traversal language, and although its API has addVertex() and
> > addEdge() methods and such, it seems like an inefficient solution,
> > akin to using individual INSERTs in an RDBMS rather than a
> > PreparedStatement or a bulk load. Keeping the analogy, bulk loading in
> > RDBMSs is usually specific to that DB, and the same goes for graphs.
> > The Titan-based ones have Titan-Hadoop (formerly Faunus), Neo4j has
> > external tools (not sure if there's a Java API or not) and Cypher,
> > OrientDB has an ETL pipeline system, etc. If we have a standard Graph
> > concept, we could have controller services / writers that are
> > system-specific (see aspect #4).
> >
> > 3) Arbitrary data -> Graph: Converting non-graph data into a graph
> > almost always takes domain knowledge, which NiFi itself won't have and
> > will thus have to be provided by the user.
> > We'd need to make it as
> > simple as possible but also as powerful and flexible as possible in
> > order to get the most value. We can investigate how each of the
> > systems in aspect #2 approaches this, and perhaps come up with a good
> > user experience around it.
> >
> > 4) Organization and implementation: I think we should make sure to
> > keep the capabilities very loosely coupled in terms of which
> > modules/NARs/JARs provide which capabilities, to allow for maximum
> > flexibility and ease of future development. I would prefer an
> > API/libraries module akin to nifi-hadoop-libraries-nar, which would
> > only include Apache Tinkerpop and any dependencies needed to do "pure"
> > graph stuff, so probably no TP adapters except tinkergraph (and/or its
> > faster fork from ShiftLeft [2]). The reason I say that is so NiFi
> > components (and even the framework!) could use graphs in a lightweight
> > manner, without lots of heavy and possibly unnecessary dependencies.
> > Imagine being able to query your own flows using Gremlin or Cypher! I
> > also envision an API much like the Record API in NiFi but for graphs,
> > so we'd have GraphReaders and GraphWriters perhaps, they could convert
> > from GraphML to GraphSON or Kryo for example, or in conjunction with a
> > ConvertRecordToGraph processor, could be used to support the
> > capability in aspect #3 above. I'd also be looking at bringing in
> > Gremlin to the scripting processors, or having a Gremlin based
> > scripting bundle as NiFi's graph capabilities mature.
> >
> > You might be able to tell I'm excited about this discussion ;) Should
> > we get a Wiki page going for ideas, and/or keep it going here, or
> > something else? I'm all ears for thoughts, questions, and ideas
> > (especially the ones that might seem crazy!)
> >
> > Regards,
> > Matt
> >
> > [1] http://tinkerpop.apache.org/providers.html
> > [2] https://github.com/ShiftLeftSecurity/tinkergraph-gremlin
> >
> > On Sat, May 12, 2018 at 8:02 AM, u...@moosheimer.com <u...@moosheimer.com> wrote:
> >> Hi Mike,
> >>
> >> Graph database support is not quite as easy as it seems.
> >> Unlike relational databases, graphs not only have defined vertices
> >> and edges (labeled vertices and edges); they are also directed or
> >> undirected and might have attributes on the nodes and edges, too.
> >>
> >> This makes it a bit confusing for a general interface.
> >>
> >> In general, a graph database should always be accessed via TinkerPop 3
> >> (or higher), since every professional graph database supports
> >> TinkerPop.
> >> TinkerPop is for graph databases what JDBC is for relational databases.
> >>
> >> I tried to create a general NiFi processor for graph databases myself
> >> and then quit.
> >> Unlike relational databases, graph databases usually have many
> >> dependencies.
> >>
> >> You do not simply create a data set; you search for a particular
> >> vertex (which may already have certain edges) and create further edges
> >> and vertices attached to it.
> >> And the search for the correct node is usually context-related.
> >>
> >> This makes it difficult to do something general for all requirements.
> >>
> >> In any case, I am looking forward to your concept and how you want to
> >> solve it.
> >> It's definitely a good idea but hard to solve.
> >>
> >> Btw.: You forgot the most important graph database - Janusgraph.
> >>
> >> Best regards
> >> Kay-Uwe Moosheimer
> >>
> >>> On 12.05.2018 at 13:01, Mike Thomsen <mikerthom...@gmail.com> wrote:
> >>>
> >>> I was wondering if anyone on the dev list had given much thought to
> >>> graph database support in NiFi. There are a lot of graph databases
> >>> out there, and many of them seem to be half-baked or barely
> >>> supported.
> >>> Narrowing it down, it looks like the best candidates for a no-fuss,
> >>> decent-sized graph that we could build up with NiFi processors would
> >>> be OrientDB, Neo4J and ArangoDB.
> >>> The first two are particularly attractive because they offer JDBC
> >>> drivers, which opens up the potential of even making them part of the
> >>> standard JDBC-based processors.
> >>>
> >>> Anyone have any opinions or insights on this issue? I might have to do
> >>> OrientDB anyway, but if someone has a good feel for the market and can
> >>> make recommendations, that would be appreciated.
> >>>
> >>> Thanks,
> >>>
> >>> Mike
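
To make Kay-Uwe's CSV example from earlier in the thread a bit more concrete, here is a minimal Python sketch of the "create if missing, then link through a distribution vertex" logic. It is only an illustration under loud assumptions: the `Graph` class is a hypothetical in-memory stand-in, not TinkerPop, Gremlin, or any NiFi API, and the names `FANOUT`, `ingest_row`, `ingest_csv`, the labels "Set" and "Distribution", and the edge labels are all made up for this sketch. A real processor would issue the equivalent lookups and writes against the graph database.

```python
import csv
import io
import zlib

FANOUT = 4  # hypothetical number of distribution vertices per token


class Graph:
    """Tiny in-memory stand-in for a graph database (illustration only)."""

    def __init__(self):
        self.vertices = {}  # (label, key) -> property dict
        self.edges = set()  # (source vertex, edge label, target vertex)

    def get_or_create(self, label, key, **props):
        """Return (vertex id, created flag); create the vertex only if missing."""
        vid = (label, key)
        created = vid not in self.vertices
        if created:
            self.vertices[vid] = props
        return vid, created

    def add_edge(self, src, label, dst):
        # A set keeps repeated ingests from duplicating edges.
        self.edges.add((src, label, dst))


def ingest_row(graph, set_id, tokens):
    """Add one CSV row: a set vertex linked to tokens via distribution vertices."""
    set_vertex, _ = graph.get_or_create("Set", set_id)
    for token in tokens:
        token_vertex, created = graph.get_or_create("Token", token, name=token)
        if created:
            # A new token vertex gets its distribution vertices right away.
            for slot in range(FANOUT):
                dist, _ = graph.get_or_create("Distribution", f"{token}#{slot}")
                graph.add_edge(token_vertex, "distributes", dist)
        # Pick one distribution vertex deterministically so that no single
        # vertex collects every edge for a frequent token.
        slot = zlib.crc32(f"{set_id}:{token}".encode()) % FANOUT
        graph.add_edge(("Distribution", f"{token}#{slot}"), "occurs_in", set_vertex)
    return set_vertex


def ingest_csv(graph, text):
    """Read CSV text with a header row and ingest every data row."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    for row in reader:
        if row:
            ingest_row(graph, row[0], row[1:])
```

Calling `ingest_csv(Graph(), ...)` with the two CSV lines from the example creates one set vertex, one vertex per token, and `FANOUT` distribution vertices per token. The set vertex is only ever reached from a distribution vertex, never directly from a token vertex. Even this toy version hard-codes the labels, the lookup key, and the fan-out policy, which is exactly the domain knowledge a universal processor would have to take from the user.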