+1 for the wiki page
On May 12, 2018 at 10:52:43, Matt Burgess (mattyb...@apache.org) wrote:

All,

As Joe implied, I'm very happy that we are discussing graph tech in relation to NiFi! NiFi and graph theory/tech/analytics are passions of mine. Mike, the examples you list are great; I would add Titan (and its fork JanusGraph, as Kay-Uwe mentioned) and Azure CosmosDB (these and others are at [1]). I think there are at least four aspects to this:

1) Graph query/traversal: This deals with getting data out of a graph database and into flow file(s) for further processing. Here I agree with Kay-Uwe that we should consider Apache TinkerPop as the main library for graph query/traversal, for a few reasons. The first, as Kay-Uwe said, is that there are many adapters for TinkerPop (TP) to connect to various databases; from Mike's list, I believe ArangoDB is the only one that does not yet have a TP adapter. The second, informed by the first, is that TP is a standard interface and graph traversal engine with a common DSL in Gremlin. A third is that Gremlin is a Groovy-based DSL, Groovy syntax is fairly close to Java 8+ syntax, and you can call Groovy/Gremlin from Java and vice versa. A fourth is that TinkerPop is an Apache TLP with a very active and vibrant community, so we will be able to reap the benefits of all the graph goodness they develop moving forward. I think a QueryGraph processor could be appropriate, perhaps with a GraphDBConnectionPool controller service or something of the like. Apache DBCP can't do the pooling for us, but we could implement something similar for pooling TP connections.

2) Graph ingest: This one, IMO, is the long pole in the tent. Gremlin is a graph traversal language, and although its API has addVertex() and addEdge() methods and such, that seems like an inefficient solution, akin to using individual INSERTs in an RDBMS rather than a PreparedStatement or a bulk load. Keeping the analogy, bulk loading in RDBMSs is usually specific to that DB, and the same goes for graphs: the Titan-based ones have Titan-Hadoop (formerly Faunus), Neo4j has external tools (not sure whether there's a Java API) and Cypher, OrientDB has an ETL pipeline system, etc. If we have a standard Graph concept, we could have controller services / writers that are system-specific (see aspect #4).

3) Arbitrary data -> Graph: Converting non-graph data into a graph almost always takes domain knowledge, which NiFi itself won't have and which will thus have to be provided by the user. We'd need to make it as simple as possible but also as powerful and flexible as possible in order to get the most value. We can investigate how each of the systems in aspect #2 approaches this and perhaps come up with a good user experience around it.

4) Organization and implementation: I think we should make sure to keep the capabilities very loosely coupled in terms of which modules/NARs/JARs provide which capabilities, to allow for maximum flexibility and ease of future development. I would prefer an API/libraries module akin to nifi-hadoop-libraries-nar, which would only include Apache TinkerPop and any dependencies needed to do "pure" graph stuff, so probably no TP adapters except TinkerGraph (and/or its faster fork from ShiftLeft [2]). The reason I say that is so NiFi components (and even the framework!) could use graphs in a lightweight manner, without lots of heavy and possibly unnecessary dependencies. Imagine being able to query your own flows using Gremlin or Cypher!
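To make that last idea a bit more concrete, here is a rough sketch (not any proposed NiFi API) of what "querying a flow as a graph" could look like using only the lightweight pieces above: plain TinkerPop plus the in-memory TinkerGraph. The processor names, the "connectsTo" edge label, and the property keys are all made up for illustration:

    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
    import org.apache.tinkerpop.gremlin.structure.Graph;
    import org.apache.tinkerpop.gremlin.structure.T;
    import org.apache.tinkerpop.gremlin.structure.Vertex;
    import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

    import java.util.List;

    public class FlowGraphSketch {
        public static void main(String[] args) {
            // In-memory graph; no external database or TP adapter needed
            Graph graph = TinkerGraph.open();

            // Hypothetical model: processors as vertices, connections as edges
            Vertex generate = graph.addVertex(T.label, "processor", "name", "GenerateFlowFile");
            Vertex convert  = graph.addVertex(T.label, "processor", "name", "ConvertRecord");
            Vertex put      = graph.addVertex(T.label, "processor", "name", "PutHDFS");
            generate.addEdge("connectsTo", convert, "relationship", "success");
            convert.addEdge("connectsTo", put, "relationship", "success");

            // Gremlin traversal: everything downstream of GenerateFlowFile
            GraphTraversalSource g = graph.traversal();
            List<Object> downstream = g.V()
                    .has("processor", "name", "GenerateFlowFile")
                    .repeat(__.out("connectsTo")).emit()
                    .values("name")
                    .toList();
            System.out.println(downstream); // [ConvertRecord, PutHDFS]
        }
    }

The same traversal would run unchanged against a real graph database sitting behind a TP adapter, which is the main argument for standardizing on TinkerPop rather than on any one vendor's client API.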
I also envision an API much like the Record API in NiFi but for graphs, so perhaps we'd have GraphReaders and GraphWriters; they could convert from GraphML to GraphSON or Kryo, for example, or, in conjunction with a ConvertRecordToGraph processor, could be used to support the capability in aspect #3 above (a rough sketch of that kind of format conversion appears at the end of this thread). I'd also be looking at bringing Gremlin into the scripting processors, or having a Gremlin-based scripting bundle as NiFi's graph capabilities mature.

You might be able to tell I'm excited about this discussion ;) Should we get a wiki page going for ideas, and/or keep it going here, or something else? I'm all ears for thoughts, questions, and ideas (especially the ones that might seem crazy!)

Regards,
Matt

[1] http://tinkerpop.apache.org/providers.html
[2] https://github.com/ShiftLeftSecurity/tinkergraph-gremlin

On Sat, May 12, 2018 at 8:02 AM, u...@moosheimer.com <u...@moosheimer.com> wrote:
> Hi Mike,
>
> Graph database support is not quite as easy as it seems.
> Unlike relational databases, graphs don't just have defined (labeled) vertices and edges; they can be directed or undirected and may have attributes on the nodes and edges, too.
>
> This makes a general interface a bit tricky.
>
> In general, a graph database should always be accessed via TinkerPop 3 (or higher), since every professional graph database supports TinkerPop.
> TinkerPop is for graph databases what JDBC is for relational databases.
>
> I tried to create a general NiFi processor for graph databases myself and then gave up.
> Unlike relational databases, graph databases usually have many dependencies.
>
> You do not simply create a data set; you search for a particular vertex (which may need to have certain edges) and then create further edges and vertices on it.
> And the search for the correct node is usually context-dependent.
>
> This makes it difficult to do something general that covers all requirements.
>
> In any case, I am looking forward to your concept and how you want to solve it.
> It's definitely a good idea, but hard to solve.
>
> Btw.: You forgot the most important graph database - JanusGraph.
>
> Best regards
> Kay-Uwe Moosheimer
>
>> On 12.05.2018, at 13:01, Mike Thomsen <mikerthom...@gmail.com> wrote:
>>
>> I was wondering if anyone on the dev list had given much thought to graph
>> database support in NiFi. There are a lot of graph databases out there, and
>> many of them seem to be half-baked or barely supported. Narrowing it down,
>> it looks like the best candidates for a no-fuss, decent-sized graph that we
>> could build up with NiFi processors would be OrientDB, Neo4j and ArangoDB.
>> The first two are particularly attractive because they offer JDBC drivers,
>> which opens the potential of even making them part of the standard
>> JDBC-based processors.
>>
>> Anyone have any opinions or insights on this issue? I might have to do
>> OrientDB anyway, but if someone has a good feel for the market and can make
>> recommendations, that would be appreciated.
>>
>> Thanks,
>>
>> Mike
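As a footnote to the GraphReader/GraphWriter idea above: stock TinkerPop can already move a graph between its serialization formats (GraphML, GraphSON, and Gryo, its Kryo-based format), so a converter sketch needs nothing NiFi-specific yet. A minimal example, assuming made-up file names and using the in-memory TinkerGraph rather than any proposed NiFi API:

    import org.apache.tinkerpop.gremlin.structure.Graph;
    import org.apache.tinkerpop.gremlin.structure.io.IoCore;
    import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

    public class GraphFormatConversionSketch {
        public static void main(String[] args) throws Exception {
            // Load a GraphML document into an in-memory TinkerGraph
            Graph graph = TinkerGraph.open();
            graph.io(IoCore.graphml()).readGraph("flow.graphml");   // hypothetical input file

            // Write the same graph back out as GraphSON
            // (IoCore.gryo() would give the Kryo-based binary format instead)
            graph.io(IoCore.graphson()).writeGraph("flow.json");    // hypothetical output file
        }
    }

A GraphReader/GraphWriter pair would presumably wrap exactly this kind of round trip, with the in-memory Graph as the common intermediate representation, much as the Record API uses its record schema.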