Re: [Discuss graph source/sink design proposal]

Saikat Kanjilal Wed, 13 Jul 2016 18:26:48 -0700

I suppose I'll need to do the same with neo4j.
Thanks

Sent from my iPhone


> On Jul 13, 2016, at 6:21 PM, Mike Percy <[email protected]> wrote:
> 
> For the Flume-Kafka integration we start up Kafka mini clusters in the unit
> tests. It depends on the server. The project doesn't have any permanent
> infrastructure in place with long running servers.
> 
> Mike
> 
> On Wed, Jul 13, 2016 at 5:37 PM, Saikat Kanjilal <[email protected]>
> wrote:
> 
>> Mike et al,
>> 
>> Out of curiosity, how do committers usually run integration tests when
>> doing flume sink development? At some point I will have the graph sink
>> talking to neo4j, and I would rather not test everything locally, since
>> local test performance would not reflect the actual sink performance.
>> Any ideas on how to get past this? I'm not there yet, but in a few weeks
>> I'll need to start perf/integration testing.
>> 
>> 
>> Thanks in advance.
>> 
>> 
>> ________________________________
>> From: Saikat Kanjilal <[email protected]>
>> Sent: Saturday, July 9, 2016 8:16 AM
>> To: [email protected]
>> Subject: Re: [Discuss graph source/sink design proposal]
>> 
>> Mike et al,
>> 
>> To clarify again, I'm starting with the hbase sink and modifying it to
>> match the graph use case.  This is probably why you saw the hbase stuff
>> still left over.  In a nutshell the design will look like the following:
>> 
>> 
>> flume->neo4j (sink workflow)
>> 
>> We batch events up from flume, use the neo4j bolt driver to convert the
>> batch of events into Cypher statements, and then send the data in bulk
>> into neo4j. One open question here is how many events go in a batch and
>> whether that should be dynamically configurable.
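
A minimal sketch of the batching step described above, assuming each event
body becomes one node. The class name CypherBatch, the method
toParameterMaps, the Event label and the body property are all hypothetical;
nothing here comes from the actual sink code. It only builds the statement
text and parameter maps and does not talk to neo4j. The `$rows` parameter
syntax is the one used by current Cypher; older releases wrote `{rows}`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns event bodies into parameter maps for one Cypher statement. */
final class CypherBatch {
    // Assumed statement shape: one node per event, created via UNWIND over a list parameter.
    static final String STATEMENT =
        "UNWIND $rows AS row CREATE (e:Event {body: row.body})";

    /** Splits the events into chunks of at most batchSize; returns one parameter map per chunk. */
    static List<Map<String, Object>> toParameterMaps(List<byte[]> eventBodies, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Map<String, Object>> batches = new ArrayList<>();
        for (int start = 0; start < eventBodies.size(); start += batchSize) {
            int end = Math.min(start + batchSize, eventBodies.size());
            List<Map<String, Object>> rows = new ArrayList<>();
            for (byte[] body : eventBodies.subList(start, end)) {
                Map<String, Object> row = new LinkedHashMap<>();
                // Assumption: event bodies are UTF-8 text.
                row.put("body", new String(body, StandardCharsets.UTF_8));
                rows.add(row);
            }
            Map<String, Object> params = new LinkedHashMap<>();
            params.put("rows", rows);
            batches.add(params);
        }
        return batches;
    }
}
```

Each returned map would be sent together with STATEMENT in a single driver
call, so the batch size decides how many events travel per round trip. If
the batch size is read from the sink configuration, it is at least
configurable per deployment, which is one possible answer to the open
question.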
>> 
>> 
>> neo4j->flume (source workflow)
>> 
>> We add event listeners inside neo4j and then send data back into flume
>> through these listeners. Here we'd need to be careful about sending
>> every single event; a batching strategy might also make sense, but it
>> takes away real-time updates.
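
The trade-off between batching and real-time updates can be sketched as a
buffer that flushes on either a size limit or an age limit. This is only an
illustration under stated assumptions: ChangeBuffer and its parameters are
hypothetical names, the caller passes in the clock value, and nothing here
uses the neo4j listener API or the Flume source API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical buffer: collects change notifications and flushes by size or age. */
final class ChangeBuffer<T> {
    private final int maxSize;
    private final long maxAgeMillis;
    private final Consumer<List<T>> flushTarget;
    private final List<T> pending = new ArrayList<>();
    private long oldestMillis;

    ChangeBuffer(int maxSize, long maxAgeMillis, Consumer<List<T>> flushTarget) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
        this.flushTarget = flushTarget;
    }

    /** Adds one change; flushes when the buffer is full or its oldest entry is too old. */
    synchronized void add(T change, long nowMillis) {
        if (pending.isEmpty()) {
            oldestMillis = nowMillis;
        }
        pending.add(change);
        if (pending.size() >= maxSize || nowMillis - oldestMillis >= maxAgeMillis) {
            flush();
        }
    }

    /** Hands a copy of the pending changes to the target and clears the buffer. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flushTarget.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

With a maximum age of 0 every change is forwarded immediately, which
corresponds to the real-time case; larger values trade latency for fewer,
bigger deliveries. Note that in this sketch the age check only runs when a
new change arrives, so a quiet period would need a separate timer calling
flush().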
>> 
>> 
>> More later as I make more progress. Also, your criteria for accepting
>> this sink are no different from those for contributions to any other open
>> source project. I'd also like to know if there's interest from the
>> community in connecting flume with neo4j, as that would generate more
>> feedback on the design.
>> 
>> Here's a blurb on the new neo4j java and other languages interface:
>> 
>> https://neo4j.com/blog/neo4j-3-0-language-drivers/
>> A Deeper Dive into Neo4j 3.0 Language Drivers<
>> https://neo4j.com/blog/neo4j-3-0-language-drivers/>
>> neo4j.com
>> Discover the four new language drivers for Neo4j 3.0 that provide easy
>> access to Neo4j through a uniform API, regardless of programming language.
>> 
>> 
>> 
>> 
>> 
>> Thanks
>> 
>> ________________________________
>> From: Mike Percy <[email protected]>
>> Sent: Friday, July 8, 2016 6:22 PM
>> To: [email protected]
>> Subject: Re: [Discuss graph source/sink design proposal]
>> 
>> Hi Saikat, please see my responses inline.
>> 
>> On Thu, Jul 7, 2016 at 8:50 PM, Saikat Kanjilal <[email protected]>
>> wrote:
>> 
>>> Ok moved the code to here:
>>> https://bitbucket.org/skanjila/flume-ng-graph-sink
>> 
>> 
>> It looks like mostly still HBaseSink code right now, just with a different
>> package name. I only looked at the Async one and that's what I found.
>> 
>>> Also I am exploring using the https://github.com/neo4j/neo4j-java-driver
>>> using the bolt protocol to connect to neo4j to stream events
>> 
>> I don't know anything about Neo4J personally. Unfortunately I don't have
>> time to really participate in development of this new sink using technology
>> I have no use for, myself. Maybe there are others on this list that have
>> the time and interest to help.
>> 
>>> Looking forward to getting feedback on this effort as y'all have time.
>> 
>> I apologize for not having the time to provide much guidance beyond the
>> capabilities of Flume itself.
>> 
>> In the future, as a committer on Flume, I would personally consider merging
>> Neo4J support into the Flume source tree if the following conditions were
>> met:
>> 
>> 1. Strong feedback from others that this connector is desired by multiple
>> members of the community
>> 2. An implementation that is well designed, tested, and production-grade
>> 3. A likely long-term maintainer (maybe that is you?)
>> 
>> The reason I hesitate to add more integrations into the core is that if
>> this breaks, and someone is using it, we will have to fix it. If someone
>> asks a question on the mailing lists, we will have to attempt to answer it.
>> 
>> Regards,
>> Mike
>> 
>> 
>>> From: Saikat Kanjilal <[email protected]>
>>> Sent: Thursday, July 7, 2016 9:31 AM
>>> To: [email protected]
>>> Subject: Re: [Discuss graph source/sink design proposal]
>>> 
>>> Would it be ok to use bitbucket instead?  I have indeed extended
>>> AbstractSink to build the graph sink, and I will depend on flume-ng-core
>>> in my pom as well.
>>> 
>>> Thanks and feel free to respond on the Cypher discussion as well as the
>>> other items I mentioned earlier.
>>> 
>>> 
>>> ________________________________
>>> From: Mike Percy <[email protected]>
>>> Sent: Monday, July 4, 2016 12:03 PM
>>> To: [email protected]
>>> Subject: Re: [Discuss graph source/sink design proposal]
>>> 
>>> Hi Saikat,
>>> I recommend you use GitHub. Private branches in ASF repos are only
>>> available to committers.
>>> 
>>> Regarding forking Flume, you should not need to do that. Just depend on
>>> flume-ng-core in your pom and extend AbstractSink. Maven will pull in
>>> your deps.
>>> 
>>> I'm out of town for the next few days but I'll try to respond in more
>>> detail to your design notes when I'm back in town.
>>> 
>>> Mike
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Jul 4, 2016, at 6:59 AM, Saikat Kanjilal <[email protected]> wrote:
>>>> 
>>>> Hari/Mike et al,
>>>> 
>>>> I need a place to put interim checkins related to this work. Is it
>>>> possible to get write privileges into a private branch so that I can
>>>> commit my code at intermediate junctures? I can also put it in
>>>> bitbucket, but would rather not create yet another place for the code
>>>> to live if it'll eventually end up in the flume repo.
>>>> 
>>>> 
>>>> Thanks in advance
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Saikat Kanjilal <[email protected]>
>>>> Sent: Thursday, June 30, 2016 10:16 PM
>>>> To: [email protected]
>>>> Subject: RE: [Discuss graph source/sink design proposal]
>>>> 
>>>> So I've started the coding efforts on this, here are some details:
>>>> 1) I've cloned the hbase sink for now and am refactoring all of that
>>>>    code to work with neo4j as a start
>>>> 2) I'm only focusing on creating a sink that will perform basic CRUD
>>>>    streaming operations into neo4j
>>>> 3) I've sent an email to the neo4j guys to figure out details around
>>>>    building a streaming architecture with the neo4j kernel
>>>> 4) In the meantime how would you guys like to review the code? I've
>>>>    cloned the flume repo and have created a branch called flume-2035
>>>>    where I will work. Should I put all the code in bitbucket and send
>>>>    out periodic reviews? This is going to be a sizeable effort
>>>> 5) How should we think about Cypher related workflows as they relate
>>>>    to the streaming data coming in? To see the full flavor of Cypher go
>>>>    here: https://neo4j.com/developer/cypher-query-language/
>>>> Neo4j's Graph Query Language: An Introduction to Cypher<
>>>> https://neo4j.com/developer/cypher-query-language/>
>>>> neo4j.com
>>>> Master the basics of Cypher – the graph query language for Neo4j – with
>>>> this introductory guide that teaches you how to read and write Cypher
>>>> queries.
>>> 
>>> 
>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Would love to get some discussion going on 2-5.
>>>> Thanks
>>>> 
>>>>> From: [email protected]
>>>>> Date: Wed, 29 Jun 2016 17:24:16 -0700
>>>>> Subject: Re: [Discuss graph source/sink design proposal]
>>>>> To: [email protected]
>>>>> 
>>>>> Hmm, maybe a different Kudu project? Not sure.
>>>>> 
>>>>> Anyway, this type of "changelog" thing would require support in the DB
>>>>> for streaming its write-ahead log or something. For example, we don't
>>>>> support that in Apache Kudu (incubating) -- maybe someday.
>>>>> 
>>>>> Regarding Flume, I usually think it's useful to distinguish between a
>>>>> source and a sink. They are typically written as separate classes and
>>>>> they represent different interfaces at the Flume Java API level.
>>>>> 
>>>>> So, how would one write a streaming database source? That really
>>>>> depends on the database and the APIs it provides for that.
>>>>> 
>>>>> Mike
>>>>> 
>>>>> On Tue, Jun 28, 2016 at 8:30 AM, Saikat Kanjilal <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> :) I'm using Kudu at work at the moment to troubleshoot some Tomcat
>>>>>> issues. Regarding where to keep the source code, I would say for now
>>>>>> let's go with the plugin approach and revisit the "where does the code
>>>>>> live" conversation later.  One thing I do want to discuss is that the
>>>>>> plugin will act as a source or a sink depending on configuration, so if
>>>>>> the plugin acts as a source we need a mechanism (like a daemon in
>>>>>> syslog) to stream changes in real time from a graphdb into flume. I was
>>>>>> wondering if there are any past approaches around this that I can
>>>>>> follow. I may need to dig into the neo4j kernel to see where we can
>>>>>> inject something like this.
>>>>>> Thoughts on that?
>>>>>> 
>>>>>>> From: [email protected]
>>>>>>> Date: Tue, 28 Jun 2016 00:27:45 -0700
>>>>>>> Subject: Re: [Discuss graph source/sink design proposal]
>>>>>>> To: [email protected]
>>>>>>> 
>>>>>>> Hi Saikat,
>>>>>>> Please see my thoughts inline. This is how I think about this stuff;
>>>>>>> others may think about it differently.
>>>>>>> 
>>>>>>> On Mon, Jun 27, 2016 at 8:45 PM, Saikat Kanjilal <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Exactly right, I'm proposing we create a graph sink for flume while
>>>>>>>> keeping the flume core intact.
>>>>>>> 
>>>>>>> 
>>>>>>> As you are probably aware, sources and sinks don't have to be part of
>>>>>>> the main Apache Flume source tree to be used with Flume. The plugins.d
>>>>>>> mechanism described in [1] makes building and integrating separate
>>>>>>> plugins into Flume an easy thing to do at deployment time.
>>>>>>> 
>>>>>>> In another project I work on, Apache Kudu (incubating), we have a
>>>>>>> Flume Kudu sink committed in the main source tree [2]. We may at some
>>>>>>> point propose to move it into the Flume source tree, but for now (for
>>>>>>> testing and API stability reasons) it's easier to keep it in the Kudu
>>>>>>> source tree.
>>>>>>> 
>>>>>>> Likewise, you could implement a Flume Neo4J sink and post it up on
>>>>>>> GitHub (or maybe in the Neo4J tree?). Donating it to the Apache Flume
>>>>>>> project once it's in decent shape may make sense at some point,
>>>>>>> especially if the dependencies are easy to share and integrate into
>>>>>>> the Flume project. However, I wouldn't say that it's a foregone
>>>>>>> conclusion that it really needs to be part of the Flume source tree.
>>>>>>> Assuming you need the sink, and are going to implement it anyway, then
>>>>>>> maybe we can defer the discussion of whether to include it in the
>>>>>>> Flume source tree until later. One of the things I try to keep in mind
>>>>>>> when integrating new plugin code is whether the project will be able
>>>>>>> to support the maintenance burden of the new code.
>>>>>>> 
>>>>>>>> In reading from a graph db we need a mechanism to stream data from
>>>>>>>> the graph store into flume.
>>>>>>> 
>>>>>>> Yes, I'd say it could potentially make sense to create a Flume Neo4J
>>>>>>> source as well. I think the same logic as above would still apply.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Mike
>>>>>>> 
>>>>>>> [1] https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
>>>>>>> [2] https://github.com/apache/incubator-kudu/tree/master/java/kudu-flume-sink
>>
