Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
Code updated. Sorry, I uploaded the wrong branch before.

On Fri, Jan 16, 2015 at 2:13 PM, Kushal Datta 
wrote:

> The source code is under a new module named 'graphx'. let me double check.
>
> On Fri, Jan 16, 2015 at 2:11 PM, Kyle Ellrott 
> wrote:
>
>> Looking at https://github.com/kdatta/tinkerpop3/compare/graphx-gremlin I
>> only see a maven build file. Do you have some source code some place else?
>>
>> I've worked on a spark based implementation (
>> https://github.com/kellrott/spark-gremlin ), but its not done and I've
>> been tied up on other projects.
>> It also look Tinkerpop3 is a bit of a moving target. I had targeted the
>> work done for gremlin-giraph (
>> http://www.tinkerpop.com/docs/3.0.0.M5/#giraph-gremlin ) that was part
>> of the M5 release, as a base model for implementation. But that appears to
>> have been refactored into gremlin-hadoop (
>> http://www.tinkerpop.com/docs/3.0.0.M6/#hadoop-gremlin ) in the M6
>> release. I need to assess how much this changes the code.
>>
>> Most of the code that needs to be changes from Giraph to Spark will be
>> simply replacing classes with spark derived ones. The main place where the
>> logic will need changed is in the 'GraphComputer' class (
>> https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java
>> ) which is created by the Graph when the 'compute' method is called (
>> https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/structure/HadoopGraph.java#L135
>> ).
>>
>>
>> Kyle
>>
>>
>>
>> On Fri, Jan 16, 2015 at 1:01 PM, Kushal Datta 
>> wrote:
>>
>>> Hi David,
>>>
>>>
>>> Yes, we are still headed in that direction.
>>> Please take a look at the repo I sent earlier.
>>> I think that's a good starting point.
>>>
>>> Thanks,
>>> -Kushal.
>>>
>>> On Thu, Jan 15, 2015 at 8:31 AM, David Robinson 
>>> wrote:
>>>
>>> > I am new to Spark and GraphX, however, I use Tinkerpop backed graphs
>>> > and think the idea of using Tinkerpop as the API for GraphX is a great
>>> > idea and hope you are still headed in that direction.  I noticed that
>>> > Tinkerpop 3 is moving into the Apache family:
>>> > http://wiki.apache.org/incubator/TinkerPopProposal  which might
>>> > alleviate concerns about having an API definition "outside" of Spark.
>>> >
>>> > Thanks,
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> >
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Implementing-TinkerPop-on-top-of-GraphX-tp9169p10126.html
>>> > Sent from the Apache Spark Developers List mailing list archive at
>>> > Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>> >
>>> >
>>>
>>
>>
>


Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
The source code is under a new module named 'graphx'. Let me double-check.

On Fri, Jan 16, 2015 at 2:11 PM, Kyle Ellrott  wrote:

> Looking at https://github.com/kdatta/tinkerpop3/compare/graphx-gremlin I
> only see a maven build file. Do you have some source code some place else?
>
> I've worked on a spark based implementation (
> https://github.com/kellrott/spark-gremlin ), but its not done and I've
> been tied up on other projects.
> It also look Tinkerpop3 is a bit of a moving target. I had targeted the
> work done for gremlin-giraph (
> http://www.tinkerpop.com/docs/3.0.0.M5/#giraph-gremlin ) that was part of
> the M5 release, as a base model for implementation. But that appears to
> have been refactored into gremlin-hadoop (
> http://www.tinkerpop.com/docs/3.0.0.M6/#hadoop-gremlin ) in the M6
> release. I need to assess how much this changes the code.
>
> Most of the code that needs to be changes from Giraph to Spark will be
> simply replacing classes with spark derived ones. The main place where the
> logic will need changed is in the 'GraphComputer' class (
> https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java
> ) which is created by the Graph when the 'compute' method is called (
> https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/structure/HadoopGraph.java#L135
> ).
>
>
> Kyle
>
>
>
> On Fri, Jan 16, 2015 at 1:01 PM, Kushal Datta 
> wrote:
>
>> Hi David,
>>
>>
>> Yes, we are still headed in that direction.
>> Please take a look at the repo I sent earlier.
>> I think that's a good starting point.
>>
>> Thanks,
>> -Kushal.
>>
>> On Thu, Jan 15, 2015 at 8:31 AM, David Robinson 
>> wrote:
>>
>> > I am new to Spark and GraphX, however, I use Tinkerpop backed graphs and
>> > think the idea of using Tinkerpop as the API for GraphX is a great idea
>> > and hope you are still headed in that direction.  I noticed that
>> > Tinkerpop 3 is moving into the Apache family:
>> > http://wiki.apache.org/incubator/TinkerPopProposal  which might
>> > alleviate concerns about having an API definition "outside" of Spark.
>> >
>> > Thanks,
>> >
>>
>
>


Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kyle Ellrott
Looking at https://github.com/kdatta/tinkerpop3/compare/graphx-gremlin I
only see a Maven build file. Do you have source code somewhere else?

I've worked on a Spark-based implementation (
https://github.com/kellrott/spark-gremlin ), but it's not done and I've been
tied up on other projects.
It also looks like TinkerPop3 is a bit of a moving target. I had targeted the
work done for gremlin-giraph (
http://www.tinkerpop.com/docs/3.0.0.M5/#giraph-gremlin ), which was part of
the M5 release, as a base model for the implementation. But that appears to
have been refactored into gremlin-hadoop (
http://www.tinkerpop.com/docs/3.0.0.M6/#hadoop-gremlin ) in the M6 release.
I need to assess how much this changes the code.

Most of the code that needs to change from Giraph to Spark will simply
involve replacing classes with Spark-derived ones. The main place where the
logic will need to change is in the 'GraphComputer' class (
https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java
), which is created by the Graph when the 'compute' method is called (
https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/structure/HadoopGraph.java#L135
).
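
To make the shape of that change concrete, here is a minimal sketch in
plain Python, not the TinkerPop or Spark API. It only illustrates the
pattern described above: the graph's 'compute' method creates the
engine-specific computer, so moving from Giraph to Spark mostly means
supplying a different computer class. All class names and the 'submit'
signature are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Type

# Hypothetical: a vertex program maps one vertex value to a new value.
VertexProgram = Callable[[int], int]


class GraphComputer(ABC):
    """Hypothetical stand-in for an engine-specific graph computer."""

    def __init__(self, vertices: List[int]) -> None:
        self.vertices = vertices

    @abstractmethod
    def submit(self, program: VertexProgram) -> List[int]:
        """Run the program over every vertex and return the new values."""


class GiraphStyleComputer(GraphComputer):
    def submit(self, program: VertexProgram) -> List[int]:
        # Placeholder for handing the program to a Giraph job.
        return [program(v) for v in self.vertices]


class SparkStyleComputer(GraphComputer):
    def submit(self, program: VertexProgram) -> List[int]:
        # Placeholder for expressing the program as Spark transformations.
        return list(map(program, self.vertices))


class Graph:
    """Hypothetical graph that owns the choice of execution engine."""

    def __init__(self, vertices: List[int],
                 computer_class: Type[GraphComputer]) -> None:
        self.vertices = vertices
        self.computer_class = computer_class

    def compute(self) -> GraphComputer:
        # The graph, not the caller, decides which computer to instantiate.
        return self.computer_class(self.vertices)
```

Under these assumptions, calling code such as
`Graph([1, 2, 3], SparkStyleComputer).compute().submit(f)` stays the same
whichever engine class the graph was built with.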


Kyle



On Fri, Jan 16, 2015 at 1:01 PM, Kushal Datta 
wrote:

> Hi David,
>
>
> Yes, we are still headed in that direction.
> Please take a look at the repo I sent earlier.
> I think that's a good starting point.
>
> Thanks,
> -Kushal.
>
> On Thu, Jan 15, 2015 at 8:31 AM, David Robinson 
> wrote:
>
> > I am new to Spark and GraphX, however, I use Tinkerpop backed graphs and
> > think the idea of using Tinkerpop as the API for GraphX is a great idea
> > and hope you are still headed in that direction.  I noticed that
> > Tinkerpop 3 is moving into the Apache family:
> > http://wiki.apache.org/incubator/TinkerPopProposal  which might
> > alleviate concerns about having an API definition "outside" of Spark.
> >
> > Thanks,
> >
>


Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
Hi David,


Yes, we are still headed in that direction.
Please take a look at the repo I sent earlier.
I think that's a good starting point.

Thanks,
-Kushal.

On Thu, Jan 15, 2015 at 8:31 AM, David Robinson 
wrote:

> I am new to Spark and GraphX, however, I use Tinkerpop backed graphs and
> think the idea of using Tinkerpop as the API for GraphX is a great idea and
> hope you are still headed in that direction.  I noticed that Tinkerpop 3 is
> moving into the Apache family:
> http://wiki.apache.org/incubator/TinkerPopProposal  which might alleviate
> concerns about having an API definition "outside" of Spark.
>
> Thanks,
>
>


Re: Implementing TinkerPop on top of GraphX

2015-01-15 Thread David Robinson
I am new to Spark and GraphX; however, I use TinkerPop-backed graphs and
think using TinkerPop as the API for GraphX is a great idea, and I hope you
are still headed in that direction.  I noticed that TinkerPop 3 is moving
into the Apache family:
http://wiki.apache.org/incubator/TinkerPopProposal , which might alleviate
concerns about having an API definition "outside" of Spark.

Thanks,




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Implementing-TinkerPop-on-top-of-GraphX-tp9169p10126.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Implementing TinkerPop on top of GraphX

2014-11-20 Thread Kushal Datta
>>>>> [...] applicable
>>>>> to all GraphX users. I think it would be best run as a separate
>>>>> module/project that builds directly on top of GraphX.
>>>>>
>>>>> Kyle
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon <
>>>>> brennon.y...@capitalone.com> wrote:
>>>>>
>>>>>> My personal 2c is that, since GraphX is just beginning to provide a
>>>>>> full featured graph API, I think it would be better to align with the
>>>>>> TinkerPop group rather than roll our own. In my mind the benefits out way
>>>>>> the detriments as follows:
>>>>>>
>>>>>> Benefits:
>>>>>> * GraphX gains the ability to become another core tenant within the
>>>>>> TinkerPop community allowing a more diverse group of users into the Spark
>>>>>> ecosystem.
>>>>>> * TinkerPop can continue to maintain and own a solid / feature-rich
>>>>>> graph API that has already been accepted by a wide audience, relieving 
>>>>>> the
>>>>>> pressure of “one off” API additions from the GraphX team.
>>>>>> * GraphX can demonstrate its ability to be a key player in the
>>>>>> GraphDB space sitting inline with other major distributions (Neo4j, 
>>>>>> Titan,
>>>>>> etc.).
>>>>>> * Allows for the abstract graph traversal logic (query API) to be
>>>>>> owned and maintained by a group already proven on the topic.
>>>>>>
>>>>>> Drawbacks:
>>>>>> * GraphX doesn’t own the API for its graph query capability. This
>>>>>> could be seen as good or bad, but it might make GraphX-specific
>>>>>> implementation additions more tricky (possibly). Also, GraphX will need 
>>>>>> to
>>>>>> maintain the features described within the TinkerPop API as that might
>>>>>> change in the future.
>>>>>>
>>>>>> From: Kushal Datta 
>>>>>> Date: Thursday, November 6, 2014 at 4:00 PM
>>>>>> To: "York, Brennon" 
>>>>>> Cc: Kyle Ellrott , Reynold Xin <
>>>>>> r...@databricks.com>, "dev@spark.apache.org" ,
>>>>>> Matthias Broecheler 
>>>>>>
>>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>>
>>>>>> Before we dive into the implementation details, what are the high
>>>>>> level thoughts on Gremlin/GraphX? Scala already provides the procedural 
>>>>>> way
>>>>>> to query graphs in GraphX today. So, today I can run
>>>>>> g.vertices().filter().join() queries as OLAP in GraphX just like 
>>>>>> Tinkerpop3
>>>>>> Gremlin, of course sans the useful operators that Gremlin offers such as
>>>>>> outE, inE, loop, as, dedup, etc. In that case is mapping Gremlin 
>>>>>> operators
>>>>>> to GraphX api's a better approach or should we extend the existing set of
>>>>>> transformations/actions that GraphX already offers with the useful
>>>>>> operators from Gremlin? For example, we add as(), loop() and dedup()
>>>>>> methods in VertexRDD and EdgeRDD.
>>>>>>
>>>>>> Either way we get a desperately needed graph query interface in
>>>>>> GraphX.
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon <
>>>>>> brennon.y...@capitalone.com> wrote:
>>>>>>
>>>>>>> This was my thought exactly with the TinkerPop3 release. Looks like,
>>>>>>> to move this forward, we’d need to implement gremlin-core per <
>>>>>>> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>.
>>>>>>> The real question lies in whether GraphX can only support the OLTP
>>>>>>> functionality, or if we can bake into it the OLAP requirements as well. 
>>>>>>> At
>>>>>>> a first glance I believe we could create an entire OLAP system. If so, I
>>>>>>> believe we could do this in a set of parallel subtasks, those being the
>>>>>>> implementation of each of the individual API’s (Structure, Process, 
>>>>>>> 

Re: Implementing TinkerPop on top of GraphX

2014-11-18 Thread Kyle Ellrott
>>>>> TinkerPop community allowing a more diverse group of users into the Spark
>>>>> ecosystem.
>>>>> * TinkerPop can continue to maintain and own a solid / feature-rich
>>>>> graph API that has already been accepted by a wide audience, relieving the
>>>>> pressure of “one off” API additions from the GraphX team.
>>>>> * GraphX can demonstrate its ability to be a key player in the GraphDB
>>>>> space sitting inline with other major distributions (Neo4j, Titan, etc.).
>>>>> * Allows for the abstract graph traversal logic (query API) to be
>>>>> owned and maintained by a group already proven on the topic.
>>>>>
>>>>> Drawbacks:
>>>>> * GraphX doesn’t own the API for its graph query capability. This
>>>>> could be seen as good or bad, but it might make GraphX-specific
>>>>> implementation additions more tricky (possibly). Also, GraphX will need to
>>>>> maintain the features described within the TinkerPop API as that might
>>>>> change in the future.
>>>>>
>>>>> From: Kushal Datta 
>>>>> Date: Thursday, November 6, 2014 at 4:00 PM
>>>>> To: "York, Brennon" 
>>>>> Cc: Kyle Ellrott , Reynold Xin <
>>>>> r...@databricks.com>, "dev@spark.apache.org" ,
>>>>> Matthias Broecheler 
>>>>>
>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>
>>>>> Before we dive into the implementation details, what are the high
>>>>> level thoughts on Gremlin/GraphX? Scala already provides the procedural 
>>>>> way
>>>>> to query graphs in GraphX today. So, today I can run
>>>>> g.vertices().filter().join() queries as OLAP in GraphX just like 
>>>>> Tinkerpop3
>>>>> Gremlin, of course sans the useful operators that Gremlin offers such as
>>>>> outE, inE, loop, as, dedup, etc. In that case is mapping Gremlin operators
>>>>> to GraphX api's a better approach or should we extend the existing set of
>>>>> transformations/actions that GraphX already offers with the useful
>>>>> operators from Gremlin? For example, we add as(), loop() and dedup()
>>>>> methods in VertexRDD and EdgeRDD.
>>>>>
>>>>> Either way we get a desperately needed graph query interface in GraphX.
>>>>>
>>>>> On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon <
>>>>> brennon.y...@capitalone.com> wrote:
>>>>>
>>>>>> This was my thought exactly with the TinkerPop3 release. Looks like,
>>>>>> to move this forward, we’d need to implement gremlin-core per <
>>>>>> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>.
>>>>>> The real question lies in whether GraphX can only support the OLTP
>>>>>> functionality, or if we can bake into it the OLAP requirements as well. 
>>>>>> At
>>>>>> a first glance I believe we could create an entire OLAP system. If so, I
>>>>>> believe we could do this in a set of parallel subtasks, those being the
>>>>>> implementation of each of the individual API’s (Structure, Process, and, 
>>>>>> if
>>>>>> OLAP, GraphComputer) necessary for gremlin-core. Thoughts?
>>>>>>
>>>>>>
>>>>>> From: Kyle Ellrott 
>>>>>> Date: Thursday, November 6, 2014 at 12:10 PM
>>>>>> To: Kushal Datta 
>>>>>> Cc: Reynold Xin , "York, Brennon" <
>>>>>> brennon.y...@capitalone.com>, "dev@spark.apache.org" <
>>>>>> dev@spark.apache.org>, Matthias Broecheler <
>>>>>> matth...@thinkaurelius.com>
>>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>>
>>>>>> I still have to dig into the Tinkerpop3 internals (I started my work
>>>>>> long before it had been released), but I can say that to get the 
>>>>>> Tinerpop2
>>>>>> Gremlin pipeline to work in the GraphX was a bit of a hack. The
>>>>>> whole Tinkerpop2 Gremlin design was based around streaming pipes of
>>>>>> data, rather then large distributed map-reduce operations. I had to hack
>>>>>> the pipes to aggregate all of the data and pass a single objec

Re: Implementing TinkerPop on top of GraphX

2014-11-07 Thread Kushal Datta
I think if we are going to use GraphX as the query engine in TinkerPop3,
then the TinkerPop3 community is the right venue for further discussion.

The reason I asked about improving the GraphX APIs is that any graph DSL,
not only Gremlin, can exploit them. Cypher has some good subgraph-matching
query interfaces which I believe can be distributed using the GraphX APIs.

An edge ID is an internal attribute of the edge, generated automatically and
mostly hidden from the user. That's why adding it as an edge property might
not be a good idea. There are several little differences like this. E.g., in
the TinkerPop3 Gremlin implementation for Giraph, only vertex programs are
executed in Giraph directly; the side-effect operators are mapped to
MapReduce functions. In the implementation we are talking about, all of
these operations can be done within GraphX. I would be interested in
co-developing the query engine.
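
As a rough illustration of the "generated automatically, hidden from the
user" idea, here is a small sketch in plain Python rather than GraphX. It
assumes edges are simple (source id, destination id, attribute) triples and
keeps the generated ids in a side table, so the user's attribute is never
modified. The function name and the triple layout are hypothetical; in
Spark, something like `RDD.zipWithUniqueId` could presumably play a similar
role.

```python
from itertools import count
from typing import Any, Dict, Iterable, Tuple

# Hypothetical edge layout: (source vertex id, destination vertex id, attribute).
EdgeTriple = Tuple[int, int, Any]


def assign_edge_ids(edges: Iterable[EdgeTriple],
                    start: int = 0) -> Dict[int, EdgeTriple]:
    """Give every edge a generated id without touching its attribute.

    Ids are handed out in iteration order, so parallel edges between the
    same pair of vertices still receive distinct ids.
    """
    ids = count(start)
    return {next(ids): edge for edge in edges}
```

The user-supplied attribute comes back unchanged, which is the property
argued for above: the id lives beside the edge rather than inside it.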

@Reynold, I agree. And as I said earlier, the APIs should be designed in
such a way that they can be used by any graph DSL.

On Fri, Nov 7, 2014 at 10:59 AM, Kyle Ellrott  wrote:

> Who here would be interested in helping to work on an implementation of
> the Tikerpop3 Gremlin API for Spark? Is this something that should continue
> in the Spark discussion group, or should it migrate to the Gremlin message
> group?
>
> Reynold is right that there will be inherent mismatches in the APIs, and
> there will need to be some discussions with the GraphX group about the best
> way to go. One example would be edge ids. GraphX has vertex ids, but no
> explicit edges ids, while Gremlin has both. Edge ids could be put into the
> attr field, but then that means the user would have to explicitly subclass
> their edge attribute to the edge attribute interface. Is that worth doing,
> versus adding an id to everyones's edges?
>
> Kyle
>
>
> On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin  wrote:
>
>> Some form of graph querying support would be great to have. This can be a
>> great community project hosted outside of Spark initially, both due to the
>> maturity of the component itself as well as the maturity of query language
>> standards (there isn't really a dominant standard for graph ql).
>>
>> One thing is that GraphX API will need to evolve and probably need to
>> provide more primitives in order to support the new ql implementation.
>> There might also be inherent mismatches in the way the external API is
>> defined vs what GraphX can support. We should discuss those on a
>> case-by-case basis.
>>
>>
>> On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott 
>> wrote:
>>
>>> I think its best to look to existing standard rather then try to make
>>> your own. Of course small additions would need to be added to make it
>>> valuable for the Spark community, like a method similar to Gremlin's
>>> 'table' function, that produces an RDD instead.
>>> But there may be a lot of extra code and data structures that would need
>>> to be added to make it work, and those may not be directly applicable to
>>> all GraphX users. I think it would be best run as a separate module/project
>>> that builds directly on top of GraphX.
>>>
>>> Kyle
>>>
>>>
>>>
>>> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon <
>>> brennon.y...@capitalone.com> wrote:
>>>
>>>> My personal 2c is that, since GraphX is just beginning to provide a
>>>> full featured graph API, I think it would be better to align with the
>>>> TinkerPop group rather than roll our own. In my mind the benefits out way
>>>> the detriments as follows:
>>>>
>>>> Benefits:
>>>> * GraphX gains the ability to become another core tenant within the
>>>> TinkerPop community allowing a more diverse group of users into the Spark
>>>> ecosystem.
>>>> * TinkerPop can continue to maintain and own a solid / feature-rich
>>>> graph API that has already been accepted by a wide audience, relieving the
>>>> pressure of “one off” API additions from the GraphX team.
>>>> * GraphX can demonstrate its ability to be a key player in the GraphDB
>>>> space sitting inline with other major distributions (Neo4j, Titan, etc.).
>>>> * Allows for the abstract graph traversal logic (query API) to be owned
>>>> and maintained by a group already proven on the topic.
>>>>
>>>> Drawbacks:
>>>> * GraphX doesn’t own the API for its graph query capability. This could
>>>> be seen as good or bad, but it might make GraphX-specific implementation
>>>> addition

Re: Implementing TinkerPop on top of GraphX

2014-11-07 Thread York, Brennon
I’m definitely on board to help / take a portion of this work. I too am
wondering what the proper discussion venue should be moving forward, given
Reynold’s remarks on a community project hosted outside Spark. If I’m
understanding correctly, my take would be:

1. find a core group of developers to take on this work (Kyle, myself, ???)
2. build an initial implementation
3. iterate / discuss with the Spark community as we find discrepancies between
GraphX and the Gremlin3 APIs
4. contribute back to the Spark community when complete

Does that seem like a sound plan or am I way off base here? Itching to work on
this :)

From: Kyle Ellrott <kellr...@soe.ucsc.edu>
Date: Friday, November 7, 2014 at 10:59 AM
To: Reynold Xin <r...@databricks.com>
Cc: "York, Brennon" <brennon.y...@capitalone.com>, Kushal Datta
<kushal.da...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>,
Matthias Broecheler <matth...@thinkaurelius.com>
Subject: Re: Implementing TinkerPop on top of GraphX

Who here would be interested in helping to work on an implementation of the
TinkerPop3 Gremlin API for Spark? Is this something that should continue in the
Spark discussion group, or should it migrate to the Gremlin message group?

Reynold is right that there will be inherent mismatches in the APIs, and there
will need to be some discussions with the GraphX group about the best way to
go. One example would be edge ids. GraphX has vertex ids, but no explicit edge
ids, while Gremlin has both. Edge ids could be put into the attr field, but
that means users would have to explicitly derive their edge attribute type
from an edge attribute interface. Is that worth doing, versus adding an id to
everyone's edges?

Kyle


On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin <r...@databricks.com> wrote:
Some form of graph querying support would be great to have. This can be a great 
community project hosted outside of Spark initially, both due to the maturity 
of the component itself as well as the maturity of query language standards 
(there isn't really a dominant standard for graph ql).

One thing is that GraphX API will need to evolve and probably need to provide 
more primitives in order to support the new ql implementation. There might also 
be inherent mismatches in the way the external API is defined vs what GraphX 
can support. We should discuss those on a case-by-case basis.


On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott <kellr...@soe.ucsc.edu> wrote:
I think it's best to look to an existing standard rather than try to make your
own. Of course, small additions would be needed to make it valuable for the
Spark community, like a method similar to Gremlin's 'table' function that
produces an RDD instead.
But there may be a lot of extra code and data structures that would need to be
added to make it work, and those may not be directly applicable to all GraphX
users. I think it would be best run as a separate module/project that builds
directly on top of GraphX.

Kyle



On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon <brennon.y...@capitalone.com>
wrote:
My personal 2c is that, since GraphX is just beginning to provide a
full-featured graph API, I think it would be better to align with the
TinkerPop group rather than roll our own. In my mind the benefits outweigh
the detriments as follows:

Benefits:
* GraphX gains the ability to become another core tenant within the TinkerPop 
community allowing a more diverse group of users into the Spark ecosystem.
* TinkerPop can continue to maintain and own a solid / feature-rich graph API 
that has already been accepted by a wide audience, relieving the pressure of 
“one off” API additions from the GraphX team.
* GraphX can demonstrate its ability to be a key player in the GraphDB space 
sitting inline with other major distributions (Neo4j, Titan, etc.).
* Allows for the abstract graph traversal logic (query API) to be owned and 
maintained by a group already proven on the topic.

Drawbacks:
* GraphX doesn’t own the API for its graph query capability. This could be seen 
as good or bad, but it might make GraphX-specific implementation additions more 
tricky (possibly). Also, GraphX will need to maintain the features described 
within the TinkerPop API as that might change in the future.

From: Kushal Datta <kushal.da...@gmail.com>
Date: Thursday, November 6, 2014 at 4:00 PM
To: "York, Brennon" <brennon.y...@capitalone.com>
Cc: Kyle Ellrott <kellr...@soe.ucsc.edu>, Reynold Xin <r...@databricks.com>,
"dev@spark.apache.org" <dev@spark.apache.org>, Matthias Broecheler
<matth...@thinkaurelius.com>

Subject: Re: Implementing TinkerPop on top of Gra

Re: Implementing TinkerPop on top of GraphX

2014-11-07 Thread Kyle Ellrott
Who here would be interested in helping to work on an implementation of the
TinkerPop3 Gremlin API for Spark? Is this something that should continue in
the Spark discussion group, or should it migrate to the Gremlin message
group?

Reynold is right that there will be inherent mismatches in the APIs, and
there will need to be some discussions with the GraphX group about the best
way to go. One example would be edge ids. GraphX has vertex ids, but no
explicit edge ids, while Gremlin has both. Edge ids could be put into the
attr field, but that means users would have to explicitly derive their edge
attribute type from an edge attribute interface. Is that worth doing,
versus adding an id to everyone's edges?
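
The two options can be sketched side by side. This is plain Python, not
GraphX: `Edge` is a hypothetical stand-in for an edge that carries only a
source id, a destination id, and one user attribute. Option 1 requires the
user's attribute type to expose an id itself; option 2 wraps any attribute
in a generic holder that adds one. All names here are assumptions made for
illustration.

```python
from dataclasses import dataclass
from typing import Any, Generic, Protocol, TypeVar, runtime_checkable


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge with no id of its own, only a user attribute."""
    src_id: int
    dst_id: int
    attr: Any


# Option 1: the user's attribute type must expose an edge id itself.
@runtime_checkable
class HasEdgeId(Protocol):
    edge_id: int


@dataclass(frozen=True)
class Knows:
    """A user-defined attribute that opts in by carrying its own id."""
    edge_id: int
    weight: float


# Option 2: a generic wrapper adds an id to any attribute, unchanged.
A = TypeVar("A")


@dataclass(frozen=True)
class WithId(Generic[A]):
    edge_id: int
    value: A


def edge_id_of(edge: Edge) -> int:
    """Read the edge id from the attribute, whichever option supplied it."""
    if isinstance(edge.attr, HasEdgeId):
        return edge.attr.edge_id
    raise TypeError("edge attribute does not carry an edge id")
```

Option 1 pushes the burden onto every attribute type a user defines;
option 2 keeps user types untouched but changes the attribute type of
every edge in the graph, which is the trade-off raised above.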

Kyle


On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin  wrote:

> Some form of graph querying support would be great to have. This can be a
> great community project hosted outside of Spark initially, both due to the
> maturity of the component itself as well as the maturity of query language
> standards (there isn't really a dominant standard for graph ql).
>
> One thing is that GraphX API will need to evolve and probably need to
> provide more primitives in order to support the new ql implementation.
> There might also be inherent mismatches in the way the external API is
> defined vs what GraphX can support. We should discuss those on a
> case-by-case basis.
>
>
> On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott 
> wrote:
>
>> I think its best to look to existing standard rather then try to make
>> your own. Of course small additions would need to be added to make it
>> valuable for the Spark community, like a method similar to Gremlin's
>> 'table' function, that produces an RDD instead.
>> But there may be a lot of extra code and data structures that would need
>> to be added to make it work, and those may not be directly applicable to
>> all GraphX users. I think it would be best run as a separate module/project
>> that builds directly on top of GraphX.
>>
>> Kyle
>>
>>
>>
>> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon <
>> brennon.y...@capitalone.com> wrote:
>>
>>> My personal 2c is that, since GraphX is just beginning to provide a full
>>> featured graph API, I think it would be better to align with the TinkerPop
>>> group rather than roll our own. In my mind the benefits out way the
>>> detriments as follows:
>>>
>>> Benefits:
>>> * GraphX gains the ability to become another core tenant within the
>>> TinkerPop community allowing a more diverse group of users into the Spark
>>> ecosystem.
>>> * TinkerPop can continue to maintain and own a solid / feature-rich
>>> graph API that has already been accepted by a wide audience, relieving the
>>> pressure of “one off” API additions from the GraphX team.
>>> * GraphX can demonstrate its ability to be a key player in the GraphDB
>>> space sitting inline with other major distributions (Neo4j, Titan, etc.).
>>> * Allows for the abstract graph traversal logic (query API) to be owned
>>> and maintained by a group already proven on the topic.
>>>
>>> Drawbacks:
>>> * GraphX doesn’t own the API for its graph query capability. This could
>>> be seen as good or bad, but it might make GraphX-specific implementation
>>> additions more tricky (possibly). Also, GraphX will need to maintain the
>>> features described within the TinkerPop API as that might change in the
>>> future.
>>>
>>> From: Kushal Datta 
>>> Date: Thursday, November 6, 2014 at 4:00 PM
>>> To: "York, Brennon" 
>>> Cc: Kyle Ellrott , Reynold Xin <
>>> r...@databricks.com>, "dev@spark.apache.org" ,
>>> Matthias Broecheler 
>>>
>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>
>>> Before we dive into the implementation details, what are the high level
>>> thoughts on Gremlin/GraphX? Scala already provides the procedural way to
>>> query graphs in GraphX today. So, today I can run
>>> g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3
>>> Gremlin, of course sans the useful operators that Gremlin offers such as
>>> outE, inE, loop, as, dedup, etc. In that case is mapping Gremlin operators
>>> to GraphX api's a better approach or should we extend the existing set of
>>> transformations/actions that GraphX already offers with the useful
>>> operators from Gremlin? For example, we add as(), loop() and dedup()
>>> methods in VertexRDD and EdgeRDD.
>>>
>>> Either way

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Reynold Xin
Some form of graph querying support would be great to have. This can be a
great community project hosted outside of Spark initially, both due to the
maturity of the component itself as well as the maturity of query language
standards (there isn't really a dominant standard for graph ql).

One thing is that the GraphX API will need to evolve and will probably need
to provide more primitives in order to support the new ql implementation.
There might also be inherent mismatches in the way the external API is
defined vs what GraphX can support. We should discuss those on a
case-by-case basis.


On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott  wrote:

> I think its best to look to existing standard rather then try to make your
> own. Of course small additions would need to be added to make it valuable
> for the Spark community, like a method similar to Gremlin's 'table'
> function, that produces an RDD instead.
> But there may be a lot of extra code and data structures that would need
> to be added to make it work, and those may not be directly applicable to
> all GraphX users. I think it would be best run as a separate module/project
> that builds directly on top of GraphX.
>
> Kyle
>
>
>
> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon wrote:
>
>> My personal 2c is that, since GraphX is just beginning to provide a full
>> featured graph API, I think it would be better to align with the TinkerPop
>> group rather than roll our own. In my mind the benefits out way the
>> detriments as follows:
>>
>> Benefits:
>> * GraphX gains the ability to become another core tenant within the
>> TinkerPop community allowing a more diverse group of users into the Spark
>> ecosystem.
>> * TinkerPop can continue to maintain and own a solid / feature-rich graph
>> API that has already been accepted by a wide audience, relieving the
>> pressure of “one off” API additions from the GraphX team.
>> * GraphX can demonstrate its ability to be a key player in the GraphDB
>> space sitting inline with other major distributions (Neo4j, Titan, etc.).
>> * Allows for the abstract graph traversal logic (query API) to be owned
>> and maintained by a group already proven on the topic.
>>
>> Drawbacks:
>> * GraphX doesn’t own the API for its graph query capability. This could
>> be seen as good or bad, but it might make GraphX-specific implementation
>> additions more tricky (possibly). Also, GraphX will need to maintain the
>> features described within the TinkerPop API as that might change in the
>> future.
>>
>> From: Kushal Datta 
>> Date: Thursday, November 6, 2014 at 4:00 PM
>> To: "York, Brennon" 
>> Cc: Kyle Ellrott , Reynold Xin <
>> r...@databricks.com>, "dev@spark.apache.org" ,
>> Matthias Broecheler 
>>
>> Subject: Re: Implementing TinkerPop on top of GraphX
>>
>> Before we dive into the implementation details, what are the high level
>> thoughts on Gremlin/GraphX? Scala already provides the procedural way to
>> query graphs in GraphX today. So, today I can run
>> g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3
>> Gremlin, of course sans the useful operators that Gremlin offers such as
>> outE, inE, loop, as, dedup, etc. In that case is mapping Gremlin operators
>> to GraphX api's a better approach or should we extend the existing set of
>> transformations/actions that GraphX already offers with the useful
>> operators from Gremlin? For example, we add as(), loop() and dedup()
>> methods in VertexRDD and EdgeRDD.
>>
>> Either way we get a desperately needed graph query interface in GraphX.
>>
>> On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon <
>> brennon.y...@capitalone.com> wrote:
>>
>>> This was my thought exactly with the TinkerPop3 release. Looks like, to
>>> move this forward, we’d need to implement gremlin-core per <
>>> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>.
>>> The real question lies in whether GraphX can only support the OLTP
>>> functionality, or if we can bake into it the OLAP requirements as well. At
>>> a first glance I believe we could create an entire OLAP system. If so, I
>>> believe we could do this in a set of parallel subtasks, those being the
>>> implementation of each of the individual API’s (Structure, Process, and, if
>>> OLAP, GraphComputer) necessary for gremlin-core. Thoughts?
>>>
>>>
>>> From: Kyle Ellrott 
>>> Date: Thursday, November 6, 2014 at 12:10 PM
>>> To: Kushal Datta 
>>> Cc: Reynold Xin , "York, Brennon" <

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kyle Ellrott
I think it's best to look to an existing standard rather than try to make your
own. Of course, small additions would be needed to make it valuable
for the Spark community, like a method similar to Gremlin's 'table'
function that produces an RDD instead.
But there may be a lot of extra code and data structures that would need to
be added to make it work, and those may not be directly applicable to all
GraphX users. I think it would be best run as a separate module/project
that builds directly on top of GraphX.

Kyle



On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon 
wrote:

> My personal 2c is that, since GraphX is just beginning to provide a full
> featured graph API, I think it would be better to align with the TinkerPop
> group rather than roll our own. In my mind the benefits out way the
> detriments as follows:
>
> Benefits:
> * GraphX gains the ability to become another core tenant within the
> TinkerPop community allowing a more diverse group of users into the Spark
> ecosystem.
> * TinkerPop can continue to maintain and own a solid / feature-rich graph
> API that has already been accepted by a wide audience, relieving the
> pressure of “one off” API additions from the GraphX team.
> * GraphX can demonstrate its ability to be a key player in the GraphDB
> space sitting inline with other major distributions (Neo4j, Titan, etc.).
> * Allows for the abstract graph traversal logic (query API) to be owned
> and maintained by a group already proven on the topic.
>
> Drawbacks:
> * GraphX doesn’t own the API for its graph query capability. This could be
> seen as good or bad, but it might make GraphX-specific implementation
> additions more tricky (possibly). Also, GraphX will need to maintain the
> features described within the TinkerPop API as that might change in the
> future.
>
> From: Kushal Datta 
> Date: Thursday, November 6, 2014 at 4:00 PM
> To: "York, Brennon" 
> Cc: Kyle Ellrott , Reynold Xin ,
> "dev@spark.apache.org" , Matthias Broecheler <
> matth...@thinkaurelius.com>
>
> Subject: Re: Implementing TinkerPop on top of GraphX
>
> Before we dive into the implementation details, what are the high level
> thoughts on Gremlin/GraphX? Scala already provides the procedural way to
> query graphs in GraphX today. So, today I can run
> g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3
> Gremlin, of course sans the useful operators that Gremlin offers such as
> outE, inE, loop, as, dedup, etc. In that case is mapping Gremlin operators
> to GraphX api's a better approach or should we extend the existing set of
> transformations/actions that GraphX already offers with the useful
> operators from Gremlin? For example, we add as(), loop() and dedup()
> methods in VertexRDD and EdgeRDD.
>
> Either way we get a desperately needed graph query interface in GraphX.
>
> On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon wrote:
>
>> This was my thought exactly with the TinkerPop3 release. Looks like, to
>> move this forward, we’d need to implement gremlin-core per <
>> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>. The
>> real question lies in whether GraphX can only support the OLTP
>> functionality, or if we can bake into it the OLAP requirements as well. At
>> a first glance I believe we could create an entire OLAP system. If so, I
>> believe we could do this in a set of parallel subtasks, those being the
>> implementation of each of the individual API’s (Structure, Process, and, if
>> OLAP, GraphComputer) necessary for gremlin-core. Thoughts?
>>
>>
>> From: Kyle Ellrott 
>> Date: Thursday, November 6, 2014 at 12:10 PM
>> To: Kushal Datta 
>> Cc: Reynold Xin , "York, Brennon" <
>> brennon.y...@capitalone.com>, "dev@spark.apache.org" <
>> dev@spark.apache.org>, Matthias Broecheler 
>> Subject: Re: Implementing TinkerPop on top of GraphX
>>
>> I still have to dig into the Tinkerpop3 internals (I started my work long
>> before it had been released), but I can say that to get the Tinerpop2
>> Gremlin pipeline to work in the GraphX was a bit of a hack. The
>> whole Tinkerpop2 Gremlin design was based around streaming pipes of
>> data, rather then large distributed map-reduce operations. I had to hack
>> the pipes to aggregate all of the data and pass a single object wrapping
>> the GraphX RDDs down the pipes in a single go, rather then streaming it
>> element by element.
>> Just based on their description, Tinkerpop3 may be more amenable to the
>> Spark platform.
>>
>> Kyle
>>
>>
>> On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta 
>>

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread York, Brennon
My personal 2c is that, since GraphX is just beginning to provide a
full-featured graph API, it would be better to align with the TinkerPop
group rather than roll our own. In my mind the benefits outweigh the detriments
as follows:

Benefits:
* GraphX gains the ability to become another core tenant within the TinkerPop 
community allowing a more diverse group of users into the Spark ecosystem.
* TinkerPop can continue to maintain and own a solid / feature-rich graph API 
that has already been accepted by a wide audience, relieving the pressure of 
“one off” API additions from the GraphX team.
* GraphX can demonstrate its ability to be a key player in the GraphDB space,
sitting in line with other major distributions (Neo4j, Titan, etc.).
* Allows for the abstract graph traversal logic (query API) to be owned and 
maintained by a group already proven on the topic.

Drawbacks:
* GraphX doesn’t own the API for its graph query capability. This could be seen 
as good or bad, but it might make GraphX-specific implementation additions more 
tricky (possibly). Also, GraphX will need to maintain the features described 
within the TinkerPop API as that might change in the future.

From: Kushal Datta <kushal.da...@gmail.com>
Date: Thursday, November 6, 2014 at 4:00 PM
To: "York, Brennon" <brennon.y...@capitalone.com>
Cc: Kyle Ellrott <kellr...@soe.ucsc.edu>, Reynold Xin <r...@databricks.com>,
"dev@spark.apache.org" <dev@spark.apache.org>,
Matthias Broecheler <matth...@thinkaurelius.com>
Subject: Re: Implementing TinkerPop on top of GraphX

Before we dive into the implementation details, what are the high level 
thoughts on Gremlin/GraphX? Scala already provides the procedural way to query 
graphs in GraphX today. So, today I can run g.vertices().filter().join() 
queries as OLAP in GraphX just like Tinkerpop3 Gremlin, of course sans the 
useful operators that Gremlin offers such as outE, inE, loop, as, dedup, etc. 
In that case is mapping Gremlin operators to GraphX api's a better approach or 
should we extend the existing set of transformations/actions that GraphX 
already offers with the useful operators from Gremlin? For example, we add 
as(), loop() and dedup() methods in VertexRDD and EdgeRDD.

Either way we get a desperately needed graph query interface in GraphX.

On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon
<brennon.y...@capitalone.com> wrote:
This was my thought exactly with the TinkerPop3 release. Looks like, to move 
this forward, we’d need to implement gremlin-core per 
<http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>. The real 
question lies in whether GraphX can only support the OLTP functionality, or if 
we can bake into it the OLAP requirements as well. At a first glance I believe 
we could create an entire OLAP system. If so, I believe we could do this in a 
set of parallel subtasks, those being the implementation of each of the 
individual API’s (Structure, Process, and, if OLAP, GraphComputer) necessary 
for gremlin-core. Thoughts?


From: Kyle Ellrott <kellr...@soe.ucsc.edu>
Date: Thursday, November 6, 2014 at 12:10 PM
To: Kushal Datta <kushal.da...@gmail.com>
Cc: Reynold Xin <r...@databricks.com>,
"York, Brennon" <brennon.y...@capitalone.com>,
"dev@spark.apache.org" <dev@spark.apache.org>,
Matthias Broecheler <matth...@thinkaurelius.com>
Subject: Re: Implementing TinkerPop on top of GraphX

I still have to dig into the Tinkerpop3 internals (I started my work long 
before it had been released), but I can say that to get the Tinerpop2 Gremlin 
pipeline to work in the GraphX was a bit of a hack. The whole Tinkerpop2 
Gremlin design was based around streaming pipes of data, rather then large 
distributed map-reduce operations. I had to hack the pipes to aggregate all of 
the data and pass a single object wrapping the GraphX RDDs down the pipes in a 
single go, rather then streaming it element by element.
Just based on their description, Tinkerpop3 may be more amenable to the Spark 
platform.

Kyle


On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta
<kushal.da...@gmail.com> wrote:
What do you guys think about the Tinkerpop3 Gremlin interface?
It has MapReduce to run Gremlin operators in a distributed manner and Giraph to 
execute vertex programs.

The Tinkpop3 is better suited for GraphX.

On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott
<kellr...@soe.ucsc.edu> wrote:
I've taken a crack at implementing the TinkerPop Blueprints API in GraphX (
https://github.com/kellrott/sparkgraph ). I've also implemented portions of
the Gremlin Search Language and a Parquet based graph store.
I've been working out finalize some code details and putting tog

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kushal Datta
Before we dive into the implementation details, what are the high-level
thoughts on Gremlin/GraphX? Scala already provides a procedural way to
query graphs in GraphX today. So, today I can run
g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3
Gremlin, of course sans the useful operators that Gremlin offers such as
outE, inE, loop, as, dedup, etc. In that case, is mapping Gremlin operators
to GraphX APIs the better approach, or should we extend the existing set of
transformations/actions that GraphX already offers with the useful
operators from Gremlin? For example, we could add as(), loop() and dedup()
methods to VertexRDD and EdgeRDD.
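
To make the second option a bit more concrete, here is a minimal sketch in
plain Python (no Spark involved) of what Gremlin-style steps bolted onto a
vertex collection might look like. Everything here is an assumption for
illustration only: the Traversal class, its fields and the out/filter/dedup/loop
method names are hypothetical and are not the GraphX or TinkerPop API; a tuple
of vertex ids stands in for a VertexRDD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traversal:
    """Hypothetical stand-in for a vertex collection with Gremlin-style steps."""
    edges: tuple    # (src, dst) pairs describing the whole graph
    current: tuple  # vertex ids the traversal currently sits on

    def out(self):
        # Follow outgoing edges from every current vertex (duplicates are kept).
        nxt = tuple(dst for v in self.current
                    for (src, dst) in self.edges if src == v)
        return Traversal(self.edges, nxt)

    def filter(self, pred):
        # Keep only the current vertices that satisfy the predicate.
        return Traversal(self.edges, tuple(v for v in self.current if pred(v)))

    def dedup(self):
        # Keep the first occurrence of each vertex, preserving order.
        return Traversal(self.edges, tuple(dict.fromkeys(self.current)))

    def loop(self, step, times):
        # Apply the same step function a fixed number of times.
        t = self
        for _ in range(times):
            t = step(t)
        return t


edges = ((1, 2), (1, 3), (2, 4), (3, 4))
start = Traversal(edges, (1,))
two_hops = start.loop(lambda t: t.out(), 2)  # vertex 4 is reached via 2 and via 3
unique_two_hops = two_hops.dedup()
```

The point of the sketch is only that each step returns a new collection, so
steps chain the same way RDD transformations do; how as() would carry named
path state across a distributed collection is left open here.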

Either way we get a desperately needed graph query interface in GraphX.

On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon 
wrote:

> This was my thought exactly with the TinkerPop3 release. Looks like, to
> move this forward, we’d need to implement gremlin-core per <
> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>. The
> real question lies in whether GraphX can only support the OLTP
> functionality, or if we can bake into it the OLAP requirements as well. At
> a first glance I believe we could create an entire OLAP system. If so, I
> believe we could do this in a set of parallel subtasks, those being the
> implementation of each of the individual API’s (Structure, Process, and, if
> OLAP, GraphComputer) necessary for gremlin-core. Thoughts?
>
>
> From: Kyle Ellrott 
> Date: Thursday, November 6, 2014 at 12:10 PM
> To: Kushal Datta 
> Cc: Reynold Xin , "York, Brennon" <
> brennon.y...@capitalone.com>, "dev@spark.apache.org" ,
> Matthias Broecheler 
> Subject: Re: Implementing TinkerPop on top of GraphX
>
> I still have to dig into the Tinkerpop3 internals (I started my work long
> before it had been released), but I can say that to get the Tinerpop2
> Gremlin pipeline to work in the GraphX was a bit of a hack. The
> whole Tinkerpop2 Gremlin design was based around streaming pipes of
> data, rather then large distributed map-reduce operations. I had to hack
> the pipes to aggregate all of the data and pass a single object wrapping
> the GraphX RDDs down the pipes in a single go, rather then streaming it
> element by element.
> Just based on their description, Tinkerpop3 may be more amenable to the
> Spark platform.
>
> Kyle
>
>
> On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta 
> wrote:
>
>> What do you guys think about the Tinkerpop3 Gremlin interface?
>> It has MapReduce to run Gremlin operators in a distributed manner and
>> Giraph to execute vertex programs.
>>
>> The Tinkpop3 is better suited for GraphX.
>>
>> On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott 
>> wrote:
>>
>>> I've taken a crack at implementing the TinkerPop Blueprints API in
>>> GraphX (
>>> https://github.com/kellrott/sparkgraph ). I've also implemented
>>> portions of
>>> the Gremlin Search Language and a Parquet based graph store.
>>> I've been working out finalize some code details and putting together
>>> better code examples and documentation before I started telling people
>>> about it.
>>> But if you want to start looking at the code, I can answer any questions
>>> you have. And if you would like to contribute, I would really appreciate
>>> the help.
>>>
>>> Kyle
>>>
>>>
>>> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin 
>>> wrote:
>>>
>>> > cc Matthias
>>> >
>>> > In the past we talked with Matthias and there were some discussions
>>> about
>>> > this.
>>> >
>>> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
>>> > brennon.y...@capitalone.com>
>>> > wrote:
>>> >
>>> > > All, was wondering if there had been any discussion around this topic
>>> > yet?
>>> > > TinkerPop <https://github.com/tinkerpop> is a great abstraction for
>>> > graph
>>> > > databases and has been implemented across various graph database
>>> backends
>>> > > / gaining traction. Has anyone thought about integrating the
>>> TinkerPop
>>> > > framework with GraphX to enable GraphX as another backend? Not sure
>>> if
>>> > > this has been brought up or not, but would certainly volunteer to
>>> > > spearhead this effort if the community thinks it to be a good idea!
>>> > >
>>> > > As an aside, wasn't sure if this discussion should happen on the
>>> board
>>> > > here or on JIRA, bu

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kyle Ellrott
I think I've already done most of the work for the OLTP objects (Graph,
Element, Vertex, Edge, Properties) when implementing Tinkerpop2. Singleton
write operations, like addVertex/deleteEdge, were cached locally until a
read operation was requested; then the set of build operations was
parallelized into an RDD and merged with the existing graph.
It's not efficient for large numbers of operations, but it passes unit tests
and works for small graph tweaking.
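
The write-caching scheme described above can be sketched roughly as follows.
This is plain Python, not the actual sparkgraph code: the BufferedGraph class,
its method names and the merges counter are all hypothetical, and an in-memory
dict and set stand in for the distributed graph that the real implementation
would merge an RDD into.

```python
class BufferedGraph:
    """Hypothetical sketch: buffer single writes, merge them in one batch on read."""

    def __init__(self):
        self.vertices = {}   # vertex id -> properties (stand-in for the real graph)
        self.edges = set()   # (src, dst) pairs
        self._pending = []   # locally cached write operations, in arrival order
        self.merges = 0      # number of batch merges performed so far

    def add_vertex(self, vid, **props):
        self._pending.append(("add_vertex", vid, props))

    def add_edge(self, src, dst):
        self._pending.append(("add_edge", src, dst))

    def remove_edge(self, src, dst):
        self._pending.append(("remove_edge", src, dst))

    def _flush(self):
        # Apply every buffered operation in order as one batch; skip if empty.
        if not self._pending:
            return
        for op, a, b in self._pending:
            if op == "add_vertex":
                self.vertices[a] = b
            elif op == "add_edge":
                self.edges.add((a, b))
            elif op == "remove_edge":
                self.edges.discard((a, b))
        self._pending.clear()
        self.merges += 1

    def vertex_count(self):
        self._flush()  # any read forces pending writes to be merged first
        return len(self.vertices)

    def edge_list(self):
        self._flush()
        return sorted(self.edges)


g = BufferedGraph()
g.add_vertex(1, name="a")
g.add_vertex(2, name="b")
g.add_edge(1, 2)
g.remove_edge(1, 2)
g.add_edge(2, 1)
merges_before_read = g.merges
count = g.vertex_count()
edges_after = g.edge_list()
```

Five writes cost a single merge here, which is why the approach is fine for
small tweaks, while a workload that alternates writes and reads would pay for
a merge on every read.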

The OLAP stuff looks completely new, but considering they have a Giraph
implementation, it should be pretty straightforward.

Kyle


On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon 
wrote:

> This was my thought exactly with the TinkerPop3 release. Looks like, to
> move this forward, we’d need to implement gremlin-core per <
> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>. The
> real question lies in whether GraphX can only support the OLTP
> functionality, or if we can bake into it the OLAP requirements as well. At
> a first glance I believe we could create an entire OLAP system. If so, I
> believe we could do this in a set of parallel subtasks, those being the
> implementation of each of the individual API’s (Structure, Process, and, if
> OLAP, GraphComputer) necessary for gremlin-core. Thoughts?
>
>
> From: Kyle Ellrott 
> Date: Thursday, November 6, 2014 at 12:10 PM
> To: Kushal Datta 
> Cc: Reynold Xin , "York, Brennon" <
> brennon.y...@capitalone.com>, "dev@spark.apache.org" ,
> Matthias Broecheler 
> Subject: Re: Implementing TinkerPop on top of GraphX
>
> I still have to dig into the Tinkerpop3 internals (I started my work long
> before it had been released), but I can say that to get the Tinerpop2
> Gremlin pipeline to work in the GraphX was a bit of a hack. The
> whole Tinkerpop2 Gremlin design was based around streaming pipes of
> data, rather then large distributed map-reduce operations. I had to hack
> the pipes to aggregate all of the data and pass a single object wrapping
> the GraphX RDDs down the pipes in a single go, rather then streaming it
> element by element.
> Just based on their description, Tinkerpop3 may be more amenable to the
> Spark platform.
>
> Kyle
>
>
> On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta 
> wrote:
>
>> What do you guys think about the Tinkerpop3 Gremlin interface?
>> It has MapReduce to run Gremlin operators in a distributed manner and
>> Giraph to execute vertex programs.
>>
>> The Tinkpop3 is better suited for GraphX.
>>
>> On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott 
>> wrote:
>>
>>> I've taken a crack at implementing the TinkerPop Blueprints API in
>>> GraphX (
>>> https://github.com/kellrott/sparkgraph ). I've also implemented
>>> portions of
>>> the Gremlin Search Language and a Parquet based graph store.
>>> I've been working out finalize some code details and putting together
>>> better code examples and documentation before I started telling people
>>> about it.
>>> But if you want to start looking at the code, I can answer any questions
>>> you have. And if you would like to contribute, I would really appreciate
>>> the help.
>>>
>>> Kyle
>>>
>>>
>>> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin 
>>> wrote:
>>>
>>> > cc Matthias
>>> >
>>> > In the past we talked with Matthias and there were some discussions
>>> about
>>> > this.
>>> >
>>> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
>>> > brennon.y...@capitalone.com>
>>> > wrote:
>>> >
>>> > > All, was wondering if there had been any discussion around this topic
>>> > yet?
>>> > > TinkerPop <https://github.com/tinkerpop> is a great abstraction for
>>> > graph
>>> > > databases and has been implemented across various graph database
>>> backends
>>> > > / gaining traction. Has anyone thought about integrating the
>>> TinkerPop
>>> > > framework with GraphX to enable GraphX as another backend? Not sure
>>> if
>>> > > this has been brought up or not, but would certainly volunteer to
>>> > > spearhead this effort if the community thinks it to be a good idea!
>>> > >
>>> > > As an aside, wasn't sure if this discussion should happen on the
>>> board
>>> > > here or on JIRA, but I made a ticket as well for reference:
>>> > > https://issues.apache.org/jira/browse/SPARK-4279
>>> > >
>>>

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread York, Brennon
This was my thought exactly with the TinkerPop3 release. Looks like, to move
this forward, we’d need to implement gremlin-core per
<http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>. The real
question lies in whether GraphX can only support the OLTP functionality, or if
we can bake the OLAP requirements into it as well. At first glance I believe
we could create an entire OLAP system. If so, I believe we could do this in a
set of parallel subtasks, those being the implementation of each of the
individual APIs (Structure, Process, and, if OLAP, GraphComputer) necessary
for gremlin-core. Thoughts?


From: Kyle Ellrott <kellr...@soe.ucsc.edu>
Date: Thursday, November 6, 2014 at 12:10 PM
To: Kushal Datta <kushal.da...@gmail.com>
Cc: Reynold Xin <r...@databricks.com>,
"York, Brennon" <brennon.y...@capitalone.com>,
"dev@spark.apache.org" <dev@spark.apache.org>,
Matthias Broecheler <matth...@thinkaurelius.com>
Subject: Re: Implementing TinkerPop on top of GraphX

I still have to dig into the Tinkerpop3 internals (I started my work long 
before it had been released), but I can say that to get the Tinerpop2 Gremlin 
pipeline to work in the GraphX was a bit of a hack. The whole Tinkerpop2 
Gremlin design was based around streaming pipes of data, rather then large 
distributed map-reduce operations. I had to hack the pipes to aggregate all of 
the data and pass a single object wrapping the GraphX RDDs down the pipes in a 
single go, rather then streaming it element by element.
Just based on their description, Tinkerpop3 may be more amenable to the Spark 
platform.

Kyle


On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta
<kushal.da...@gmail.com> wrote:
What do you guys think about the Tinkerpop3 Gremlin interface?
It has MapReduce to run Gremlin operators in a distributed manner and Giraph to 
execute vertex programs.

The Tinkpop3 is better suited for GraphX.

On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott
<kellr...@soe.ucsc.edu> wrote:
I've taken a crack at implementing the TinkerPop Blueprints API in GraphX (
https://github.com/kellrott/sparkgraph ). I've also implemented portions of
the Gremlin Search Language and a Parquet based graph store.
I've been working out finalize some code details and putting together
better code examples and documentation before I started telling people
about it.
But if you want to start looking at the code, I can answer any questions
you have. And if you would like to contribute, I would really appreciate
the help.

Kyle


On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin
<r...@databricks.com> wrote:

> cc Matthias
>
> In the past we talked with Matthias and there were some discussions about
> this.
>
> On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
> brennon.y...@capitalone.com>
> wrote:
>
> > All, was wondering if there had been any discussion around this topic
> yet?
> > TinkerPop <https://github.com/tinkerpop> is a great abstraction for
> graph
> > databases and has been implemented across various graph database backends
> > / gaining traction. Has anyone thought about integrating the TinkerPop
> > framework with GraphX to enable GraphX as another backend? Not sure if
> > this has been brought up or not, but would certainly volunteer to
> > spearhead this effort if the community thinks it to be a good idea!
> >
> > As an aside, wasn't sure if this discussion should happen on the board
> > here or on JIRA, but I made a ticket as well for reference:
> > https://issues.apache.org/jira/browse/SPARK-4279
> >
> > 
> >
> > The information contained in this e-mail is confidential and/or
> > proprietary to Capital One and/or its affiliates. The information
> > transmitted herewith is intended only for use by the individual or entity
> > to which it is addressed.  If the reader of this message is not the
> > intended recipient, you are hereby notified that any review,
> > retransmission, dissemination, distribution, copying or other use of, or
> > taking of any action in reliance upon this information is strictly
> > prohibited. If you have received this communication in error, please
> > contact the sender and delete the material from your computer.
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kyle Ellrott
I still have to dig into the Tinkerpop3 internals (I started my work long
before it had been released), but I can say that getting the Tinkerpop2
Gremlin pipeline to work in GraphX was a bit of a hack. The
whole Tinkerpop2 Gremlin design was based around streaming pipes of
data, rather than large distributed map-reduce operations. I had to hack
the pipes to aggregate all of the data and pass a single object wrapping
the GraphX RDDs down the pipes in a single go, rather than streaming it
element by element.
Just based on their description, Tinkerpop3 may be more amenable to the
Spark platform.
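
The difference between the two execution styles can be illustrated with a
small plain-Python sketch. It is only a model under stated assumptions: the
Bulk wrapper, the streaming/bulk function names and the toy adjacency map are
hypothetical, a Python list stands in for the GraphX RDDs, and none of this is
the Tinkerpop2 pipes API.

```python
from dataclasses import dataclass


@dataclass
class Bulk:
    """Hypothetical wrapper: one object carrying the whole collection down the pipes."""
    rows: list


def streaming(elements, pipes):
    # Streaming style: each element is pushed through every pipe on its own.
    for e in elements:
        out = [e]
        for pipe in pipes:
            out = [r for x in out for r in pipe(x)]
        yield from out


def bulk(elements, pipes):
    # Bulk style: a single wrapper object is handed to each pipe in turn, so
    # every pipe sees (and could distribute work over) the full collection.
    wrapped = Bulk(list(elements))
    for pipe in pipes:
        wrapped = Bulk([r for x in wrapped.rows for r in pipe(x)])
    return wrapped.rows


adj = {1: [2, 3], 2: [4], 3: [4], 4: []}


def out_step(v):
    # Emit the out-neighbours of a vertex.
    return adj[v]


def even_only(v):
    # Pass a vertex through only if its id is even.
    return [v] if v % 2 == 0 else []


streamed = list(streaming([1, 2], [out_step, even_only]))
batched = bulk([1, 2], [out_step, even_only])
```

Both styles produce the same result on this toy input; the bulk form is the
one that maps naturally onto whole-collection transformations, at the cost of
giving up element-at-a-time laziness.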

Kyle


On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta 
wrote:

> What do you guys think about the Tinkerpop3 Gremlin interface?
> It has MapReduce to run Gremlin operators in a distributed manner and
> Giraph to execute vertex programs.
>
> The Tinkpop3 is better suited for GraphX.
>
> On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott 
> wrote:
>
>> I've taken a crack at implementing the TinkerPop Blueprints API in GraphX
>> (
>> https://github.com/kellrott/sparkgraph ). I've also implemented portions
>> of
>> the Gremlin Search Language and a Parquet based graph store.
>> I've been working out finalize some code details and putting together
>> better code examples and documentation before I started telling people
>> about it.
>> But if you want to start looking at the code, I can answer any questions
>> you have. And if you would like to contribute, I would really appreciate
>> the help.
>>
>> Kyle
>>
>>
>> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin  wrote:
>>
>> > cc Matthias
>> >
>> > In the past we talked with Matthias and there were some discussions
>> about
>> > this.
>> >
>> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
>> > brennon.y...@capitalone.com>
>> > wrote:
>> >
>> > > All, was wondering if there had been any discussion around this topic
>> > yet?
>> > > TinkerPop <https://github.com/tinkerpop> is a great abstraction for
>> > graph
>> > > databases and has been implemented across various graph database
>> backends
>> > > / gaining traction. Has anyone thought about integrating the TinkerPop
>> > > framework with GraphX to enable GraphX as another backend? Not sure if
>> > > this has been brought up or not, but would certainly volunteer to
>> > > spearhead this effort if the community thinks it to be a good idea!
>> > >
>> > > As an aside, wasn't sure if this discussion should happen on the board
>> > > here or on JIRA, but I made a ticket as well for reference:
>> > > https://issues.apache.org/jira/browse/SPARK-4279
>> > >
>> >
>>
>
>


Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread andy petrella
Great stuff!
I've got some thoughts about that, and I was wondering whether it would be
interesting to first have something like the following, as for spark-core
(let's say):
0/ Core API offering basic (or advanced → HeLP) primitives
1/ catalyst optimizer for a text-based system (SPARQL, Cypher, custom SQL3,
whatnot)
2/ adequate DSL layer on top (à la LINQ)

my2¢


aℕdy ℙetrella
about.me/noootsab



On Thu, Nov 6, 2014 at 8:48 PM, Kyle Ellrott  wrote:

> I've taken a crack at implementing the TinkerPop Blueprints API in GraphX (
> https://github.com/kellrott/sparkgraph ). I've also implemented portions
> of
> the Gremlin Search Language and a Parquet based graph store.
> I've been working out finalize some code details and putting together
> better code examples and documentation before I started telling people
> about it.
> But if you want to start looking at the code, I can answer any questions
> you have. And if you would like to contribute, I would really appreciate
> the help.
>
> Kyle
>
>
> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin  wrote:
>
> > cc Matthias
> >
> > In the past we talked with Matthias and there were some discussions about
> > this.
> >
> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
> > brennon.y...@capitalone.com>
> > wrote:
> >
> > > All, was wondering if there had been any discussion around this topic
> > yet?
> > > TinkerPop <https://github.com/tinkerpop> is a great abstraction for
> > graph
> > > databases and has been implemented across various graph database
> backends
> > > / gaining traction. Has anyone thought about integrating the TinkerPop
> > > framework with GraphX to enable GraphX as another backend? Not sure if
> > > this has been brought up or not, but would certainly volunteer to
> > > spearhead this effort if the community thinks it to be a good idea!
> > >
> > > As an aside, wasn¹t sure if this discussion should happen on the board
> > > here or on JIRA, but a made a ticket as well for reference:
> > > https://issues.apache.org/jira/browse/SPARK-4279
> > >
> >
>


Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kushal Datta
What do you guys think about the TinkerPop3 Gremlin interface?
It uses MapReduce to run Gremlin operators in a distributed manner and
Giraph to execute vertex programs.

TinkerPop3 is better suited for GraphX.

On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott wrote:

> I've taken a crack at implementing the TinkerPop Blueprints API in GraphX
> ( https://github.com/kellrott/sparkgraph ). I've also implemented portions
> of the Gremlin Search Language and a Parquet-based graph store.
> I've been working to finalize some code details and putting together
> better code examples and documentation before I start telling people
> about it.
> But if you want to start looking at the code, I can answer any questions
> you have. And if you would like to contribute, I would really appreciate
> the help.
>
> Kyle
>
>
> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin wrote:
>
> > cc Matthias
> >
> > In the past we talked with Matthias and there were some discussions about
> > this.
> >
> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
> > brennon.y...@capitalone.com>
> > wrote:
> >
> > > All, was wondering if there had been any discussion around this topic
> > > yet? TinkerPop is a great abstraction for graph databases and has been
> > > implemented across various graph database backends and is gaining
> > > traction. Has anyone thought about integrating the TinkerPop
> > > framework with GraphX to enable GraphX as another backend? Not sure if
> > > this has been brought up or not, but would certainly volunteer to
> > > spearhead this effort if the community thinks it to be a good idea!
> > >
> > > As an aside, wasn't sure if this discussion should happen on the board
> > > here or on JIRA, but I made a ticket as well for reference:
> > > https://issues.apache.org/jira/browse/SPARK-4279
> > >
> >
>


Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kyle Ellrott
I've taken a crack at implementing the TinkerPop Blueprints API in GraphX (
https://github.com/kellrott/sparkgraph ). I've also implemented portions of
the Gremlin Search Language and a Parquet-based graph store.
I've been working to finalize some code details and putting together
better code examples and documentation before I start telling people
about it.
But if you want to start looking at the code, I can answer any questions
you have. And if you would like to contribute, I would really appreciate
the help.

Kyle


On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin wrote:

> cc Matthias
>
> In the past we talked with Matthias and there were some discussions about
> this.
>
> On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
> brennon.y...@capitalone.com>
> wrote:
>
> > All, was wondering if there had been any discussion around this topic
> > yet? TinkerPop is a great abstraction for graph databases and has been
> > implemented across various graph database backends and is gaining
> > traction. Has anyone thought about integrating the TinkerPop
> > framework with GraphX to enable GraphX as another backend? Not sure if
> > this has been brought up or not, but would certainly volunteer to
> > spearhead this effort if the community thinks it to be a good idea!
> >
> > As an aside, wasn't sure if this discussion should happen on the board
> > here or on JIRA, but I made a ticket as well for reference:
> > https://issues.apache.org/jira/browse/SPARK-4279
> >
>


Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Reynold Xin
cc Matthias

In the past we talked with Matthias and there were some discussions about
this.

On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon 
wrote:

> All, was wondering if there had been any discussion around this topic yet?
> TinkerPop is a great abstraction for graph databases and has been
> implemented across various graph database backends and is gaining
> traction. Has anyone thought about integrating the TinkerPop
> framework with GraphX to enable GraphX as another backend? Not sure if
> this has been brought up or not, but would certainly volunteer to
> spearhead this effort if the community thinks it to be a good idea!
>
> As an aside, wasn't sure if this discussion should happen on the board
> here or on JIRA, but I made a ticket as well for reference:
> https://issues.apache.org/jira/browse/SPARK-4279
>