Re: Implementing TinkerPop on top of GraphX

Kushal Datta Thu, 20 Nov 2014 11:01:05 -0800

I have also added a graphx-gremlin module to the Tinkerpop3 codebase. Right
now a GraphX graph can be instantiated from the Gremlin command line (in a
manner similar to how a Giraph graph is instantiated), and the g.V().count()
function calls the count() method on RDDs.
Please check out the code in:
https://github.com/kdatta/tinkerpop3/tree/graphx-gremlin
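
As a rough illustration of the delegation described above (not taken from the
linked branch), the sketch below shows a Gremlin-style g.V().count() call
forwarding to a count() method on the underlying vertex collection. FakeRDD,
VertexTraversal and Graph are hypothetical stand-ins invented for this
example; they are not TinkerPop, Spark or GraphX classes.

```python
class FakeRDD:
    """Hypothetical stand-in for a distributed collection; only count() is modelled."""

    def __init__(self, items):
        self._items = list(items)

    def count(self):
        # A real RDD would run a distributed job here; this just counts locally.
        return len(self._items)


class VertexTraversal:
    """Hypothetical traversal over the vertex collection of a graph."""

    def __init__(self, vertices):
        self._vertices = vertices

    def count(self):
        # The Gremlin-style count step simply delegates to the collection.
        return self._vertices.count()


class Graph:
    """Hypothetical graph holding vertex and edge collections."""

    def __init__(self, vertices, edges):
        self.vertices = FakeRDD(vertices)
        self.edges = FakeRDD(edges)

    def V(self):
        # Start a traversal over all vertices, as g.V() does in Gremlin.
        return VertexTraversal(self.vertices)


g = Graph(vertices=[1, 2, 3], edges=[(1, 2), (2, 3)])
print(g.V().count())
```

The point of the sketch is only that the traversal step holds no counting
logic of its own; the collection does the work.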


@Kyle, I'm off for a few days till Thanksgiving. After that I'll try the
EdgeIterator in this code.

Thanks,
-Kushal.

On Tue, Nov 18, 2014 at 2:23 PM, Kyle Ellrott <[email protected]> wrote:

> The new Tinkerpop3 API was different enough from V2 that it was worth
> starting a new implementation rather than trying to completely refactor my
> old code.
> I've started a new project: https://github.com/kellrott/spark-gremlin
> which compiles and runs the first set of unit tests (which it completely
> fails). Most of the classes are structured in the same way they are in the
> Giraph implementation. There isn't much actual GraphX code in the project
> yet, just a framework to start working in.
> Hopefully this will keep the conversation going.
>
> Kyle
>
> On Fri, Nov 7, 2014 at 11:17 AM, Kushal Datta <[email protected]>
> wrote:
>
>> I think if we are going to use GraphX as the query engine in Tinkerpop3,
>> then the Tinkerpop3 community is the right platform to further the
>> discussion.
>>
>> The reason I asked the question on improving APIs in GraphX is that this
>> need not be limited to Gremlin: any graph DSL can exploit the GraphX APIs.
>> Cypher has some good subgraph matching query interfaces which I believe can
>> be distributed using GraphX APIs.
>>
>> An edge ID is an internal attribute of the edge generated automatically,
>> mostly hidden from the user. That's why adding it as an edge property might
>> not be a good idea. There are several little differences like this. E.g. in
>> Tinkerpop3 Gremlin implementation for Giraph, only vertex programs are
>> executed in Giraph directly. The side-effect operators are mapped to
>> Map-Reduce functions. In the implementation we are talking about, all of
>> these operations can be done within GraphX. I will be interested to
>> co-develop the query engine.
>>
>> @Reynold, I agree. And as I said earlier, the APIs should be designed in
>> such a way that they can be used by any graph DSL.
>>
>> On Fri, Nov 7, 2014 at 10:59 AM, Kyle Ellrott <[email protected]>
>> wrote:
>>
>>> Who here would be interested in helping to work on an implementation of
>>> the Tinkerpop3 Gremlin API for Spark? Is this something that should continue
>>> in the Spark discussion group, or should it migrate to the Gremlin message
>>> group?
>>>
>>> Reynold is right that there will be inherent mismatches in the APIs, and
>>> there will need to be some discussions with the GraphX group about the best
>>> way to go. One example would be edge ids. GraphX has vertex ids, but no
>>> explicit edge ids, while Gremlin has both. Edge ids could be put into the
>>> attr field, but then the user would have to explicitly make their edge
>>> attribute type implement the edge attribute interface. Is that worth doing,
>>> versus adding an id to everyone's edges?
>>>
>>> Kyle
>>>
>>>
>>> On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin <[email protected]> wrote:
>>>
>>>> Some form of graph querying support would be great to have. This can be
>>>> a great community project hosted outside of Spark initially, both due to
>>>> the maturity of the component itself as well as the maturity of query
>>>> language standards (there isn't really a dominant standard for graph ql).
>>>>
>>>> One thing is that GraphX API will need to evolve and probably need to
>>>> provide more primitives in order to support the new ql implementation.
>>>> There might also be inherent mismatches in the way the external API is
>>>> defined vs what GraphX can support. We should discuss those on a
>>>> case-by-case basis.
>>>>
>>>>
>>>> On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott <[email protected]>
>>>> wrote:
>>>>
>>>>> I think it's best to look to an existing standard rather than try to make
>>>>> your own. Of course small additions would need to be added to make it
>>>>> valuable for the Spark community, like a method similar to Gremlin's
>>>>> 'table' function, that produces an RDD instead.
>>>>> But there may be a lot of extra code and data structures that would
>>>>> need to be added to make it work, and those may not be directly applicable
>>>>> to all GraphX users. I think it would be best run as a separate
>>>>> module/project that builds directly on top of GraphX.
>>>>>
>>>>> Kyle
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> My personal 2c is that, since GraphX is just beginning to provide a
>>>>>> full featured graph API, I think it would be better to align with the
>>>>>> TinkerPop group rather than roll our own. In my mind the benefits outweigh
>>>>>> the detriments as follows:
>>>>>>
>>>>>> Benefits:
>>>>>> * GraphX gains the ability to become another core tenant within the
>>>>>> TinkerPop community allowing a more diverse group of users into the Spark
>>>>>> ecosystem.
>>>>>> * TinkerPop can continue to maintain and own a solid / feature-rich
>>>>>> graph API that has already been accepted by a wide audience, relieving
>>>>>> the pressure of “one off” API additions from the GraphX team.
>>>>>> * GraphX can demonstrate its ability to be a key player in the
>>>>>> GraphDB space sitting in line with other major distributions (Neo4j,
>>>>>> Titan, etc.).
>>>>>> * Allows for the abstract graph traversal logic (query API) to be
>>>>>> owned and maintained by a group already proven on the topic.
>>>>>>
>>>>>> Drawbacks:
>>>>>> * GraphX doesn’t own the API for its graph query capability. This
>>>>>> could be seen as good or bad, but it might make GraphX-specific
>>>>>> implementation additions more tricky (possibly). Also, GraphX will need
>>>>>> to maintain the features described within the TinkerPop API as that
>>>>>> might change in the future.
>>>>>>
>>>>>> From: Kushal Datta <[email protected]>
>>>>>> Date: Thursday, November 6, 2014 at 4:00 PM
>>>>>> To: "York, Brennon" <[email protected]>
>>>>>> Cc: Kyle Ellrott <[email protected]>, Reynold Xin <
>>>>>> [email protected]>, "[email protected]" <[email protected]>,
>>>>>> Matthias Broecheler <[email protected]>
>>>>>>
>>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>>
>>>>>> Before we dive into the implementation details, what are the high
>>>>>> level thoughts on Gremlin/GraphX? Scala already provides the procedural
>>>>>> way to query graphs in GraphX today. So, today I can run
>>>>>> g.vertices().filter().join() queries as OLAP in GraphX just like
>>>>>> Tinkerpop3 Gremlin, of course sans the useful operators that Gremlin
>>>>>> offers such as outE, inE, loop, as, dedup, etc. In that case is mapping
>>>>>> Gremlin operators to GraphX APIs a better approach or should we extend
>>>>>> the existing set of transformations/actions that GraphX already offers
>>>>>> with the useful operators from Gremlin? For example, we add as(), loop()
>>>>>> and dedup() methods in VertexRDD and EdgeRDD.
>>>>>>
>>>>>> Either way we get a desperately needed graph query interface in
>>>>>> GraphX.
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> This was my thought exactly with the TinkerPop3 release. Looks like,
>>>>>>> to move this forward, we’d need to implement gremlin-core per <
>>>>>>> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>.
>>>>>>> The real question lies in whether GraphX can only support the OLTP
>>>>>>> functionality, or if we can bake into it the OLAP requirements as well.
>>>>>>> At a first glance I believe we could create an entire OLAP system. If
>>>>>>> so, I believe we could do this in a set of parallel subtasks, those
>>>>>>> being the implementation of each of the individual APIs (Structure,
>>>>>>> Process, and, if OLAP, GraphComputer) necessary for gremlin-core.
>>>>>>> Thoughts?
>>>>>>>
>>>>>>>
>>>>>>> From: Kyle Ellrott <[email protected]>
>>>>>>> Date: Thursday, November 6, 2014 at 12:10 PM
>>>>>>> To: Kushal Datta <[email protected]>
>>>>>>> Cc: Reynold Xin <[email protected]>, "York, Brennon" <
>>>>>>> [email protected]>, "[email protected]" <
>>>>>>> [email protected]>, Matthias Broecheler <
>>>>>>> [email protected]>
>>>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>>>
>>>>>>> I still have to dig into the Tinkerpop3 internals (I started my work
>>>>>>> long before it had been released), but I can say that getting the
>>>>>>> Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack. The
>>>>>>> whole Tinkerpop2 Gremlin design was based around streaming pipes of
>>>>>>> data, rather than large distributed map-reduce operations. I had to hack
>>>>>>> the pipes to aggregate all of the data and pass a single object wrapping
>>>>>>> the GraphX RDDs down the pipes in a single go, rather than streaming it
>>>>>>> element by element.
>>>>>>> Just based on their description, Tinkerpop3 may be more amenable to
>>>>>>> the Spark platform.
>>>>>>>
>>>>>>> Kyle
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> What do you guys think about the Tinkerpop3 Gremlin interface?
>>>>>>>> It has MapReduce to run Gremlin operators in a distributed manner
>>>>>>>> and Giraph to execute vertex programs.
>>>>>>>>
>>>>>>>> Tinkerpop3 is better suited for GraphX.
>>>>>>>>
>>>>>>>> On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I've taken a crack at implementing the TinkerPop Blueprints API in
>>>>>>>>> GraphX ( https://github.com/kellrott/sparkgraph ). I've also
>>>>>>>>> implemented portions of the Gremlin Search Language and a Parquet
>>>>>>>>> based graph store.
>>>>>>>>> I've been working to finalize some code details and put together
>>>>>>>>> better code examples and documentation before I start telling people
>>>>>>>>> about it.
>>>>>>>>> But if you want to start looking at the code, I can answer any
>>>>>>>>> questions you have. And if you would like to contribute, I would
>>>>>>>>> really appreciate the help.
>>>>>>>>>
>>>>>>>>> Kyle
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > cc Matthias
>>>>>>>>> >
>>>>>>>>> > In the past we talked with Matthias and there were some
>>>>>>>>> > discussions about this.
>>>>>>>>> >
>>>>>>>>> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
>>>>>>>>> > [email protected]>
>>>>>>>>> > wrote:
>>>>>>>>> >
>>>>>>>>> > > All, was wondering if there had been any discussion around this
>>>>>>>>> > > topic yet? TinkerPop <https://github.com/tinkerpop> is a great
>>>>>>>>> > > abstraction for graph databases and has been implemented across
>>>>>>>>> > > various graph database backends / gaining traction. Has anyone
>>>>>>>>> > > thought about integrating the TinkerPop framework with GraphX to
>>>>>>>>> > > enable GraphX as another backend? Not sure if this has been
>>>>>>>>> > > brought up or not, but would certainly volunteer to spearhead
>>>>>>>>> > > this effort if the community thinks it to be a good idea!
>>>>>>>>> > >
>>>>>>>>> > > As an aside, wasn't sure if this discussion should happen on the
>>>>>>>>> > > board here or on JIRA, but I made a ticket as well for reference:
>>>>>>>>> > > https://issues.apache.org/jira/browse/SPARK-4279
>>>>>>>>> > >
>>>>>>>>> > > ________________________________________________________
>>>>>>>>> > >
>>>>>>>>> > > The information contained in this e-mail is confidential and/or
>>>>>>>>> > > proprietary to Capital One and/or its affiliates. The
>>>>>>>>> information
>>>>>>>>> > > transmitted herewith is intended only for use by the
>>>>>>>>> individual or entity
>>>>>>>>> > > to which it is addressed.  If the reader of this message is
>>>>>>>>> not the
>>>>>>>>> > > intended recipient, you are hereby notified that any review,
>>>>>>>>> > > retransmission, dissemination, distribution, copying or other
>>>>>>>>> use of, or
>>>>>>>>> > > taking of any action in reliance upon this information is
>>>>>>>>> strictly
>>>>>>>>> > > prohibited. If you have received this communication in error,
>>>>>>>>> please
>>>>>>>>> > > contact the sender and delete the material from your computer.
>>>>>>>>> > >
>>>>>>>>> > >
>>>>>>>>> > >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
