I have also added a graphx-gremlin module to the Tinkerpop3 codebase. Right now a GraphX graph can be instantiated from the Gremlin command line (in much the same way a Giraph graph is instantiated), and the g.V().count() function calls the count() method on RDDs. Please check out the code at: https://github.com/kdatta/tinkerpop3/tree/graphx-gremlin
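
For anyone following along, the delegation described above can be sketched roughly as follows. This is plain Python, not the actual graphx-gremlin code: PartitionedCollection, VertexTraversal, and SketchGraph are hypothetical stand-ins (for an RDD, a Gremlin traversal, and the graph wrapper), and none of these names come from Tinkerpop3 or GraphX. It only illustrates that g.V().count() does no counting itself and hands the work to the backing collection's count().

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

T = TypeVar("T")


@dataclass
class PartitionedCollection(Generic[T]):
    """Hypothetical stand-in for an RDD: elements split across partitions."""
    partitions: List[List[T]]

    def count(self) -> int:
        # Count each partition independently, then sum the partial counts,
        # which is the general shape of a distributed count.
        return sum(len(p) for p in self.partitions)


@dataclass
class VertexTraversal:
    """Hypothetical traversal over the vertex collection of a graph."""
    vertices: PartitionedCollection

    def count(self) -> int:
        # The traversal step does no work itself; it delegates to the
        # backing collection.
        return self.vertices.count()


@dataclass
class SketchGraph:
    """Hypothetical graph wrapper exposing a Gremlin-like V() entry point."""
    vertices: PartitionedCollection

    def V(self) -> VertexTraversal:
        return VertexTraversal(self.vertices)


# Six vertex ids spread over three partitions (made-up data).
g = SketchGraph(PartitionedCollection([[1, 2], [3], [4, 5, 6]]))
vertex_count = g.V().count()
print(vertex_count)
```

The point of the shape is that the query layer stays thin: any step that has a direct equivalent on the backing collection can be forwarded in the same way.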
@Kyle, I'm off for a few days till Thanksgiving. After that I'll try the
EdgeIterator in this code.

Thanks,
-Kushal.

On Tue, Nov 18, 2014 at 2:23 PM, Kyle Ellrott <kellr...@soe.ucsc.edu> wrote:

> The new Tinkerpop3 API was different enough from V2 that it was worth
> starting a new implementation rather than trying to completely refactor my
> old code.
> I've started a new project: https://github.com/kellrott/spark-gremlin
> which compiles and runs the first set of unit tests (which it completely
> fails). Most of the classes are structured in the same way they are in the
> Giraph implementation. There isn't much actual GraphX code in the project
> yet, just a framework to start working in.
> Hopefully this will keep the conversation going.
>
> Kyle
>
> On Fri, Nov 7, 2014 at 11:17 AM, Kushal Datta <kushal.da...@gmail.com>
> wrote:
>
>> I think if we are going to use GraphX as the query engine in Tinkerpop3,
>> then the Tinkerpop3 community is the right platform to further the
>> discussion.
>>
>> The reason I asked the question on improving APIs in GraphX is that
>> not only Gremlin but any graph DSL can exploit the GraphX APIs. Cypher has
>> some good subgraph matching query interfaces which I believe can be
>> distributed using GraphX APIs.
>>
>> An edge ID is an internal attribute of the edge generated automatically,
>> mostly hidden from the user. That's why adding it as an edge property might
>> not be a good idea. There are several little differences like this. E.g. in
>> the Tinkerpop3 Gremlin implementation for Giraph, only vertex programs are
>> executed in Giraph directly. The side-effect operators are mapped to
>> Map-Reduce functions. In the implementation we are talking about, all of
>> these operations can be done within GraphX. I will be interested to
>> co-develop the query engine.
>>
>> @Reynold, I agree. And as I said earlier, the APIs should be designed in
>> such a way that they can be used in any graph DSL.
>>
>> On Fri, Nov 7, 2014 at 10:59 AM, Kyle Ellrott <kellr...@soe.ucsc.edu>
>> wrote:
>>
>>> Who here would be interested in helping to work on an implementation of
>>> the Tinkerpop3 Gremlin API for Spark? Is this something that should
>>> continue in the Spark discussion group, or should it migrate to the
>>> Gremlin message group?
>>>
>>> Reynold is right that there will be inherent mismatches in the APIs, and
>>> there will need to be some discussions with the GraphX group about the
>>> best way to go. One example would be edge ids. GraphX has vertex ids, but
>>> no explicit edge ids, while Gremlin has both. Edge ids could be put into
>>> the attr field, but then that means the user would have to explicitly
>>> subclass their edge attribute to the edge attribute interface. Is that
>>> worth doing, versus adding an id to everyone's edges?
>>>
>>> Kyle
>>>
>>> On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Some form of graph querying support would be great to have. This can be
>>>> a great community project hosted outside of Spark initially, both due to
>>>> the maturity of the component itself as well as the maturity of query
>>>> language standards (there isn't really a dominant standard for graph ql).
>>>>
>>>> One thing is that the GraphX API will need to evolve and probably need to
>>>> provide more primitives in order to support the new ql implementation.
>>>> There might also be inherent mismatches in the way the external API is
>>>> defined vs what GraphX can support. We should discuss those on a
>>>> case-by-case basis.
>>>>
>>>> On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott <kellr...@soe.ucsc.edu>
>>>> wrote:
>>>>
>>>>> I think it's best to look to an existing standard rather than try to make
>>>>> your own. Of course small additions would need to be added to make it
>>>>> valuable for the Spark community, like a method similar to Gremlin's
>>>>> 'table' function, that produces an RDD instead.
>>>>> But there may be a lot of extra code and data structures that would
>>>>> need to be added to make it work, and those may not be directly
>>>>> applicable to all GraphX users. I think it would be best run as a
>>>>> separate module/project that builds directly on top of GraphX.
>>>>>
>>>>> Kyle
>>>>>
>>>>> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon
>>>>> <brennon.y...@capitalone.com> wrote:
>>>>>
>>>>>> My personal 2c is that, since GraphX is just beginning to provide a
>>>>>> full-featured graph API, I think it would be better to align with the
>>>>>> TinkerPop group rather than roll our own. In my mind the benefits
>>>>>> outweigh the detriments as follows:
>>>>>>
>>>>>> Benefits:
>>>>>> * GraphX gains the ability to become another core tenant within the
>>>>>>   TinkerPop community allowing a more diverse group of users into the
>>>>>>   Spark ecosystem.
>>>>>> * TinkerPop can continue to maintain and own a solid / feature-rich
>>>>>>   graph API that has already been accepted by a wide audience,
>>>>>>   relieving the pressure of “one off” API additions from the GraphX
>>>>>>   team.
>>>>>> * GraphX can demonstrate its ability to be a key player in the
>>>>>>   GraphDB space sitting in line with other major distributions (Neo4j,
>>>>>>   Titan, etc.).
>>>>>> * Allows for the abstract graph traversal logic (query API) to be
>>>>>>   owned and maintained by a group already proven on the topic.
>>>>>>
>>>>>> Drawbacks:
>>>>>> * GraphX doesn’t own the API for its graph query capability. This
>>>>>>   could be seen as good or bad, but it might make GraphX-specific
>>>>>>   implementation additions more tricky (possibly). Also, GraphX will
>>>>>>   need to maintain the features described within the TinkerPop API as
>>>>>>   that might change in the future.
>>>>>>
>>>>>> From: Kushal Datta <kushal.da...@gmail.com>
>>>>>> Date: Thursday, November 6, 2014 at 4:00 PM
>>>>>> To: "York, Brennon" <brennon.y...@capitalone.com>
>>>>>> Cc: Kyle Ellrott <kellr...@soe.ucsc.edu>, Reynold Xin
>>>>>> <r...@databricks.com>, "dev@spark.apache.org" <dev@spark.apache.org>,
>>>>>> Matthias Broecheler <matth...@thinkaurelius.com>
>>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>>
>>>>>> Before we dive into the implementation details, what are the high
>>>>>> level thoughts on Gremlin/GraphX? Scala already provides the procedural
>>>>>> way to query graphs in GraphX today. So, today I can run
>>>>>> g.vertices().filter().join() queries as OLAP in GraphX just like
>>>>>> Tinkerpop3 Gremlin, of course sans the useful operators that Gremlin
>>>>>> offers such as outE, inE, loop, as, dedup, etc. In that case is mapping
>>>>>> Gremlin operators to GraphX APIs a better approach or should we extend
>>>>>> the existing set of transformations/actions that GraphX already offers
>>>>>> with the useful operators from Gremlin? For example, we add as(),
>>>>>> loop() and dedup() methods in VertexRDD and EdgeRDD.
>>>>>>
>>>>>> Either way we get a desperately needed graph query interface in
>>>>>> GraphX.
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon
>>>>>> <brennon.y...@capitalone.com> wrote:
>>>>>>
>>>>>>> This was my thought exactly with the TinkerPop3 release. Looks like,
>>>>>>> to move this forward, we’d need to implement gremlin-core per
>>>>>>> <http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>.
>>>>>>> The real question lies in whether GraphX can only support the OLTP
>>>>>>> functionality, or if we can bake into it the OLAP requirements as well.
>>>>>>> At first glance I believe we could create an entire OLAP system.
>>>>>>> If so, I believe we could do this in a set of parallel subtasks, those
>>>>>>> being the implementation of each of the individual APIs (Structure,
>>>>>>> Process, and, if OLAP, GraphComputer) necessary for gremlin-core.
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> From: Kyle Ellrott <kellr...@soe.ucsc.edu>
>>>>>>> Date: Thursday, November 6, 2014 at 12:10 PM
>>>>>>> To: Kushal Datta <kushal.da...@gmail.com>
>>>>>>> Cc: Reynold Xin <r...@databricks.com>, "York, Brennon"
>>>>>>> <brennon.y...@capitalone.com>, "dev@spark.apache.org"
>>>>>>> <dev@spark.apache.org>, Matthias Broecheler
>>>>>>> <matth...@thinkaurelius.com>
>>>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>>>
>>>>>>> I still have to dig into the Tinkerpop3 internals (I started my work
>>>>>>> long before it had been released), but I can say that getting the
>>>>>>> Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack. The
>>>>>>> whole Tinkerpop2 Gremlin design was based around streaming pipes of
>>>>>>> data, rather than large distributed map-reduce operations. I had to hack
>>>>>>> the pipes to aggregate all of the data and pass a single object wrapping
>>>>>>> the GraphX RDDs down the pipes in a single go, rather than streaming it
>>>>>>> element by element.
>>>>>>> Just based on their description, Tinkerpop3 may be more amenable to
>>>>>>> the Spark platform.
>>>>>>>
>>>>>>> Kyle
>>>>>>>
>>>>>>> On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta
>>>>>>> <kushal.da...@gmail.com> wrote:
>>>>>>>
>>>>>>>> What do you guys think about the Tinkerpop3 Gremlin interface?
>>>>>>>> It has MapReduce to run Gremlin operators in a distributed manner
>>>>>>>> and Giraph to execute vertex programs.
>>>>>>>>
>>>>>>>> Tinkerpop3 is better suited for GraphX.
>>>>>>>>
>>>>>>>> On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott
>>>>>>>> <kellr...@soe.ucsc.edu> wrote:
>>>>>>>>
>>>>>>>>> I've taken a crack at implementing the TinkerPop Blueprints API in
>>>>>>>>> GraphX ( https://github.com/kellrott/sparkgraph ). I've also
>>>>>>>>> implemented portions of the Gremlin Search Language and a
>>>>>>>>> Parquet-based graph store.
>>>>>>>>> I've been working to finalize some code details and putting together
>>>>>>>>> better code examples and documentation before I started telling
>>>>>>>>> people about it.
>>>>>>>>> But if you want to start looking at the code, I can answer any
>>>>>>>>> questions you have. And if you would like to contribute, I would
>>>>>>>>> really appreciate the help.
>>>>>>>>>
>>>>>>>>> Kyle
>>>>>>>>>
>>>>>>>>> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin <r...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > cc Matthias
>>>>>>>>> >
>>>>>>>>> > In the past we talked with Matthias and there were some
>>>>>>>>> > discussions about this.
>>>>>>>>> >
>>>>>>>>> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon
>>>>>>>>> > <brennon.y...@capitalone.com> wrote:
>>>>>>>>> >
>>>>>>>>> > > All, was wondering if there had been any discussion around this
>>>>>>>>> > > topic yet? TinkerPop <https://github.com/tinkerpop> is a great
>>>>>>>>> > > abstraction for graph databases and has been implemented across
>>>>>>>>> > > various graph database backends / gaining traction. Has anyone
>>>>>>>>> > > thought about integrating the TinkerPop framework with GraphX to
>>>>>>>>> > > enable GraphX as another backend? Not sure if this has been
>>>>>>>>> > > brought up or not, but would certainly volunteer to spearhead
>>>>>>>>> > > this effort if the community thinks it to be a good idea!
>>>>>>>>> > >
>>>>>>>>> > > As an aside, I wasn’t sure if this discussion should happen on the
>>>>>>>>> > > board here or on JIRA, but I made a ticket as well for reference:
>>>>>>>>> > > https://issues.apache.org/jira/browse/SPARK-4279
>>>>>>>>> > >
>>>>>>>>> > > ________________________________________________________
>>>>>>>>> > >
>>>>>>>>> > > The information contained in this e-mail is confidential and/or
>>>>>>>>> > > proprietary to Capital One and/or its affiliates. The information
>>>>>>>>> > > transmitted herewith is intended only for use by the individual or
>>>>>>>>> > > entity to which it is addressed. If the reader of this message is
>>>>>>>>> > > not the intended recipient, you are hereby notified that any
>>>>>>>>> > > review, retransmission, dissemination, distribution, copying or
>>>>>>>>> > > other use of, or taking of any action in reliance upon this
>>>>>>>>> > > information is strictly prohibited. If you have received this
>>>>>>>>> > > communication in error, please contact the sender and delete the
>>>>>>>>> > > material from your computer.
>>>>>>>>> > >
>>>>>>>>> > > ---------------------------------------------------------------------
>>>>>>>>> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>> > > For additional commands, e-mail: dev-h...@spark.apache.org