Re: [discuss] dropping Python 2.6 support

2016-01-04 Thread Kushal Datta
+1


Dr. Kushal Datta
Senior Research Scientist
Big Data Research & Pathfinding
Intel Corporation, USA.

On Mon, Jan 4, 2016 at 11:52 PM, Jean-Baptiste Onofré 
wrote:

> +1
>
> no problem for me to remove Python 2.6 in 2.0.
>
> Thanks
> Regards
> JB
>
>
> On 01/05/2016 08:17 AM, Reynold Xin wrote:
>
>> Does anybody here care about us dropping support for Python 2.6 in Spark
>> 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many respects (e.g. JSON
>> parsing) compared with Python 2.7. Some libraries that Spark depends on
>> have stopped supporting 2.6. We could still convince the library maintainers
>> to support 2.6, but it would be extra work. I'm curious whether anybody still
>> uses Python 2.6 to run Spark.
>>
>> Thanks.
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Implementing TinkerPop on top of GraphX

2014-11-20 Thread Kushal Datta
I have also added a graphx-gremlin module in the Tinkerpop3 codebase. Right
now a GraphX graph can be instantiated from the Gremlin command line (in a
similar manner to how a Giraph graph is instantiated), and the g.V().count()
traversal calls the count() method on the underlying RDDs.
Please check out the code in:
https://github.com/kdatta/tinkerpop3/tree/graphx-gremlin
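For readers following along, the mapping described above (a Gremlin g.V().count() step ending in an RDD count()) can be pictured in a few lines of plain Scala. This is only a sketch: GraphSketch and VertexTraversal below are illustrative stand-ins for GraphX's Graph and a Gremlin traversal, and Seq stands in for an RDD; they are not the real graphx-gremlin types.

```scala
// Stand-in for org.apache.spark.graphx.Graph[VD, ED]; Seq stands in for RDDs.
final case class GraphSketch[VD, ED](
    vertices: Seq[(Long, VD)],     // VertexRDD analogue: (vertexId, attr)
    edges: Seq[(Long, Long, ED)])  // EdgeRDD analogue: (src, dst, attr)

// Stand-in for a Gremlin traversal over the vertex set.
final class VertexTraversal[VD](vs: Seq[(Long, VD)]) {
  // Gremlin's count() step; the real module would call RDD.count() here.
  def count(): Long = vs.size.toLong
}

object GremlinOnGraphX {
  // g.V(): start a traversal over the vertex set
  def V[VD, ED](g: GraphSketch[VD, ED]): VertexTraversal[VD] =
    new VertexTraversal(g.vertices)
}
```

The point of the shape is that the traversal terminates in a single distributed action rather than streaming vertices one by one.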

@Kyle, I'm off for a few days till Thanksgiving. After that I'll try the
EdgeIterator in this code.

Thanks,
-Kushal.

On Tue, Nov 18, 2014 at 2:23 PM, Kyle Ellrott  wrote:

> The new Tinkerpop3 API was different enough from V2 that it was worth
> starting a new implementation rather than trying to completely refactor my
> old code.
> I've started a new project: https://github.com/kellrott/spark-gremlin
> which compiles and runs the first set of unit tests (which it completely
> fails). Most of the classes are structured in the same way they are in the
> Giraph implementation. There isn't much actual GraphX code in the project
> yet, just a framework to start working in.
> Hopefully this will keep the conversation going.
>
> Kyle
>
> On Fri, Nov 7, 2014 at 11:17 AM, Kushal Datta 
> wrote:
>
>> I think if we are going to use GraphX as the query engine in Tinkerpop3,
>> then the Tinkerpop3 community is the right platform to further the
>> discussion.
>>
>> The reason I asked the question about improving the APIs in GraphX is that
>> not only Gremlin but any graph DSL can exploit the GraphX APIs. Cypher has
>> some good subgraph-matching query interfaces which I believe can be
>> distributed using the GraphX APIs.
>>
>> An edge ID is an internal attribute of the edge generated automatically,
>> mostly hidden from the user. That's why adding it as an edge property might
>> not be a good idea. There are several little differences like this. E.g. in
>> Tinkerpop3 Gremlin implementation for Giraph, only vertex programs are
>> executed in Giraph directly. The side-effect operators are mapped to
>> Map-Reduce functions. In the implementation we are talking about, all of
>> these operations can be done within GraphX. I will be interested to
>> co-develop the query engine.
>>
>> @Reynold, I agree. And as I said earlier, the APIs should be designed in
>> such a way that they can be used by any graph DSL.
>>
>> On Fri, Nov 7, 2014 at 10:59 AM, Kyle Ellrott 
>> wrote:
>>
>>> Who here would be interested in helping to work on an implementation of
>>> the Tikerpop3 Gremlin API for Spark? Is this something that should continue
>>> in the Spark discussion group, or should it migrate to the Gremlin message
>>> group?
>>>
>>> Reynold is right that there will be inherent mismatches in the APIs, and
>>> there will need to be some discussions with the GraphX group about the best
>>> way to go. One example would be edge ids. GraphX has vertex ids, but no
>>> explicit edge ids, while Gremlin has both. Edge ids could be put into the
>>> attr field, but then that means the user would have to explicitly subclass
>>> their edge attribute to the edge attribute interface. Is that worth doing,
>>> versus adding an id to everyone's edges?
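One way to picture the attr-field option raised above: wrap the user's edge attribute in an id-bearing envelope rather than changing GraphX's Edge itself. The names below are hypothetical stand-ins, not GraphX's real types; the sketch only shows the shape of the trade-off (every user attribute pays for the extra field).

```scala
// Stand-in for org.apache.spark.graphx.Edge, which has no id field.
final case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

// Hypothetical envelope carrying a Gremlin-style edge id alongside the
// user's own attribute value.
final case class WithEdgeId[ED](id: Long, value: ED)

object EdgeIds {
  // Recover the Gremlin edge id from the envelope stored in attr.
  def edgeId[ED](e: Edge[WithEdgeId[ED]]): Long = e.attr.id
}
```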
>>>
>>> Kyle
>>>
>>>
>>> On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin  wrote:
>>>
>>>> Some form of graph querying support would be great to have. This can be
>>>> a great community project hosted outside of Spark initially, both due to
>>>> the maturity of the component itself as well as the maturity of query
>>>> language standards (there isn't really a dominant standard for graph ql).
>>>>
>>>> One thing is that GraphX API will need to evolve and probably need to
>>>> provide more primitives in order to support the new ql implementation.
>>>> There might also be inherent mismatches in the way the external API is
>>>> defined vs what GraphX can support. We should discuss those on a
>>>> case-by-case basis.
>>>>
>>>>
>>>> On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott 
>>>> wrote:
>>>>
>>>>> I think it's best to look to an existing standard rather than try to make
>>>>> your own. Of course small additions would need to be added to make it
>>>>> valuable for the Spark community, like a method similar to Gremlin's
>>>>> 'table' function, that produces an RDD instead.
>>>>> But there may be a lot of extra code and data structures that would
>>>>> need to be added to make it work, and those may not be directly applicable
>>>>> to all GraphX users. I think it would be best run as a separate
>>>>> module/project that builds directly on top of GraphX.

Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
Hi David,


Yes, we are still headed in that direction.
Please take a look at the repo I sent earlier.
I think that's a good starting point.

Thanks,
-Kushal.

On Thu, Jan 15, 2015 at 8:31 AM, David Robinson 
wrote:

> I am new to Spark and GraphX, however, I use Tinkerpop backed graphs and
> think the idea of using Tinkerpop as the API for GraphX is a great idea and
> hope you are still headed in that direction.  I noticed that Tinkerpop 3 is
> moving into the Apache family:
> http://wiki.apache.org/incubator/TinkerPopProposal  which might alleviate
> concerns about having an API definition "outside" of Spark.
>
> Thanks,
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Implementing-TinkerPop-on-top-of-GraphX-tp9169p10126.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>
>


Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
The source code is under a new module named 'graphx'. Let me double check.

On Fri, Jan 16, 2015 at 2:11 PM, Kyle Ellrott  wrote:

> Looking at https://github.com/kdatta/tinkerpop3/compare/graphx-gremlin I
> only see a maven build file. Do you have some source code some place else?
>
> I've worked on a Spark-based implementation (
> https://github.com/kellrott/spark-gremlin ), but it's not done and I've
> been tied up on other projects.
> It also looks like Tinkerpop3 is a bit of a moving target. I had targeted the
> work done for gremlin-giraph (
> http://www.tinkerpop.com/docs/3.0.0.M5/#giraph-gremlin ) that was part of
> the M5 release, as a base model for implementation. But that appears to
> have been refactored into gremlin-hadoop (
> http://www.tinkerpop.com/docs/3.0.0.M6/#hadoop-gremlin ) in the M6
> release. I need to assess how much this changes the code.
>
> Most of the code that needs to change from Giraph to Spark will simply be
> replacing classes with Spark-derived ones. The main place where the
> logic will need to change is in the 'GraphComputer' class (
> https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java
> ) which is created by the Graph when the 'compute' method is called (
> https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/structure/HadoopGraph.java#L135
> ).
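To make the GraphComputer role described above concrete, here is a heavily simplified sketch of the loop such a class drives. VertexProgram and GraphXGraphComputer below are stand-ins, not the actual TinkerPop3 interfaces, and a real submit() would run Pregel supersteps over RDDs rather than a local loop.

```scala
// Simplified stand-in for TinkerPop3's VertexProgram: an initial state,
// one superstep transition, and a termination test.
trait VertexProgram[S] {
  def initial: S
  def step(state: S): S
  def terminate(state: S, iteration: Int): Boolean
}

// Stand-in for the GraphComputer created when Graph.compute() is called.
final class GraphXGraphComputer {
  def submit[S](program: VertexProgram[S]): S = {
    var state = program.initial
    var i = 0
    while (!program.terminate(state, i)) {
      state = program.step(state) // one superstep (a Pregel iteration in GraphX)
      i += 1
    }
    state
  }
}
```

The swap from Giraph to Spark then amounts to replacing what happens inside each superstep, which is why the GraphComputer class is where most of the logic changes.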
>
>
> Kyle
>
>
>
> On Fri, Jan 16, 2015 at 1:01 PM, Kushal Datta 
> wrote:
>
>> Hi David,
>>
>>
>> Yes, we are still headed in that direction.
>> Please take a look at the repo I sent earlier.
>> I think that's a good starting point.
>>
>> Thanks,
>> -Kushal.
>>
>> On Thu, Jan 15, 2015 at 8:31 AM, David Robinson 
>> wrote:
>>
>> > I am new to Spark and GraphX, however, I use Tinkerpop backed graphs and
>> > think the idea of using Tinkerpop as the API for GraphX is a great idea
>> and
>> > hope you are still headed in that direction.  I noticed that Tinkerpop
>> 3 is
>> > moving into the Apache family:
>> > http://wiki.apache.org/incubator/TinkerPopProposal  which might
>> alleviate
>> > concerns about having an API definition "outside" of Spark.
>> >
>> > Thanks,
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>


Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
Code updated. Sorry, the wrong branch was uploaded before.

On Fri, Jan 16, 2015 at 2:13 PM, Kushal Datta 
wrote:

> The source code is under a new module named 'graphx'. let me double check.
>
> On Fri, Jan 16, 2015 at 2:11 PM, Kyle Ellrott 
> wrote:
>
>> Looking at https://github.com/kdatta/tinkerpop3/compare/graphx-gremlin I
>> only see a maven build file. Do you have some source code some place else?
>>
>> I've worked on a Spark-based implementation (
>> https://github.com/kellrott/spark-gremlin ), but it's not done and I've
>> been tied up on other projects.
>> It also looks like Tinkerpop3 is a bit of a moving target. I had targeted the
>> work done for gremlin-giraph (
>> http://www.tinkerpop.com/docs/3.0.0.M5/#giraph-gremlin ) that was part
>> of the M5 release, as a base model for implementation. But that appears to
>> have been refactored into gremlin-hadoop (
>> http://www.tinkerpop.com/docs/3.0.0.M6/#hadoop-gremlin ) in the M6
>> release. I need to assess how much this changes the code.
>>
>> Most of the code that needs to change from Giraph to Spark will simply be
>> replacing classes with Spark-derived ones. The main place where the
>> logic will need to change is in the 'GraphComputer' class (
>> https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/process/computer/giraph/GiraphGraphComputer.java
>> ) which is created by the Graph when the 'compute' method is called (
>> https://github.com/tinkerpop/tinkerpop3/blob/master/hadoop-gremlin/src/main/java/com/tinkerpop/gremlin/hadoop/structure/HadoopGraph.java#L135
>> ).
>>
>>
>> Kyle
>>
>>
>>
>> On Fri, Jan 16, 2015 at 1:01 PM, Kushal Datta 
>> wrote:
>>
>>> Hi David,
>>>
>>>
>>> Yes, we are still headed in that direction.
>>> Please take a look at the repo I sent earlier.
>>> I think that's a good starting point.
>>>
>>> Thanks,
>>> -Kushal.
>>>
>>> On Thu, Jan 15, 2015 at 8:31 AM, David Robinson 
>>> wrote:
>>>
>>> > I am new to Spark and GraphX, however, I use Tinkerpop backed graphs
>>> and
>>> > think the idea of using Tinkerpop as the API for GraphX is a great
>>> idea and
>>> > hope you are still headed in that direction.  I noticed that Tinkerpop
>>> 3 is
>>> > moving into the Apache family:
>>> > http://wiki.apache.org/incubator/TinkerPopProposal  which might
>>> alleviate
>>> > concerns about having an API definition "outside" of Spark.
>>> >
>>> > Thanks,
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>
>


Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Kushal Datta
I want to address the issue that Matei raised about the heavy lifting
required for full SQL support. It is amazing that even after 30 years of
research there is not a single good open-source columnar database like
Vertica. There is a column-store option in MySQL, but it is not nearly as
sophisticated as Vertica or MonetDB. Yet there is a true need for such a
system. I wonder why that is, and it's high time to change it.
On Jan 26, 2015 5:47 PM, "Sandy Ryza"  wrote:

> Both SchemaRDD and DataFrame sound fine to me, though I like the former
> slightly better because it's more descriptive.
>
> Even if SchemaRDD's needs to rely on Spark SQL under the covers, it would
> be more clear from a user-facing perspective to at least choose a package
> name for it that omits "sql".
>
> I would also be in favor of adding a separate Spark Schema module for Spark
> SQL to rely on, but I imagine that might be too large a change at this
> point?
>
> -Sandy
>
> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia 
> wrote:
>
> > (Actually when we designed Spark SQL we thought of giving it another
> name,
> > like Spark Schema, but we decided to stick with SQL since that was the
> most
> > obvious use case to many users.)
> >
> > Matei
> >
> > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia 
> > wrote:
> > >
> > > While it might be possible to move this concept to Spark Core
> long-term,
> > supporting structured data efficiently does require quite a bit of the
> > infrastructure in Spark SQL, such as query planning and columnar storage.
> > The intent of Spark SQL though is to be more than a SQL server -- it's
> > meant to be a library for manipulating structured data. Since this is
> > possible to build over the core API, it's pretty natural to organize it
> > that way, same as Spark Streaming is a library.
> > >
> > > Matei
> > >
> > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers  wrote:
> > >>
> > >> "The context is that SchemaRDD is becoming a common data format used
> for
> > >> bringing data into Spark from external systems, and used for various
> > >> components of Spark, e.g. MLlib's new pipeline API."
> > >>
> > >> i agree. this to me also implies it belongs in spark core, not sql
> > >>
> > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > >> michaelma...@yahoo.com.invalid> wrote:
> > >>
> > >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> > Area
> > >>> Spark Meetup YouTube contained a wealth of background information on
> > this
> > >>> idea (mostly from Patrick and Reynold :-).
> > >>>
> > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > >>>
> > >>> 
> > >>> From: Patrick Wendell 
> > >>> To: Reynold Xin 
> > >>> Cc: "dev@spark.apache.org" 
> > >>> Sent: Monday, January 26, 2015 4:01 PM
> > >>> Subject: Re: renaming SchemaRDD -> DataFrame
> > >>>
> > >>>
> > >>> One thing potentially not clear from this e-mail, there will be a 1:1
> > >>> correspondence where you can get an RDD to/from a DataFrame.
> > >>>
> > >>>
> > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin 
> > wrote:
> > Hi,
> >
> > We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> > get the community's opinion.
> >
> > The context is that SchemaRDD is becoming a common data format used for
> > bringing data into Spark from external systems, and used for various
> > components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> > more users to be programming directly against the SchemaRDD API rather than
> > the core RDD API. SchemaRDD, through its less commonly used DSL originally
> > designed for writing test cases, has always had a data-frame-like API. In
> > 1.3, we are redesigning the API to make it usable for end users.
> >
> > There are two motivations for the renaming:
> >
> > 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >
> > 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> > though it would contain some RDD functions like map, flatMap, etc.), and
> > calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
> > DataFrame.rdd will return the underlying RDD for all RDD methods.
> >
> > My understanding is that very few users program directly against the
> > SchemaRDD API at the moment, because it is not well documented. However, to
> > maintain backward compatibility, we can create a type alias DataFrame that
> > is still named SchemaRDD. This will maintain source compatibility for
> > Scala. That said, we will have to update all existing materials to use
> > DataFrame rather than SchemaRDD.
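The source-compatibility mechanism Reynold describes (rename the class, keep the old name as a type alias) can be pictured in a few lines of plain Scala. DataFrame below is a stand-in, not the real Spark SQL class, and Seq stands in for the underlying RDD.

```scala
// Stand-in for the renamed class; DataFrame.rdd exposes the underlying rows.
class DataFrame(val schema: Seq[String], private val rows: Seq[Seq[Any]]) {
  def rdd: Seq[Seq[Any]] = rows // Seq stands in for the underlying RDD
}

object sqlcompat {
  // The compatibility alias: old Scala source mentioning SchemaRDD still
  // compiles, because SchemaRDD and DataFrame are the same type.
  type SchemaRDD = DataFrame
}
```

A type alias preserves source compatibility for Scala callers but not binary compatibility, which is why existing materials would still need updating.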

Re: [ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Kushal Datta
Kudos to the whole team for such a significant achievement!

On Fri, Mar 13, 2015 at 10:00 AM, Patrick Wendell 
wrote:

> Hi All,
>
> I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is
> the fourth release on the API-compatible 1.X line. It is Spark's
> largest release ever, with contributions from 172 developers and more
> than 1,000 commits!
>
> Visit the release notes [1] to read about the new features, or
> download [2] the release today.
>
> For errata in the contributions or release notes, please e-mail me
> *directly* (not on-list).
>
> Thanks to everyone who helped work on this release!
>
> [1] http://spark.apache.org/releases/spark-release-1-3-0.html
> [2] http://spark.apache.org/downloads.html
>
>
>


Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Kushal Datta
Reynold, what's the idea behind using LLVM?

On Wed, Apr 1, 2015 at 12:31 AM, Akhil Das 
wrote:

> Nice try :)
>
> Thanks
> Best Regards
>
> On Wed, Apr 1, 2015 at 12:41 PM, Reynold Xin  wrote:
>
> > Hi Spark devs,
> >
> > I've spent the last few months investigating the feasibility of
> > re-architecting Spark for mobile platforms, considering the growing
> > population of Android/iOS users. I'm happy to share with you my findings
> at
> > https://issues.apache.org/jira/browse/SPARK-6646
> >
> > The tl;dr is that we should support running Spark on Android/iOS, and the
> > best way to do this at the moment is to use Scala.js to compile Spark
> code
> > into JavaScript, and then run it in Safari or Chrome (and even node.js
> > potentially for servers).
> >
> > If you are on your phones right now and prefer reading a blog post rather
> > than a PDF file, you can read more about the design doc at
> >
> >
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
> >
> >
> > This is done in collaboration with TD, Xiangrui, Patrick. Look forward to
> > your feedback!
> >
>


Re: how long does it takes for full build ?

2015-04-16 Thread Kushal Datta
15-20 minutes.

On Thu, Apr 16, 2015 at 11:56 AM, Sree V 
wrote:

> Hi Team,
> How long does it take for a full build ('mvn clean package') on Spark
> 1.2.2-rc1?
>
>
> Thanking you.
>
> With Regards
> Sree


Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Kushal Datta
+1 (binding)

For tickets which span across multiple components, will it need to be
approved by all maintainers? For example, I'm working on the Python
bindings of GraphX where code is added to both Python and GraphX modules.

Thanks,
-Kushal.

On Thu, Nov 6, 2014 at 12:02 AM, Ankur Dave  wrote:

> +1 (binding)
>
> Ankur 
>
> On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia 
> wrote:
>
> > I'd like to formally call a [VOTE] on this model, to last 72 hours. The
> > [VOTE] will end on Nov 8, 2014 at 6 PM PST.
> >
>


Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kushal Datta
What do you guys think about the Tinkerpop3 Gremlin interface?
It has MapReduce to run Gremlin operators in a distributed manner and
Giraph to execute vertex programs.

Tinkerpop3 is better suited for GraphX.

On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott  wrote:

> I've taken a crack at implementing the TinkerPop Blueprints API in GraphX (
> https://github.com/kellrott/sparkgraph ). I've also implemented portions
> of
> the Gremlin Search Language and a Parquet based graph store.
> I've been working out finalize some code details and putting together
> better code examples and documentation before I started telling people
> about it.
> But if you want to start looking at the code, I can answer any questions
> you have. And if you would like to contribute, I would really appreciate
> the help.
>
> Kyle
>
>
> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin  wrote:
>
> > cc Matthias
> >
> > In the past we talked with Matthias and there were some discussions about
> > this.
> >
> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
> > brennon.y...@capitalone.com>
> > wrote:
> >
> > > All, was wondering if there had been any discussion around this topic
> > yet?
> > > TinkerPop  is a great abstraction for
> > graph
> > > databases and has been implemented across various graph database
> backends
> > > / gaining traction. Has anyone thought about integrating the TinkerPop
> > > framework with GraphX to enable GraphX as another backend? Not sure if
> > > this has been brought up or not, but would certainly volunteer to
> > > spearhead this effort if the community thinks it to be a good idea!
> > >
> > > As an aside, wasn't sure if this discussion should happen on the board
> > > here or on JIRA, but I made a ticket as well for reference:
> > > https://issues.apache.org/jira/browse/SPARK-4279
> > >
> > > 
> > >
> > > The information contained in this e-mail is confidential and/or
> > > proprietary to Capital One and/or its affiliates. The information
> > > transmitted herewith is intended only for use by the individual or
> entity
> > > to which it is addressed.  If the reader of this message is not the
> > > intended recipient, you are hereby notified that any review,
> > > retransmission, dissemination, distribution, copying or other use of,
> or
> > > taking of any action in reliance upon this information is strictly
> > > prohibited. If you have received this communication in error, please
> > > contact the sender and delete the material from your computer.
> > >
> > >
> > >
> > >
> >
>


Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kushal Datta
Before we dive into the implementation details, what are the high-level
thoughts on Gremlin/GraphX? Scala already provides a procedural way to
query graphs in GraphX today. So, today I can run
g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3
Gremlin, of course sans the useful operators that Gremlin offers such as
outE, inE, loop, as, dedup, etc. In that case, is mapping Gremlin operators
to GraphX APIs a better approach, or should we extend the existing set of
transformations/actions that GraphX already offers with the useful
operators from Gremlin? For example, we could add as(), loop() and dedup()
methods to VertexRDD and EdgeRDD.

Either way we get a desperately needed graph query interface in GraphX.
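The second option (extending GraphX's own surface with Gremlin-style operators) could also be done without patching VertexRDD itself, via an implicit enrichment living in a separate module. This is a sketch under that assumption: the names are hypothetical, Seq[(Long, VD)] stands in for VertexRDD, and only dedup() is shown.

```scala
// Hypothetical enrichment module: Gremlin-style operators bolted onto the
// vertex collection from the outside, so core GraphX stays untouched.
object GremlinOps {
  implicit class RichVertices[VD](val vs: Seq[(Long, VD)]) {
    // Gremlin's dedup(): keep the first vertex seen for each attribute value.
    def dedup: Seq[(Long, VD)] = {
      val seen = scala.collection.mutable.Set.empty[VD]
      vs.filter(v => seen.add(v._2))
    }
  }
}
```

Importing GremlinOps._ then makes `vertices.dedup` available wherever the module is on the classpath, which is the "separate module on top of GraphX" shape Kyle suggests elsewhere in this thread.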

On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon 
wrote:

> This was my thought exactly with the TinkerPop3 release. Looks like, to
> move this forward, we’d need to implement gremlin-core per <
> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>. The
> real question lies in whether GraphX can only support the OLTP
> functionality, or if we can bake into it the OLAP requirements as well. At
> a first glance I believe we could create an entire OLAP system. If so, I
> believe we could do this in a set of parallel subtasks, those being the
> implementation of each of the individual APIs (Structure, Process, and, if
> OLAP, GraphComputer) necessary for gremlin-core. Thoughts?
>
>
> From: Kyle Ellrott 
> Date: Thursday, November 6, 2014 at 12:10 PM
> To: Kushal Datta 
> Cc: Reynold Xin , "York, Brennon" <
> brennon.y...@capitalone.com>, "dev@spark.apache.org" ,
> Matthias Broecheler 
> Subject: Re: Implementing TinkerPop on top of GraphX
>
> I still have to dig into the Tinkerpop3 internals (I started my work long
> before it had been released), but I can say that getting the Tinkerpop2
> Gremlin pipeline to work in GraphX was a bit of a hack. The
> whole Tinkerpop2 Gremlin design was based around streaming pipes of
> data, rather than large distributed map-reduce operations. I had to hack
> the pipes to aggregate all of the data and pass a single object wrapping
> the GraphX RDDs down the pipes in a single go, rather than streaming it
> element by element.
> Just based on their description, Tinkerpop3 may be more amenable to the
> Spark platform.
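Kyle's Tinkerpop2 workaround can be pictured as follows. The names are hypothetical and Seq stands in for the GraphX RDDs: instead of streaming elements through a pipe one at a time, each pipe stage passes a single wrapper object holding the whole distributed collection.

```scala
// One pipe element wrapping the entire collection, instead of one element
// per datum as Tinkerpop2's streaming pipes expect.
final case class BulkPipeElement[T](data: Seq[T])

object PipeHack {
  // A "pipe" stage now transforms the whole collection in a single go.
  def mapStage[A, B](in: BulkPipeElement[A])(f: A => B): BulkPipeElement[B] =
    BulkPipeElement(in.data.map(f))
}
```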
>
> Kyle
>
>
> On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta 
> wrote:
>
>> What do you guys think about the Tinkerpop3 Gremlin interface?
>> It has MapReduce to run Gremlin operators in a distributed manner and
>> Giraph to execute vertex programs.
>>
>> Tinkerpop3 is better suited for GraphX.
>>
>> On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott 
>> wrote:
>>
>>> I've taken a crack at implementing the TinkerPop Blueprints API in
>>> GraphX (
>>> https://github.com/kellrott/sparkgraph ). I've also implemented
>>> portions of
>>> the Gremlin Search Language and a Parquet based graph store.
>>> I've been working out finalize some code details and putting together
>>> better code examples and documentation before I started telling people
>>> about it.
>>> But if you want to start looking at the code, I can answer any questions
>>> you have. And if you would like to contribute, I would really appreciate
>>> the help.
>>>
>>> Kyle
>>>
>>>
>>> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin 
>>> wrote:
>>>
>>> > cc Matthias
>>> >
>>> > In the past we talked with Matthias and there were some discussions
>>> about
>>> > this.
>>> >
>>> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
>>> > brennon.y...@capitalone.com>
>>> > wrote:
>>> >
>>> > > All, was wondering if there had been any discussion around this topic
>>> > yet?
>>> > > TinkerPop <https://github.com/tinkerpop> is a great abstraction for
>>> > graph
>>> > > databases and has been implemented across various graph database
>>> backends
>>> > > / gaining traction. Has anyone thought about integrating the
>>> TinkerPop
>>> > > framework with GraphX to enable GraphX as another backend? Not sure
>>> if
>>> > > this has been brought up or not, but would certainly volunteer to
>>> > > spearhead this effort if the community thinks it to be a good idea!
>>> > >
>>> > > As an aside, wasn't sure if this discussion should happen on the board
>>> > > here or on JIRA, but I made a ticket as well for reference:
>>> > > https://issues.apache.org/jira/browse/SPARK-4279

Re: Implementing TinkerPop on top of GraphX

2014-11-07 Thread Kushal Datta
I think if we are going to use GraphX as the query engine in Tinkerpop3,
then the Tinkerpop3 community is the right platform to further the
discussion.

The reason I asked the question about improving the APIs in GraphX is that
not only Gremlin but any graph DSL can exploit the GraphX APIs. Cypher has some
good subgraph-matching query interfaces which I believe can be distributed
using the GraphX APIs.

An edge ID is an internal attribute of the edge generated automatically,
mostly hidden from the user. That's why adding it as an edge property might
not be a good idea. There are several little differences like this. E.g. in
Tinkerpop3 Gremlin implementation for Giraph, only vertex programs are
executed in Giraph directly. The side-effect operators are mapped to
Map-Reduce functions. In the implementation we are talking about, all of
these operations can be done within GraphX. I will be interested to
co-develop the query engine.

@Reynold, I agree. And as I said earlier, the APIs should be designed in
such a way that they can be used by any graph DSL.

On Fri, Nov 7, 2014 at 10:59 AM, Kyle Ellrott  wrote:

> Who here would be interested in helping to work on an implementation of
> the Tikerpop3 Gremlin API for Spark? Is this something that should continue
> in the Spark discussion group, or should it migrate to the Gremlin message
> group?
>
> Reynold is right that there will be inherent mismatches in the APIs, and
> there will need to be some discussions with the GraphX group about the best
> way to go. One example would be edge ids. GraphX has vertex ids, but no
> explicit edge ids, while Gremlin has both. Edge ids could be put into the
> attr field, but then that means the user would have to explicitly subclass
> their edge attribute to the edge attribute interface. Is that worth doing,
> versus adding an id to everyone's edges?
>
> Kyle
>
>
> On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin  wrote:
>
>> Some form of graph querying support would be great to have. This can be a
>> great community project hosted outside of Spark initially, both due to the
>> maturity of the component itself as well as the maturity of query language
>> standards (there isn't really a dominant standard for graph ql).
>>
>> One thing is that GraphX API will need to evolve and probably need to
>> provide more primitives in order to support the new ql implementation.
>> There might also be inherent mismatches in the way the external API is
>> defined vs what GraphX can support. We should discuss those on a
>> case-by-case basis.
>>
>>
>> On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott 
>> wrote:
>>
>>> I think it's best to look to an existing standard rather than try to make
>>> your own. Of course small additions would need to be added to make it
>>> valuable for the Spark community, like a method similar to Gremlin's
>>> 'table' function, that produces an RDD instead.
>>> But there may be a lot of extra code and data structures that would need
>>> to be added to make it work, and those may not be directly applicable to
>>> all GraphX users. I think it would be best run as a separate module/project
>>> that builds directly on top of GraphX.
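Kyle's suggested 'table'-like addition could look roughly like this. The names are hypothetical, and Seq stands in for the RDD the real version would return: a traversal whose steps carry labelled values, projected into rows the way Gremlin's table step does.

```scala
// A row produced by the table step: label -> value for each requested column.
final case class TableRow(cols: Map[String, Any])

// A traversal result where each step holds labelled values; table() projects
// the requested labels into rows. The real version would return an RDD of
// rows instead of a local Seq.
final class LabelledTraversal(steps: Seq[Map[String, Any]]) {
  def table(columns: String*): Seq[TableRow] =
    steps.map(step => TableRow(columns.map(c => c -> step(c)).toMap))
}
```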
>>>
>>> Kyle
>>>
>>>
>>>
>>> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon <
>>> brennon.y...@capitalone.com> wrote:
>>>
>>>> My personal 2c is that, since GraphX is just beginning to provide a
>>>> full featured graph API, I think it would be better to align with the
>>>> TinkerPop group rather than roll our own. In my mind the benefits outweigh
>>>> the detriments as follows:
>>>>
>>>> Benefits:
>>>> * GraphX gains the ability to become another core tenant within the
>>>> TinkerPop community allowing a more diverse group of users into the Spark
>>>> ecosystem.
>>>> * TinkerPop can continue to maintain and own a solid / feature-rich
>>>> graph API that has already been accepted by a wide audience, relieving the
>>>> pressure of “one off” API additions from the GraphX team.
>>>> * GraphX can demonstrate its ability to be a key player in the GraphDB
>>>> space sitting inline with other major distributions (Neo4j, Titan, etc.).
>>>> * Allows for the abstract graph traversal logic (query API) to be owned
>>>> and maintained by a group already proven on the topic.
>>>>
>>>> Drawbacks:
>>>> * GraphX doesn’t own the API for its graph query capability. This could
>>>> be seen as good or bad, but it might make GraphX-specific implementation
>>>> addition