Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Ankur Dave
+1 (binding) Ankur http://www.ankurdave.com/ On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] will end on Nov 8, 2014 at 6 PM PST.

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Kushal Datta
+1 (binding) For tickets which span across multiple components, will it need to be approved by all maintainers? For example, I'm working on the Python bindings of GraphX where code is added to both Python and GraphX modules. Thanks, -Kushal. On Thu, Nov 6, 2014 at 12:02 AM, Ankur Dave

About implicit rddToPairRDDFunctions

2014-11-06 Thread Shixiong Zhu
I saw many people asked how to convert an RDD to a PairRDDFunctions. I would like to ask a question about it. Why not put the following implicit into package object rdd or object rdd? implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)]) (implicit kt: ClassTag[K], vt: ClassTag[V],
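A minimal sketch of the idea behind this question, in plain Scala with no Spark dependency. `MiniRDD`, `MiniPairFunctions` and `ImplicitScopeDemo` are hypothetical stand-ins, not Spark classes. The point it illustrates is that an implicit conversion defined in a type's companion object is part of that type's implicit scope, so callers get the extra methods without importing anything.

```scala
import scala.language.implicitConversions

// Toy stand-in for an RDD (hypothetical, not Spark's class).
final class MiniRDD[T](val items: Seq[T])

// Toy stand-in for PairRDDFunctions: operations that only make sense
// when the elements are key/value pairs.
final class MiniPairFunctions[K, V](self: MiniRDD[(K, V)]) {
  // Combine all values that share a key with the given function.
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
}

object MiniRDD {
  // Defined in the companion object, so the compiler finds this conversion
  // whenever a MiniRDD[(K, V)] is used, with no import at the call site.
  implicit def toPairFunctions[K, V](rdd: MiniRDD[(K, V)]): MiniPairFunctions[K, V] =
    new MiniPairFunctions(rdd)
}

object ImplicitScopeDemo {
  // Note: no import of MiniRDD.toPairFunctions is needed here.
  def wordCounts(words: Seq[String]): Map[String, Int] =
    new MiniRDD(words.map(w => (w, 1))).reduceByKey(_ + _)
}
```

If the conversion lived in an unrelated object instead, every call site would need an explicit import before `reduceByKey` resolved, which is the friction the message describes.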

JIRA + PR backlog

2014-11-06 Thread Sean Owen
(Different topic, indulge me one more reply --) Yes the number of JIRAs/PRs closed is unprecedented too and that deserves big praise. The project has stuck to making all changes and discussion in this public process, which is so powerful. Adjusted for the sheer inbound volume, Spark is doing a

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Imran Rashid
+1 overall, also +1 to Sandy's suggestion of getting build maintainers as well. On Wed, Nov 5, 2014 at 7:57 PM, Sandy Ryza sandy.r...@cloudera.com wrote: This seems like a good idea. An area that wasn't listed, but that I think could strongly benefit from maintainers, is the build. Having

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Jason Dai
+1 (binding) On Thu, Nov 6, 2014 at 4:02 PM, Ankur Dave ankurd...@gmail.com wrote: +1 (binding) Ankur http://www.ankurdave.com/ On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE]

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread RJ Nowling
Matei, I saw that you're listed as a maintainer for ~6 different subcomponents, and on over half of those, you're only the 2nd person. My concern is that you would be stretched thin and maybe wouldn't be able to work as a back up on all of those subcomponents. Are you planning on adding more

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Tom Graves
+1. Tom On Wednesday, November 5, 2014 9:21 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW, my own vote is obviously +1 (binding). Matei On Nov 5, 2014, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Sean McNamara
+1 Sean On Nov 5, 2014, at 6:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Debasish Das
+1 The app to track PRs based on component is a great idea... On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: +1 Sean On Nov 5, 2014, at 6:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Nick Pentreath
+1 (binding) — Sent from Mailbox On Thu, Nov 6, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com wrote: +1 The app to track PRs based on component is a great idea... On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: +1 Sean On Nov 5, 2014, at 6:32

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Josh Rosen
+1 (binding). (our pull request browsing tool is open-source, by the way; contributions welcome: https://github.com/databricks/spark-pr-dashboard) On Thu, Nov 6, 2014 at 9:28 AM, Nick Pentreath nick.pentre...@gmail.com wrote: +1 (binding) — Sent from Mailbox On Thu, Nov 6, 2014 at 6:52

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread bc Wong
Hi Matei, Good call on scaling the project itself. Identifying domain experts in different areas is a good thing. But I have some questions about the implementation. Here's my understanding of the proposal: (1) The PMC votes on a list of components and their maintainers. Changes to that list

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Matei Zaharia
Hi BC, The point is exactly to ensure that the maintainers have looked at each patch to that component and consider it to fit consistently into its architecture. The issue is not about rogue committers, it's about making sure that changes don't accidentally sneak in that we want to roll back,

Implementing TinkerPop on top of GraphX

2014-11-06 Thread York, Brennon
All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a great abstraction for graph databases and has been implemented across various graph database backends / gaining traction. Has anyone thought about integrating the TinkerPop

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Reynold Xin
cc Matthias In the past we talked with Matthias and there were some discussions about this. On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com wrote: All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kyle Ellrott
I've taken a crack at implementing the TinkerPop Blueprints API in GraphX ( https://github.com/kellrott/sparkgraph ). I've also implemented portions of the Gremlin Search Language and a Parquet based graph store. I've been working to finalize some code details and putting together better code

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread andy petrella
Great stuffs! I've got some thoughts about that, and I was wondering if it would be first interesting to have something like for spark-core (let's say): 0/ Core API offering basic (or advanced → HeLP) primitives 1/ catalyst optimizer for a text base system (SPARQL, Cypher, custom SQL3, whatnot) 2/

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kyle Ellrott
I still have to dig into the Tinkerpop3 internals (I started my work long before it had been released), but I can say that getting the Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack. The whole Tinkerpop2 Gremlin design was based around streaming pipes of data, rather than

Re: JIRA + PR backlog

2014-11-06 Thread Nicholas Chammas
I think better tooling will make it much easier for committers to trim the list of stale JIRA issues and PRs. Convenience enables action. - Spark PR Dashboard https://spark-prs.appspot.com/: Additional filters for stale PRs https://github.com/databricks/spark-pr-dashboard/issues/1 or PRs

Using partitioning to speed up queries in Shark

2014-11-06 Thread Gordon Benjamin
Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm doing and have a customers table with approximately 3 million rows that I've cached in memory. I've also created a partitioned table that I've also cached in memory on a per day basis FROM customers_cached INSERT

Re: Using partitioning to speed up queries in Shark

2014-11-06 Thread Nicholas Chammas
Did you mean to send this to the user list? This is the dev list, where we discuss things related to development on Spark itself. On Thu, Nov 6, 2014 at 5:01 PM, Gordon Benjamin gordon.benjami...@gmail.com wrote: Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm

Wrong temp directory when compressing before sending text file to S3

2014-11-06 Thread Gary Malouf
We have some data that we are exporting from our HDFS cluster to S3 with some help from Spark. The final RDD command we run is: csvData.saveAsTextFile("s3n://data/mess/2014/11/dump-oct-30-to-nov-5-gzip", classOf[GzipCodec]) We have our 'spark.local.dir' set to our large ephemeral partition on

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread bc Wong
On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia matei.zaha...@gmail.com wrote: snip Ultimately, the core motivation is that the project has grown to the point where it's hard to expect every committer to have full understanding of every component. Some committers know a ton about systems but

Re: Python3 and spark 1.1.0

2014-11-06 Thread Jeremy Freeman
Currently, Spark 1.1.0 works with Python 2.6 or higher, but not Python 3. There does seem to be interest, see also this post (http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-on-python-3-td15706.html). I believe Ariel Rokem (cced) has been trying to get it to work and might be working

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread York, Brennon
This was my thought exactly with the TinkerPop3 release. Looks like, to move this forward, we’d need to implement gremlin-core per http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core. The real question lies in whether GraphX can only support the OLTP functionality, or if we can

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kyle Ellrott
I think I've already done most of the work for the OLTP objects (Graph, Element, Vertex, Edge, Properties) when implementing Tinkerpop2. Singleton write operations, like addVertex/deleteEdge, were cached locally until a read operation was requested, then the set of build operations were

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Patrick Wendell
I think new committers might or might not be maintainers (it would depend on the PMC vote). I don't think it would affect what you could merge, you can merge in any part of the source tree, you just need to get sign off if you want to touch a public API or make major architectural changes. Most

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kushal Datta
Before we dive into the implementation details, what are the high level thoughts on Gremlin/GraphX? Scala already provides the procedural way to query graphs in GraphX today. So, today I can run g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3 Gremlin, of course sans the

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Hari Shreedharan
In Cloudstack, I believe one becomes a maintainer first for a subset of modules, before he/she becomes a proven maintainer who has commit rights on the entire source tree. So would it make sense to go that route, and have committers voted in as maintainers for certain parts of the

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Greg Stein
-1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Patrick Wendell
Hey Greg, Regarding subversion - I think the reference is to partial vs full committers here: https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Patrick Wendell
In fact, if you look at the subversion committer list, the majority of people here have commit access only for particular areas of the project: http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Greg,

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread York, Brennon
My personal 2c is that, since GraphX is just beginning to provide a full featured graph API, I think it would be better to align with the TinkerPop group rather than roll our own. In my mind the benefits outweigh the detriments as follows: Benefits: * GraphX gains the ability to become another

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
I reproduced the problem in mllib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map{case(user,product) => val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find{x=>x.product

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
The ALS model contains RDDs, so you cannot put `model.recommendProducts` inside an RDD closure such as `userProductsRDD.map`. -Xiangrui On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote: I reproduced the problem in mllib tests ALSSuite.scala using the following functions:

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
+1 (non-binding) [for original process proposal] Greg, the first time I've seen the word "ownership" on this thread is in your message. The first time the word "lead" has appeared in this thread is in your message as well. I don't think that was the intent. The PMC and Committers have a

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Greg Stein
Partial committers are people invited to work on a particular area, and they do not require sign-off to work on that area. They can get a sign-off and commit outside that area. That approach doesn't compare to this proposal. Full committers are PMC members. As each PMC member is responsible for

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Matei Zaharia
So I don't understand, Greg, are the partial committers committers, or are they not? Spark also has a PMC, but our PMC currently consists of all committers (we decided not to have a differentiation when we left the incubator). I see the Subversion partial committers listed as committers on

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
model.recommendProducts can only be called from the master then? I have a set of 20% users on whom I am performing the test...the 20% users are in an RDD...if I have to collect them all to master node and then call model.recommendProducts, that's an issue... Any idea how to optimize this so that

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066 The easiest case is when one side is small. If both sides are large, this is a super-expensive operation. We can do block-wise cross product and then find top-k for each user. Best, Xiangrui On Thu, Nov 6, 2014 at 4:51 PM,
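The block-wise idea mentioned in this reply could be sketched roughly as follows, using plain Scala collections rather than Spark. All names here (`BlockwiseTopK`, `recommend`, `blockSize`) are hypothetical, and scoring a user/product pair by the dot product of their factor vectors is an assumption made for the illustration. It is not the implementation tracked in SPARK-3066.

```scala
object BlockwiseTopK {
  type Factors = Array[Double]

  // Dot product of two factor vectors of equal rank.
  private def dot(a: Factors, b: Factors): Double = {
    require(a.length == b.length, "factor vectors must have the same rank")
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  // Keep the k highest-scoring (productId, score) pairs, best first.
  private def topK(scored: Seq[(Int, Double)], k: Int): Seq[(Int, Double)] =
    scored.sortBy { case (_, score) => -score }.take(k)

  // For every user, score the products one block at a time and carry only
  // the running top-k between blocks, so the full user x product score
  // matrix is never held at once.
  def recommend(
      users: Seq[(Int, Factors)],
      products: Seq[(Int, Factors)],
      k: Int,
      blockSize: Int): Map[Int, Seq[(Int, Double)]] = {
    require(blockSize > 0, "blockSize must be positive")
    val productBlocks = products.grouped(blockSize).toSeq
    users.map { case (userId, userFactors) =>
      val best = productBlocks.foldLeft(Seq.empty[(Int, Double)]) { (acc, block) =>
        val scored = block.map { case (pid, pf) => (pid, dot(userFactors, pf)) }
        topK(acc ++ scored, k)
      }
      userId -> best
    }.toMap
  }
}
```

In a distributed setting the same merge step would presumably run per partition or per block pair. The sketch only shows why memory stays bounded by `k` per user plus one block of scores.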

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
PMC [1] is responsible for oversight and does not designate partial or full committer. There are projects where all committers become PMC and others where PMC is reserved for committers with the most merit (and willingness to take on the responsibility of project oversight, releases, etc...).

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Sandy Ryza
It looks like the difference between the proposed Spark model and the CloudStack / SVN model is: * In the former, maintainers / partial committers are a way of centralizing oversight over particular components among committers * In the latter, maintainers / partial committers are a way of giving

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Kyle Ellrott
I think it's best to look to an existing standard rather than try to make your own. Of course small additions would need to be added to make it valuable for the Spark community, like a method similar to Gremlin's 'table' function, that produces an RDD instead. But there may be a lot of extra code and

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Reynold Xin
Some form of graph querying support would be great to have. This can be a great community project hosted outside of Spark initially, both due to the maturity of the component itself as well as the maturity of query language standards (there isn't really a dominant standard for graph ql). One

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Cody Koeninger
My 2 cents: Spark since pre-Apache days has been the most friendly and welcoming open source project I've seen, and that's reflected in its success. It seems pretty obvious to me that, for example, Michael should be looking at major changes to the SQL codebase. I trust him to do that in a way

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
I'm actually going to change my non-binding to +0 for the proposal as-is. I overlooked some parts of the original proposal that, when reading over them again, do not sit well with me. "One of the maintainers needs to sign off on each patch to the component", as Greg has pointed out, does seem to

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Greg Stein
[ I'm going to try and pull a couple thread directions into this one, to avoid explosion :-) ] On Thu, Nov 6, 2014 at 6:44 PM, Corey Nolet cjno...@gmail.com wrote: Note: I'm going to use you generically; I understand you [Corey] are not a PMC member, at this time. +1 (non-binding) [for original

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Greg Stein
On Thu, Nov 6, 2014 at 7:28 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It looks like the difference between the proposed Spark model and the CloudStack / SVN model is: * In the former, maintainers / partial committers are a way of centralizing oversight over particular components among

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Matei Zaharia
Alright, Greg, I think I understand how Subversion's model is different, which is that the PMC members are all full committers. However, I still think that the model proposed here is purely organizational (how the PMC and committers organize themselves), and in no way changes peoples' ownership

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Greg Stein
[last reply for tonite; let others read; and after the next drink or three, I shouldn't be replying...] On Thu, Nov 6, 2014 at 11:38 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Alright, Greg, I think I understand how Subversion's model is different, which is that the PMC members are all

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Reynold Xin
Greg, Thanks a lot for commenting on this, but I feel we are splitting hairs here. Matei did mention "-1", followed by "or give feedback". The original process outlined by Matei was exactly about review, rather than fighting. Nobody wants to spend their energy fighting. Everybody is doing it to

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Vinod Kumar Vavilapalli
With the maintainer model, the process is as follows: - Any committer could review the patch and merge it, but they would need to forward it to me (or another core API maintainer) to make sure we also approve - At any point during this process, I could come in and -1 it, or give feedback