Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Reynold Xin
Alright I have merged the patch ( https://github.com/apache/spark/pull/4173 ) since I don't see any strong opinions against it (as a matter of fact most were for it). We can still change it if somebody lays out a strong argument. On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia wrote: > The type

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Reynold Xin
+1 Tested on Mac OS X On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar wrote: > +1 > 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min > mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 > -Dhadoop.version=2.6.0 -Phive -DskipTests > 2. Tested pyspark, MLlib - running as wel

Re: Friendly reminder/request to help with reviews!

2015-01-27 Thread Gurumurthy Yeleswarapu
Hi Patrick: I would love to help reviewing in any way I can. I'm fairly new here. Can you help with a pointer to get me started? Thanks From: Patrick Wendell To: "dev@spark.apache.org" Sent: Tuesday, January 27, 2015 3:56 PM Subject: Friendly reminder/request to help with reviews! H

Friendly reminder/request to help with reviews!

2015-01-27 Thread Patrick Wendell
Hey All, Just a reminder, as always around release time we have a very large volume of patches showing up near the deadline. One thing that can help us maximize the number of patches we get in is to have community involvement in performing code reviews. And in particular, doing a thorough review and

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Krishna Sankar
+1 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests 2. Tested pyspark, MLlib - running as well as compare results with 1.1.x & 1.2.0 2.1. statistics OK 2.2. Linear/Ridge/Lasso Regression

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Matei Zaharia
The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type. Matei > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho > wrote: > > Reynold, > But with type alias we will hav
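Matei's point can be sketched in plain Scala. This is a hypothetical stand-in (the `DataFrame` class and method names here are illustrative, not the real Spark API): with a type alias, methods written against either name accept the same values with no conversion.

```scala
// Hypothetical sketch: DataFrame is a stand-in class, not Spark's real one.
class DataFrame(val numRows: Int)

// The alias: SchemaRDD is just another name for the same type.
type SchemaRDD = DataFrame

// A method written against the old name accepts the new type...
def countRows(rdd: SchemaRDD): Int = rdd.numRows

// ...and a method written against the new name accepts values typed as SchemaRDD.
def describe(df: DataFrame): String = s"${df.numRows} rows"

val df: DataFrame = new DataFrame(3)
val srdd: SchemaRDD = df // no conversion needed: same type, two names
```

Scaladoc and IDEs will display whichever name the declaration uses, which is why the docs show `DataFrame` even though old `SchemaRDD`-typed code keeps compiling.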

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Dirceu Semighini Filho
Reynold, But with a type alias we will have the same problem, right? If the methods don't receive SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame. Unless we have an implicit conversion between DataFrame and SchemaRDD 2015-01-27 17:18 GMT-02:00 Reynold Xin :

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Dmitriy Lyubimov
It has been pretty evident for some time that's what it is, hasn't it? Yes that's a better name IMO. On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin wrote: > Hi, > > We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to > get the community's opinion. > > The context is that Sche

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Koert Kuipers
That's great. Guess I was looking at a somewhat stale master branch... On Tue, Jan 27, 2015 at 2:19 PM, Reynold Xin wrote: > Koert, > > As Mark said, I have already refactored the API so that nothing in > catalyst is exposed (and users won't need them anyway). Data types, Row > interfaces are bot

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Reynold Xin
Koert, As Mark said, I have already refactored the API so that nothing in catalyst is exposed (and users won't need them anyway). Data types, Row interfaces are both outside the catalyst package and in org.apache.spark.sql. On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers wrote: > hey matei, > i thin

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Reynold Xin
Dirceu, That is not possible because one cannot overload return types. SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame. In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibi

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Okay - we've resolved all issues with the signatures and keys. However, I'll leave the current vote open for a bit to solicit additional feedback. On Tue, Jan 27, 2015 at 10:43 AM, Sean McNamara wrote: > Sounds good, that makes sense. > > Cheers, > > Sean > >> On Jan 27, 2015, at 11:35 AM, Patric

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Sean McNamara
Sounds good, that makes sense. Cheers, Sean > On Jan 27, 2015, at 11:35 AM, Patrick Wendell wrote: > > Hey Sean, > > Right now we don't publish every 2.11 binary to avoid combinatorial > explosion of the number of build artifacts we publish (there are other > parameters such as whether hive i

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Hey Sean, Right now we don't publish every 2.11 binary to avoid combinatorial explosion of the number of build artifacts we publish (there are other parameters such as whether hive is included, etc). We can revisit this in future feature releases, but .1 releases like this are reserved for bug fix

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Sean McNamara
We’re using Spark on Scala 2.11 w/ Hadoop 2.4. Would it be practical / make sense to build a bin version of Spark against Scala 2.11 for versions other than just hadoop1 at this time? Cheers, Sean > On Jan 27, 2015, at 12:04 AM, Patrick Wendell wrote: > > Please vote on releasing the follow

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Yes - the key issue is just due to me creating new keys this time around. Anyways let's take another stab at this. In the meantime, please don't hesitate to test the release itself. - Patrick On Tue, Jan 27, 2015 at 10:00 AM, Sean Owen wrote: > Got it. Ignore the SHA512 issue since these aren't

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Sean Owen
Got it. Ignore the SHA512 issue since these aren't somehow expected by a policy or Maven to be in a certain format. Just wondered if the difference was intended. The Maven way of generating the SHA1 hashes is to set this on the install plugin, AFAIK, although I'm not sure if the intent was to hash

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Michael Malak
I personally have no preference DataFrame vs. DataTable, but only wish to lay out the history and etymology simply because I'm into that sort of thing. "Frame" comes from Marvin Minsky's 1970's AI construct: "slots" and the data that go in them. The S programming language (precursor to R) adopte

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Hey Sean, The release script generates hashes in two places (take a look a bit further down in the script), one for the published artifacts and the other for the binaries. In the case of the binaries we use SHA512 because, AFAIK, the ASF does not require you to use SHA1 and SHA512 is better. In th

Re: Maximum size of vector that reduce can handle

2015-01-27 Thread Boromir Widas
I am running into this issue as well, when storing large Arrays as the value in a kv pair and then doing a reduceByKey. Can one of the experts please comment if it would make sense to add an operation to add values in place like accumulators do - this would essentially merge the vectors for a given
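The in-place merge Boromir suggests can be sketched in plain Scala. This is a hypothetical sketch (the name `addInPlace` and its use as a reduce function are assumptions, not an existing Spark API): the accumulator array is mutated and reused instead of allocating a fresh result vector on every merge step.

```scala
// Hypothetical sketch: merge the second array into the first in place,
// avoiding a fresh result-vector allocation on every reduce step.
def addInPlace(acc: Array[Double], v: Array[Double]): Array[Double] = {
  require(acc.length == v.length, "vectors must have the same dimension")
  var i = 0
  while (i < acc.length) {
    acc(i) += v(i) // mutate the accumulator, accumulator-style
    i += 1
  }
  acc // same array instance is returned and reused
}
```

In a real job this would be the merge function handed to something like reduceByKey or aggregateByKey, with the caveat that in-place mutation is only safe when the accumulator array is owned by the reduce (e.g. created by a seqOp/zero value), not shared with cached input data.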

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Mark Hamstra
In master, Reynold has already taken care of moving Row into org.apache.spark.sql; so, even though the implementation of Row (and GenericRow et al.) is in Catalyst (which is more optimizer than parser), that needn't be of concern to users of the API in its most recent state. On Tue, Jan 27, 2015 a

UnknownHostException while running YarnTestSuite

2015-01-27 Thread Iulian Dragoș
Hi, I’m trying to run the Spark test suite on an EC2 instance, but I can’t get Yarn tests to pass. The hostname I get on that machine is not resolvable, but adding a line in /etc/hosts makes the other tests pass, except for Yarn tests. Any help is greatly appreciated! thanks, iulian ubuntu@ip-1

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Koert Kuipers
hey matei, i think that stuff such as SchemaRDD, columnar storage and perhaps also query planning can be re-used by many systems that do analysis on structured data. i can imagine panda-like systems, but also datalog or scalding-like (which we use at tresata and i might rebase on SchemaRDD at some p

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Evan R. Sparks
I'm +1 on this, although a little worried about unknowingly introducing SparkSQL dependencies every time someone wants to use this. It would be great if the interface can be abstract and the implementation (in this case, SparkSQL backend) could be swapped out. One alternative suggestion on the nam

Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-27 Thread Evan R. Sparks
Hmm... Scaler and Scalar are very close together both in terms of pronunciation and spelling - and I wouldn't want to create confusion between the two. Further - this operation (elementwise multiplication by a static vector) is general enough that maybe it should have a more general name? On Tue,

Re: Maximum size of vector that reduce can handle

2015-01-27 Thread Xiangrui Meng
A 60m-element vector costs 480MB of memory. You have 12 of them to be reduced at the driver. So you need ~6GB of memory, not counting the temp vectors generated from '_+_'. You need to increase driver memory to make it work. That being said, ~10^7 hits the limit for the current impl of GLM. -Xiangrui On Jan 23, 201
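Xiangrui's arithmetic can be checked directly (assuming 8 bytes per Double and ignoring JVM object overhead, which only makes the real footprint larger):

```scala
// Back-of-the-envelope check of the numbers above:
// 8 bytes per Double, JVM object overhead ignored.
val dim = 60L * 1000 * 1000              // 60m-element vector
val bytesPerVector = dim * 8L            // 480,000,000 bytes ~ 480MB
val partitions = 12                      // 12 partial results reduced at the driver
val totalBytes = bytesPerVector * partitions
val totalGB = totalBytes / math.pow(1000, 3) // ~5.76 GB, i.e. the ~6GB quoted
```

The temp vectors produced by `_ + _` come on top of this, since each pairwise sum allocates a fresh result before the operands can be garbage-collected.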

Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-27 Thread Xiangrui Meng
I would call it Scaler. You might want to add it to the spark.ml pipeline api. Please check the spark.ml.HashingTF implementation. Note that this should handle sparse vectors efficiently. Hadamard and FFTs are quite useful. If you are interested, make sure that we call an FFT library that is licen
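The core of the proposed transformer is a component-wise (Hadamard) product against a fixed weight vector. A minimal dense-only sketch, assuming a hypothetical `ElementwiseScaler` class (the real spark.ml transformer would follow the pipeline API and, as Xiangrui notes, need an efficient sparse-vector path):

```scala
// Hypothetical sketch of the proposed transformer's core operation:
// multiply each component of the input by the matching weight.
class ElementwiseScaler(weights: Array[Double]) {
  def transform(v: Array[Double]): Array[Double] = {
    require(v.length == weights.length, "dimension mismatch")
    Array.tabulate(v.length)(i => v(i) * weights(i)) // Hadamard product
  }
}
```

For a sparse input only the non-zero components need to be touched, since scaling never turns a zero into a non-zero, which is what makes a dedicated sparse path worthwhile.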

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Dirceu Semighini Filho
Can't the SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1) for example, and the new code be added to DataFrame? With this, we don't impact existing code for the next few releases. 2015-01-27 0:02 GMT-02:00 Kushal Datta : > I want to address the issue tha
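The migration path Dirceu describes can be sketched with a deprecated type alias. This is a hypothetical sketch (the `DataFrame` class, the message, and the version string are illustrative): old code keeps compiling, with a deprecation warning, until the alias is dropped in a later release.

```scala
// Hypothetical sketch of the deprecation window: the old name survives
// as an alias so existing code compiles until the alias is removed.
class DataFrame

@deprecated("use DataFrame instead", "1.3.0")
type SchemaRDD = DataFrame

// Old code keeps working (and the compiler flags the legacy name):
val legacy: SchemaRDD = new DataFrame
```

Because the alias is the same type at runtime, no bytecode-level migration is needed; removing the alias later is purely a source-compatibility break.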

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Sean Owen
I think there are several signing / hash issues that should be fixed before this release. Hashes: http://issues.apache.org/jira/browse/SPARK-5308 https://github.com/apache/spark/pull/4161 The hashes here are correct, but have two issues: As noted in the JIRA, the format of the hash file is "non

Re: Use mvn to build Spark 1.2.0 failed

2015-01-27 Thread Sean Owen
You certainly do not need to build Spark as root. It might clumsily overcome a permissions problem in your local env but probably causes other problems. On Jan 27, 2015 11:18 AM, "angel__" wrote: > I had that problem when I tried to build Spark 1.2. I don't exactly know > what > is causing it, bu

Re: Use mvn to build Spark 1.2.0 failed

2015-01-27 Thread angel__
I had that problem when I tried to build Spark 1.2. I don't exactly know what is causing it, but I guess it might have something to do with user permissions. I could finally fix this by building Spark as "root" user (now I'm dealing with another problem, but ...that's another story...) -- View

Re: talk on interface design

2015-01-27 Thread Reynold Xin
Thanks, Andrew. That's great material. On Mon, Jan 26, 2015 at 10:23 PM, Andrew Ash wrote: > In addition to the references you have at the end of the presentation, > there's a great set of practical examples based on the learnings from Qt > posted here: http://www21.in.tum.de/~blanchet/api-desi