Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Corey Nolet
+1 (non-binding) - Verified signatures - Built on Mac OS X and Fedora 21. On Mon, Mar 9, 2015 at 11:01 PM, Krishna Sankar wrote: > Excellent, Thanks Xiangrui. The mystery is solved. > Cheers > > > > On Mon, Mar 9, 2015 at 3:30 PM, Xiangrui Meng wrote: > > > Krishna, I tested your linear regre

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Krishna Sankar
Excellent, Thanks Xiangrui. The mystery is solved. Cheers On Mon, Mar 9, 2015 at 3:30 PM, Xiangrui Meng wrote: > Krishna, I tested your linear regression example. For linear > regression, we changed its objective function from 1/n * \|A x - > b\|_2^2 to 1/(2n) * \|Ax - b\|_2^2 to be consistent

Re: enum-like types in Spark

2015-03-09 Thread Patrick Wendell
Does this matter for our own internal types in Spark? I don't think any of these types are designed to be used in RDD records, for instance. On Mon, Mar 9, 2015 at 6:25 PM, Aaron Davidson wrote: > Perhaps the problem with Java enums that was brought up was actually that > their hashCode is not st

Re: enum-like types in Spark

2015-03-09 Thread Aaron Davidson
Perhaps the problem with Java enums that was brought up was actually that their hashCode is not stable across JVMs, as it depends on the memory location of the enum itself. On Mon, Mar 9, 2015 at 6:15 PM, Imran Rashid wrote: > Can you expand on the serde issues w/ java enum's at all? I haven't
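The instability Aaron describes follows from `java.lang.Enum` declaring `hashCode()` final and inheriting the identity hash, which depends on where the object sits in memory. A minimal sketch (the `Status` enum is hypothetical, standing in for an internal Spark type) showing `ordinal()` and `name()` as the values that do stay stable across JVMs:

```java
// Demonstrates why a Java enum's hashCode() is unsafe as a cross-JVM key:
// java.lang.Enum does not override hashCode(), so it falls back to the
// identity hash, which can differ from run to run. ordinal() and name()
// are fixed by the enum declaration itself and are stable.
public class EnumHashDemo {
    // Hypothetical enum standing in for an internal Spark type.
    enum Status { PENDING, RUNNING, DONE }

    public static void main(String[] args) {
        for (Status s : Status.values()) {
            // The identityHash column may print differently on every JVM run;
            // ordinal and nameHash are the same every time.
            System.out.printf("%s ordinal=%d nameHash=%d identityHash=%d%n",
                s.name(), s.ordinal(), s.name().hashCode(), s.hashCode());
        }
    }
}
```

If an enum must be used as a hashed key across JVMs, hashing `name()` or `ordinal()` instead of the enum instance sidesteps the problem.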

Re: enum-like types in Spark

2015-03-09 Thread Imran Rashid
Can you expand on the serde issues with Java enums at all? I haven't heard of any problems specific to enums. The Java object serialization rules seem very clear, and it doesn't seem like different JVMs should have a choice in what they do: http://docs.oracle.com/javase/6/docs/platform/serializati

RE: Using CUDA within Spark / boosting linear algebra

2015-03-09 Thread Sam Halliday
Thanks so much for following up on this! Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware... On 9 Mar 2015 21:08, "Ulanov, Alexander" wrote: > Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the > comment that BIDMat 0.9.7

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Joseph Bradley
+1 Tested on Mac OS X On Mon, Mar 9, 2015 at 3:30 PM, Xiangrui Meng wrote: > Krishna, I tested your linear regression example. For linear > regression, we changed its objective function from 1/n * \|A x - > b\|_2^2 to 1/(2n) * \|Ax - b\|_2^2 to be consistent with common least > squares formulati

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Xiangrui Meng
Krishna, I tested your linear regression example. For linear regression, we changed its objective function from 1/n * \|A x - b\|_2^2 to 1/(2n) * \|Ax - b\|_2^2 to be consistent with common least squares formulations. It means you could reproduce the same result by multiplying the step size by 2.
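The factor-of-two relation Xiangrui mentions falls straight out of the gradients of the two objectives:

```latex
\nabla_x \left( \tfrac{1}{n} \|Ax - b\|_2^2 \right) = \tfrac{2}{n} A^\top (Ax - b),
\qquad
\nabla_x \left( \tfrac{1}{2n} \|Ax - b\|_2^2 \right) = \tfrac{1}{n} A^\top (Ax - b).
```

Halving the objective halves the gradient, so running gradient descent on the new objective with step size $2\gamma$ produces exactly the iterates the old objective gave with step size $\gamma$:

```latex
x_{k+1} = x_k - 2\gamma \cdot \tfrac{1}{n} A^\top (A x_k - b)
        = x_k - \gamma \cdot \tfrac{2}{n} A^\top (A x_k - b).
```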

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-09 Thread Andrew Ash
Does the Apache project team have any ability to measure download counts of the various releases? That data could be useful when it comes time to sunset vendor-specific releases, like CDH4 for example. On Mon, Mar 9, 2015 at 5:34 AM, Mridul Muralidharan wrote: > In ideal situation, +1 on removi

RE: Using CUDA within Spark / boosting linear algebra

2015-03-09 Thread Ulanov, Alexander
Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the support of Double in the current source code), did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https:/

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Sean Owen
I'm +1 as I have not heard of anyone else seeing the Hive test failure, which is likely a test issue rather than a code issue anyway, and not a blocker. On Fri, Mar 6, 2015 at 9:36 PM, Sean Owen wrote: > Although the problem is small, especially if indeed the essential docs > changes are following

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Kostas Sakellis
+1 on RC3 I agree that this should not block the release. Once we have a fix for it, putting it in a double dot release sounds like a good plan. Kostas On Mon, Mar 9, 2015 at 11:27 AM, Patrick Wendell wrote: > Hey All, > > Today there was a JIRA posted with an observed regression around Spar

Cross cutting internal changes to launch scripts

2015-03-09 Thread Patrick Wendell
Hey All, Marcelo Vanzin has been working on a patch for a few months that performs cross-cutting clean-up and fixes to the way that Spark's launch scripts work (including PySpark, spark-submit, the daemon scripts, etc.). The changes won't modify any public APIs in terms of how those scripts are i

RE: Loading previously serialized object to Spark

2015-03-09 Thread Ulanov, Alexander
Looking forward to use those features! Can I somehow make the model that I saved with ObjectOutputStream work with RDD map? It took 7 hours to build it :) -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Monday, March 09, 2015 12:32 PM To: Ulanov, Alexander Cc: Ak

Re: Loading previously serialized object to Spark

2015-03-09 Thread Xiangrui Meng
Well, it is the standard "hacky" way for model save/load in MLlib. We have SPARK-4587 and SPARK-5991 to provide save/load for all MLlib models, in an exchangeable format. -Xiangrui On Mon, Mar 9, 2015 at 12:25 PM, Ulanov, Alexander wrote: > Thanks so much! It works! Is it the standard way for Mll

RE: Loading previously serialized object to Spark

2015-03-09 Thread Ulanov, Alexander
Thanks so much! It works! Is it the standard way for MLlib models to be serialized? Btw, the example I pasted below works if one implements a TestSuite with MLlibTestSparkContext. -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Monday, March 09, 2015 12:10 PM To:

Re: Loading previously serialized object to Spark

2015-03-09 Thread Xiangrui Meng
Could you try `sc.objectFile` instead? sc.parallelize(Seq(model), 1).saveAsObjectFile("path") val sameModel = sc.objectFile[NaiveBayesModel]("path").first() -Xiangrui On Mon, Mar 9, 2015 at 11:52 AM, Ulanov, Alexander wrote: > Just tried, the same happens if I use the internal Spark serializer:

RE: Loading previously serialized object to Spark

2015-03-09 Thread Ulanov, Alexander
Just tried, the same happens if I use the internal Spark serializer: val serializer = SparkEnv.get.closureSerializer.newInstance -Original Message- From: Ulanov, Alexander Sent: Monday, March 09, 2015 10:37 AM To: Akhil Das Cc: dev Subject: RE: Loading previously serialized object to Sp

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Patrick Wendell
Hey All, Today there was a JIRA posted with an observed regression around Spark Streaming during certain recovery scenarios: https://issues.apache.org/jira/browse/SPARK-6222 My preference is to go ahead and ship this release (RC3) as-is and, if this issue is isolated and resolved soon, we can make a

Re: How to implement unsupervised or reinforcement algorithm in new org.apache.spark.ml

2015-03-09 Thread Joseph Bradley
Hi, There are no examples currently. For unsupervised learning, I think the pattern is straightforward. It would follow the pattern from supervised learning, but without the label input column and with a model having a different transform() behavior. Reinforcement learning might take a bit more

RE: Loading previously serialized object to Spark

2015-03-09 Thread Ulanov, Alexander
Below is the code with standard MLlib class. Apparently this issue can happen in the same Spark instance. import java.io._ import org.apache.spark.mllib.classification.NaiveBayes import org.apache.spark.mllib.classification.NaiveBayesModel import org.apache.spark.mllib.util.MLUtils val data = M
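The plain-JVM round trip the thread is debugging can be sketched without Spark at all: the saved bytes only deserialize cleanly when the reading side has a compatible class definition (and, in the `sc.objectFile` case, the right classloader). A minimal sketch with a hypothetical `ToyModel` class standing in for `NaiveBayesModel`:

```java
import java.io.*;
import java.util.Arrays;

// Hypothetical serializable model standing in for NaiveBayesModel;
// this is a sketch of the ObjectOutputStream approach from the thread,
// not the real MLlib save/load API.
class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final double[] weights;
    ToyModel(double[] weights) { this.weights = weights; }
}

public class SaveLoadDemo {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("model", ".bin");
        ToyModel model = new ToyModel(new double[] {0.1, 0.9});

        // Save: plain Java serialization of the model object.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(model);
        }

        // Load: succeeds as long as a compatible ToyModel class
        // (matching serialVersionUID and field layout) is on the classpath.
        ToyModel loaded;
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(f))) {
            loaded = (ToyModel) in.readObject();
        }
        System.out.println(Arrays.equals(model.weights, loaded.weights));
        f.delete();
    }
}
```

This is why the thread converges on `sc.objectFile`/`saveAsObjectFile` (and later SPARK-4587/SPARK-5991): the save side is easy; the failure mode is always on the load side, where the class and classloader must line up.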

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Denny Lee
+1 (non-binding) Spark Standalone and YARN on Hadoop 2.6 on OSX plus various tests (MLLib, SparkSQL, etc.) On Mon, Mar 9, 2015 at 9:18 AM Tom Graves wrote: > +1. Built from source and ran Spark on yarn on hadoop 2.6 in cluster and > client mode. > Tom > > On Thursday, March 5, 2015 8:53 PM

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Tom Graves
+1. Built from source and ran Spark on yarn on hadoop 2.6 in cluster and client mode. Tom On Thursday, March 5, 2015 8:53 PM, Patrick Wendell wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.0! The tag to be voted on is v1.3.0-rc2 (commit 4aaf48d4

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Sean McNamara
+1 Ran local tests and tested our spark apps on a spark+yarn cluster. Cheers, Sean > On Mar 8, 2015, at 11:51 PM, Sandy Ryza wrote: > > +1 (non-binding, doc and packaging issues aside) > > Built from source, ran jobs and spark-shell against a pseudo-distributed > YARN cluster. > > On Sun,

How to implement unsupervised or reinforcement algorithm in new org.apache.spark.ml

2015-03-09 Thread Egor Pahomov
Hi, I'm redoing my PR about a genetic algorithm in the new org.apache.spark.ml architecture. Do we already have some code handling unsupervised or reinforcement algorithms in the new architecture? If not, do we have some tickets on this matter? If not, do we have

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-09 Thread Mridul Muralidharan
In an ideal situation, +1 on removing all vendor-specific builds and making them just Hadoop-version specific - that is what we should depend on anyway. Though I hope Sean is correct in assuming that vendor-specific builds for hadoop 2.4 are just that; and not 2.4- or 2.4+ which cause incompatibilities for

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-09 Thread Sean Owen
Yes, you should always find working bits at Apache no matter what -- though 'no matter what' really means 'as long as you use Hadoop distro compatible with upstream Hadoop'. Even distros have a strong interest in that, since the market, the 'pie', is made large by this kind of freedom at the core.

Re: missing explanation of cache in the documentation of cluster overview

2015-03-09 Thread Sean Owen
It's explained at https://spark.apache.org/docs/latest/programming-guide.html and its configuration at https://spark.apache.org/docs/latest/configuration.html Have a read over all the docs first. On Mon, Mar 9, 2015 at 9:24 AM, Hui WANG wrote: > Hello Guys, > > I'm reading the documentation of

missing explanation of cache in the documentation of cluster overview

2015-03-09 Thread Hui WANG
Hello Guys, I'm reading the documentation of the cluster mode overview at https://spark.apache.org/docs/latest/cluster-overview.html. In the diagram, the cache is shown beside the executor but no explanation is given for it. Can someone please help explain it and improve this page? -- Hui WANG Tel : +33 (

Re: GSoC 2015

2015-03-09 Thread Akhil Das
This might help https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Mon, Mar 9, 2015 at 9:23 AM, David J. Manglano wrote: > Hi Spark devs! > > I'm writing regarding your GSoC 2015 project idea. I'm a graduate student > with experience in Python and dis