Re: GraphX triplets on 5-node graph

2014-05-28 Thread Reynold Xin
Take a look at this one: https://issues.apache.org/jira/browse/SPARK-1188
It was an optimization that added user inconvenience. We have since removed it in Spark 1.0.
On Wed, May 28, 2014 at 11:48 PM, Michael Malak wrote:
> Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL)

Re: Suggestion: RDD cache depth

2014-05-28 Thread Matei Zaharia
This is a pretty cool idea — instead of cache depth I’d call it something like reference counting. Would you mind opening a JIRA issue about it? The issue of really composing together libraries that use RDDs nicely isn’t fully explored, but this is certainly one thing that would help with it. I’

GraphX triplets on 5-node graph

2014-05-28 Thread Michael Malak
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I missing something fundamental?
val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5")))
val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E

Suggestion: RDD cache depth

2014-05-28 Thread innowireless TaeYun Kim
It would be nice if the RDD cache() method incorporated depth information. That is,
void test() {
    JavaRDD<.> rdd = .;
    rdd.cache();   // to depth 1. actual caching happens.
    rdd.cache();   // to depth 2. Nop as long as the storage level is the same. Else, exception.
    .
    rdd.uncache(); // t
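The depth idea above (which the reply in this thread calls reference counting) can be sketched in plain Java without any Spark dependency. This is only an illustration under stated assumptions: `RefCountedCache` is a hypothetical class and not part of the Spark API, the storage level is modeled as a plain `String`, and the behavior of `uncache()` is a guess (decrement, and release only at depth 0) because the original message is cut off at that point.

```java
// Hypothetical illustration of depth-counted caching; not a Spark class.
final class RefCountedCache {
    private int depth = 0;               // number of outstanding cache() calls
    private String storageLevel = null;  // level chosen by the first cache() call
    private boolean materialized = false;

    /** First call actually caches; later calls only increase the depth. */
    void cache(String level) {
        if (level == null) {
            throw new IllegalArgumentException("storage level must not be null");
        }
        if (depth > 0 && !level.equals(storageLevel)) {
            // Same rule as in the proposal: a different storage level is an error.
            throw new IllegalStateException(
                "already cached at " + storageLevel + ", requested " + level);
        }
        if (depth == 0) {
            storageLevel = level;
            materialized = true;         // stands in for the real caching work
        }
        depth++;
    }

    /** Assumed semantics: decrement the depth and release only when it reaches 0. */
    void uncache() {
        if (depth == 0) {
            throw new IllegalStateException("uncache() without a matching cache()");
        }
        depth--;
        if (depth == 0) {
            materialized = false;        // stands in for the real unpersist work
            storageLevel = null;
        }
    }

    int depth() { return depth; }

    boolean isMaterialized() { return materialized; }
}
```

With this shape, a library can call `cache()` and `uncache()` around its own use of a dataset without dropping a cache that its caller still relies on.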

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Kevin Markey
+1
Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0
Ran current version of one of my applications on 1-node pseudocluster (sorry, unable to test on full cluster).
yarn-cluster mode
Ran regression tests.
Thanks
Kevin
On 05/28/2014 09:55 PM, Krishna Sankar wrote:
+1 Pulled & built on MacOS X,

Re: Standard preprocessing/scaling

2014-05-28 Thread DB Tsai
Sometimes in this case I just standardize without centering, and I still get good results. Sent from my Google Nexus 5 On May 28, 2014 7:03 PM, "Xiangrui Meng" wrote: > RowMatrix has a method to compute column summary statistics. There is > a trade-off here because centering may densify th

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Krishna Sankar
+1 Pulled & built on MacOS X, EC2 Amazon Linux Ran test programs on OS X, 5 node c3.4xlarge cluster Cheers On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski wrote: > +1 > On May 28, 2014 7:05 PM, "Xiangrui Meng" wrote: > > > +1 > > > > Tested apps with standalone client mode and yarn cluster and

Re: LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Christopher Nguyen
Bharath, (apologies if you're already familiar with the theory): the proposed approach may or may not be appropriate depending on the overall transfer function in your data. In general, a single logistic regressor cannot approximate arbitrary non-linear functions (of linear combinations of the inpu

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Andy Konwinski
+1 On May 28, 2014 7:05 PM, "Xiangrui Meng" wrote: > +1 > > Tested apps with standalone client mode and yarn cluster and client modes. > > Xiangrui > > On Wed, May 28, 2014 at 1:07 PM, Sean McNamara > wrote: > > Pulled down, compiled, and tested examples on OS X and Ubuntu. > > Deployed app we a

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Xiangrui Meng
+1 Tested apps with standalone client mode and yarn cluster and client modes. Xiangrui On Wed, May 28, 2014 at 1:07 PM, Sean McNamara wrote: > Pulled down, compiled, and tested examples on OS X and Ubuntu. > Deployed app we are building on Spark and poured data through it. > > +1 > > Sean > > >

Re: Standard preprocessing/scaling

2014-05-28 Thread Xiangrui Meng
RowMatrix has a method to compute column summary statistics. There is a trade-off here because centering may densify the data. A utility function that centers data would be useful for dense datasets. -Xiangrui On Wed, May 28, 2014 at 5:03 AM, dataginjaninja wrote: > I searched on this, but didn't
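The trade-off mentioned above can be made concrete with a small sketch: dividing each stored value by its column's standard deviation leaves implicit zeros at zero, while subtracting a non-zero column mean would turn them into non-zero entries. The class and method names below are hypothetical, the sparse vector is modeled as parallel index and value arrays rather than any MLlib type, and leaving zero-variance columns unchanged is an assumption made for the example.

```java
// Hypothetical helper; not an MLlib API.
final class SparseScaling {
    /**
     * Scales the stored entries of a sparse vector by per-column standard deviations.
     * No mean is subtracted, so entries that are absent (implicit zeros) stay absent.
     *
     * @param indices   column index of each stored value
     * @param values    stored (non-zero) values
     * @param columnStd standard deviation of every column, indexed by column
     */
    static double[] scaleWithoutCentering(int[] indices, double[] values, double[] columnStd) {
        double[] scaled = new double[values.length];
        for (int k = 0; k < values.length; k++) {
            double sd = columnStd[indices[k]];
            // Assumption: a zero-variance column is left unscaled to avoid dividing by zero.
            scaled[k] = (sd == 0.0) ? values[k] : values[k] / sd;
        }
        return scaled;
    }
}
```

The result has exactly as many stored entries as the input, so the sparsity pattern is preserved.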

Re: LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Xiangrui Meng
Please find my comments inline. -Xiangrui On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar wrote: > I'm looking to reuse the LogisticRegression model (with SGD) to predict a > real-valued outcome variable. (I understand that logistic regression is > generally applied to predict binary outcome

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Tom Graves
+1. Tested Spark on YARN (cluster mode, client mode, PySpark, spark-shell) on Hadoop 0.23 and 2.4. Tom On Wednesday, May 28, 2014 3:07 PM, Sean McNamara wrote: Pulled down, compiled, and tested examples on OS X and Ubuntu. Deployed app we are building on Spark and poured data through it.

Re: Kryo serialization for closures: a workaround

2014-05-28 Thread Will Benton
This is an interesting approach, Nilesh! Someone will correct me if I'm wrong, but I don't think this could go into ClosureCleaner as a default behavior (since Kryo apparently breaks on some classes that depend on custom Java serializers, as has come up on the list recently). But it does seem

ContextCleaner, weak references, and serialization

2014-05-28 Thread Will Benton
Friends, For context (so to speak), I did some work in the 0.9 timeframe to fix SPARK-897 (provide immediate feedback when closures aren't serializable) and SPARK-729 (make sure that free variables in closures are captured when the RDD transformations are declared). I currently have a branch

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Sean McNamara
Pulled down, compiled, and tested examples on OS X and Ubuntu. Deployed the app we are building on Spark and poured data through it. +1 Sean On May 26, 2014, at 8:39 AM, Tathagata Das wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.0.0! > > This has a few i

LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Bharath Ravi Kumar
I'm looking to reuse the LogisticRegression model (with SGD) to predict a real-valued outcome variable. (I understand that logistic regression is generally applied to predict binary outcome, but for various reasons, this model suits our needs better than LinearRegression). Related to that I have th

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Henry Saputra
NOTICE and LICENSE files look good.
Signatures look good.
Hashes look good.
No external executables in the source distributions.
Source compiled with sbt.
Ran local and standalone examples; they look good.
+1
- Henry
On Mon, May 26, 2014 at 7:38 AM, Tathagata Das wrote:
> Please vote on releasing the fo

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Will Benton
+1 I made the necessary interface changes to my apps that use MLlib and tested all of my code against rc11 on Fedora 20 and OS X 10.9.3. (The Fedora Rawhide package remains at 0.9.1 pending some additional dependency packaging work.) best, wb - Original Message - > From: "Tathagata

Standard preprocessing/scaling

2014-05-28 Thread dataginjaninja
I searched on this, but didn't find anything general so I apologize if this has been addressed. Many algorithms (SGD, SVM...) either will not converge or will run forever if the data is not scaled. Sci-kit has preprocessing

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Nick Pentreath
+1 Built and tested locally on Mac OS X Built and tested on AWS Ubuntu, with and without Hive support Ran production jobs including MLlib and SparkSQL/HiveContext successfully On Wed, May 28, 2014 at 1:09 AM, Holden Karau wrote: > +1 (I did some very basic testing with PySpark & Pandas on rc11)

Re: FYI -- javax.servlet dependency issue workaround

2014-05-28 Thread Sean Owen
This class was introduced in Servlet 3.0. We have in the dependency tree some references to Servlet 2.5 and Servlet 3.0. The latter is a superset of the former. So we standardized on depending on Servlet 3.0. At least, that seems to have been successful in the Maven build, but this is just evidenc