Re: FYI -- javax.servlet dependency issue workaround

2014-05-28 Thread Sean Owen
This class was introduced in Servlet 3.0. We have in the dependency tree some references to Servlet 2.5 and Servlet 3.0. The latter is a superset of the former. So we standardized on depending on Servlet 3.0. At least, that seems to have been successful in the Maven build, but this is just

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Nick Pentreath
+1. Built and tested locally on Mac OS X. Built and tested on AWS Ubuntu, with and without Hive support. Ran production jobs including MLlib and SparkSQL/HiveContext successfully. On Wed, May 28, 2014 at 1:09 AM, Holden Karau hol...@pigscanfly.ca wrote: +1 (I did some very basic testing with

Standard preprocessing/scaling

2014-05-28 Thread dataginjaninja
I searched on this, but didn't find anything general, so I apologize if this has been addressed. Many algorithms (SGD, SVM...) either will not converge or will run forever if the data is not scaled. scikit-learn has preprocessing

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Will Benton
+1 I made the necessary interface changes to my apps that use MLlib and tested all of my code against rc11 on Fedora 20 and OS X 10.9.3. (The Fedora Rawhide package remains at 0.9.1 pending some additional dependency packaging work.) best, wb - Original Message - From: Tathagata

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Henry Saputra
NOTICE and LICENSE files look good. Signatures look good. Hashes look good. No external executables in the source distributions. Source compiled with sbt. Run local and standalone examples look good. +1 - Henry On Mon, May 26, 2014 at 7:38 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Bharath Ravi Kumar
I'm looking to reuse the LogisticRegression model (with SGD) to predict a real-valued outcome variable. (I understand that logistic regression is generally applied to predict binary outcome, but for various reasons, this model suits our needs better than LinearRegression). Related to that I have

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Sean McNamara
Pulled down, compiled, and tested examples on OS X and ubuntu. Deployed app we are building on spark and poured data through it. +1 Sean On May 26, 2014, at 8:39 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version

ContextCleaner, weak references, and serialization

2014-05-28 Thread Will Benton
Friends, For context (so to speak), I did some work in the 0.9 timeframe to fix SPARK-897 (provide immediate feedback when closures aren't serializable) and SPARK-729 (make sure that free variables in closures are captured when the RDD transformations are declared). I currently have a branch

Re: Kryo serialization for closures: a workaround

2014-05-28 Thread Will Benton
This is an interesting approach, Nilesh! Someone will correct me if I'm wrong, but I don't think this could go into ClosureCleaner as a default behavior (since Kryo apparently breaks on some classes that depend on custom Java serializers, as has come up on the list recently). But it does seem

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Tom Graves
+1. Tested Spark on YARN (cluster mode, client mode, pyspark, spark-shell) on Hadoop 0.23 and 2.4. Tom On Wednesday, May 28, 2014 3:07 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Pulled down, compiled, and tested examples on OS X and ubuntu. Deployed app we are building on spark

Re: LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Xiangrui Meng
Please find my comments inline. -Xiangrui On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: I'm looking to reuse the LogisticRegression model (with SGD) to predict a real-valued outcome variable. (I understand that logistic regression is generally applied to

Re: Standard preprocessing/scaling

2014-05-28 Thread Xiangrui Meng
RowMatrix has a method to compute column summary statistics. There is a trade-off here because centering may densify the data. A utility function that centers data would be useful for dense datasets. -Xiangrui On Wed, May 28, 2014 at 5:03 AM, dataginjaninja rickett.stepha...@gmail.com wrote: I
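
The trade-off mentioned above can be illustrated with a small sketch. This is plain Python, not MLlib's RowMatrix API; the function names (`column_stats`, `center`) and the toy data are hypothetical, chosen only to show why subtracting column means turns a sparse dataset into a dense one.

```python
# Illustrative sketch only: plain Python, not MLlib's RowMatrix API.
# Rows are sparse: a dict mapping column index -> nonzero value.

def column_stats(rows, num_cols):
    """Return per-column (means, sample variances), visiting only stored entries."""
    n = len(rows)
    sums = [0.0] * num_cols
    sq_sums = [0.0] * num_cols
    for row in rows:
        for j, v in row.items():
            sums[j] += v
            sq_sums[j] += v * v
    means = [s / n for s in sums]
    variances = [
        (sq - n * m * m) / (n - 1) if n > 1 else 0.0
        for sq, m in zip(sq_sums, means)
    ]
    return means, variances


def center(rows, means):
    """Subtract column means. Implicit zeros become -mean, so the result is dense."""
    return [[row.get(j, 0.0) - m for j, m in enumerate(means)] for row in rows]


# Hypothetical toy data: 4 rows, 3 columns, only 4 stored values.
rows = [{0: 2.0}, {1: 4.0}, {0: 4.0, 2: 1.0}, {}]
means, variances = column_stats(rows, 3)
centered = center(rows, means)

stored_before = sum(len(row) for row in rows)
nonzero_after = sum(1 for row in centered for v in row if v != 0.0)
print(stored_before, nonzero_after)
```

With this toy input, every entry of the centered data is nonzero, which is the densification cost for sparse datasets.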

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Xiangrui Meng
+1 Tested apps with standalone client mode and yarn cluster and client modes. Xiangrui On Wed, May 28, 2014 at 1:07 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Pulled down, compiled, and tested examples on OS X and ubuntu. Deployed app we are building on spark and poured data through

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Andy Konwinski
+1 On May 28, 2014 7:05 PM, Xiangrui Meng men...@gmail.com wrote: +1 Tested apps with standalone client mode and yarn cluster and client modes. Xiangrui On Wed, May 28, 2014 at 1:07 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Pulled down, compiled, and tested examples on OS X

Re: LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Christopher Nguyen
Bharath, (apologies if you're already familiar with the theory): the proposed approach may or may not be appropriate depending on the overall transfer function in your data. In general, a single logistic regressor cannot approximate arbitrary non-linear functions (of linear combinations of the

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Krishna Sankar
+1 Pulled, built on MacOS X, EC2 Amazon Linux. Ran test programs on OS X, 5 node c3.4xlarge cluster. Cheers k/ On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski andykonwin...@gmail.com wrote: +1 On May 28, 2014 7:05 PM, Xiangrui Meng men...@gmail.com wrote: +1 Tested apps with standalone

Re: Standard preprocessing/scaling

2014-05-28 Thread DB Tsai
Sometimes for this case, I will just standardize without centering. I still get good results. Sent from my Google Nexus 5 On May 28, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote: RowMatrix has a method to compute column summary statistics. There is a trade-off here because centering
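
Scaling without centering can be sketched as follows. This is a minimal plain-Python illustration under assumed names (`column_stds`, `scale_without_centering`) and toy data, not an MLlib API: each stored value is divided by its column's standard deviation, so implicit zeros stay zero and the data remains sparse.

```python
import math

# Illustrative sketch only: plain Python, not an MLlib API.
# Rows are sparse: a dict mapping column index -> nonzero value.

def column_stds(rows, num_cols):
    """Per-column sample standard deviation, counting implicit zeros."""
    n = len(rows)
    sums = [0.0] * num_cols
    sq_sums = [0.0] * num_cols
    for row in rows:
        for j, v in row.items():
            sums[j] += v
            sq_sums[j] += v * v
    stds = []
    for s, sq in zip(sums, sq_sums):
        mean = s / n
        var = (sq - n * mean * mean) / (n - 1) if n > 1 else 0.0
        stds.append(math.sqrt(max(var, 0.0)))  # clamp tiny negative round-off
    return stds


def scale_without_centering(rows, stds):
    """Divide stored values by the column std; constant columns are left unchanged."""
    return [
        {j: (v / stds[j] if stds[j] > 0 else v) for j, v in row.items()}
        for row in rows
    ]


# Hypothetical toy data: 4 rows, 3 columns, only 4 stored values.
rows = [{0: 2.0}, {1: 4.0}, {0: 4.0, 2: 1.0}, {}]
stds = column_stds(rows, 3)
scaled = scale_without_centering(rows, stds)

# The number of stored entries is unchanged, and each column now has unit variance.
rescaled_stds = column_stds(scaled, 3)
print(sum(len(r) for r in scaled), rescaled_stds)
```

Because variance does not depend on the column mean, the scaled columns have unit variance even though the means were never subtracted.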

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Kevin Markey
+1 Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 Ran current version of one of my applications on 1-node pseudocluster (sorry, unable to test on full cluster). yarn-cluster mode Ran regression tests. Thanks Kevin On 05/28/2014 09:55 PM, Krishna Sankar wrote: +1 Pulled built on MacOS X,