Re: FYI -- javax.servlet dependency issue workaround

2014-05-28 Thread Sean Owen
This class was introduced in Servlet 3.0. We have in the dependency
tree some references to Servlet 2.5 and Servlet 3.0. The latter is a
superset of the former. So we standardized on depending on Servlet
3.0.

At least, that seems to have been successful in the Maven build, but
this is just evidence that the SBT build ends up including Servlet 2.5
dependencies.

You shouldn't have to work around it in this way. Let me see if I can
debug and propose the right exclusion for SBT.
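
In the meantime, a hedged sketch of what such an exclusion might look like in
an SBT build (the dependency and exact artifact names below are illustrative,
not a confirmed fix):

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" exclude("org.mortbay.jetty", "servlet-api-2.5") exclude("javax.servlet", "servlet-api")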

(Related: is the SBT build going to continue to live separately from
Maven, or is it going to be auto-generated? That is, is it worth fixing
this?)


On Wed, May 28, 2014 at 6:08 AM, Kay Ousterhout  wrote:
> Hi all,
>
> I had some trouble compiling an application (Shark) against Spark 1.0,
> where Shark had a runtime exception (at the bottom of this message) because
> it couldn't find the javax.servlet classes.  SBT seemed to have trouble
> downloading the servlet APIs that are dependencies of Jetty (used by the
> Spark web UI), so I had to manually add them to the application's build
> file:
>
> libraryDependencies += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"
>
> Not exactly sure why this happens but thought it might be useful in case
> others run into the same problem.
>
> -Kay
>
> -
>
> Exception in thread "main" java.lang.NoClassDefFoundError: javax/servlet/FilterRegistration
>
> at org.eclipse.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:136)
> at org.eclipse.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:129)
> at org.eclipse.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:98)
> at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:98)
> at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:89)
> at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:64)
> at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:57)
> at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:57)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:57)
> at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:66)
> at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:60)
> at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:42)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:222)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:85)
> at shark.SharkContext.<init>(SharkContext.scala:42)
> at shark.SharkContext.<init>(SharkContext.scala:61)
> at shark.SharkEnv$.initWithSharkContext(SharkEnv.scala:78)
> at shark.SharkEnv$.init(SharkEnv.scala:38)
> at shark.SharkCliDriver.<init>(SharkCliDriver.scala:280)
> at shark.SharkCliDriver$.main(SharkCliDriver.scala:162)
> at shark.SharkCliDriver.main(SharkCliDriver.scala)
>
> Caused by: java.lang.ClassNotFoundException: javax.servlet.FilterRegistration
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 23 more


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Nick Pentreath
+1
Built and tested locally on Mac OS X
Built and tested on AWS Ubuntu, with and without Hive support
Ran production jobs including MLlib and SparkSQL/HiveContext successfully


On Wed, May 28, 2014 at 1:09 AM, Holden Karau  wrote:

> +1 (I did some very basic testing with PySpark & Pandas on rc11)
>
>
> > On Tue, May 27, 2014 at 3:53 PM, Mark Hamstra wrote:
>
> > +1
> >
> >
> > On Tue, May 27, 2014 at 9:26 AM, Ankur Dave  wrote:
> >
> > > 0
> > >
> > > OK, I withdraw my downvote.
> > >
> > > Ankur 
> > >
> >
>
>
>
> --
> Cell : 425-233-8271
>


Standard preprocessing/scaling

2014-05-28 Thread dataginjaninja
I searched on this, but didn't find anything general so I apologize if this
has been addressed. 

Many algorithms (SGD, SVM...) either will not converge or will run forever
if the data is not scaled. Scikit-learn has preprocessing
(http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html)
that will subtract the mean and divide by standard dev. Of course there are
a few options with it as well.

Is there something in the works for this?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Standard-preprocessing-scaling-tp6826.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Will Benton
+1

I made the necessary interface changes to my apps that use MLlib and tested all 
of my code against rc11 on Fedora 20 and OS X 10.9.3.  (The Fedora Rawhide 
package remains at 0.9.1 pending some additional dependency packaging work.)


best,
wb


- Original Message -
> From: "Tathagata Das" 
> To: dev@spark.apache.org
> Sent: Monday, May 26, 2014 9:38:10 AM
> Subject: [VOTE] Release Apache Spark 1.0.0 (RC11)
> 
> Please vote on releasing the following candidate as Apache Spark version
> 1.0.0!
> 
> This has a few important bug fixes on top of rc10:
> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
> SPARK-1870: https://github.com/apache/spark/pull/848
> SPARK-1897: https://github.com/apache/spark/pull/849
> 
> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/tdas.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1019/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
> 
> Please vote on releasing this package as Apache Spark 1.0.0!
> 
> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 1.0.0
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see
> http://spark.apache.org/
> 
> == API Changes ==
> We welcome users to compile Spark applications against 1.0. There are
> a few API changes in this release. Here are links to the associated
> upgrade guides - user facing changes have been kept as small as
> possible.
> 
> Changes to ML vector specification:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
> 
> Changes to the Java API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> 
> Changes to the streaming API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
> 
> Changes to the GraphX API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
> 
> Other changes:
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
> 
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior
> 


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Henry Saputra
NOTICE and LICENSE files look good.
Signatures look good.
Hashes look good.
No external executables in the source distributions.
Source compiled with sbt.
Ran local and standalone examples; they look good.

+1


- Henry

On Mon, May 26, 2014 at 7:38 AM, Tathagata Das
 wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.0.0!
>
> This has a few important bug fixes on top of rc10:
> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
> SPARK-1870: https://github.com/apache/spark/pull/848
> SPARK-1897: https://github.com/apache/spark/pull/849
>
> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/tdas.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1019/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
>
> Please vote on releasing this package as Apache Spark 1.0.0!
>
> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.0.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == API Changes ==
> We welcome users to compile Spark applications against 1.0. There are
> a few API changes in this release. Here are links to the associated
> upgrade guides - user facing changes have been kept as small as
> possible.
>
> Changes to ML vector specification:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
>
> Changes to the Java API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> Changes to the streaming API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>
> Changes to the GraphX API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>
> Other changes:
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior


LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Bharath Ravi Kumar
I'm looking to reuse the LogisticRegression model (with SGD) to predict a
real-valued outcome variable. (I understand that logistic regression is
generally applied to predict binary outcome, but for various reasons, this
model suits our needs better than LinearRegression). Related to that I have
the following questions:

1) Can the current LogisticRegression model be used as is to train based on
binary input (i.e. explanatory) features, or is there an assumption that
the explanatory features must be continuous?

2) I intend to reuse the current class to train a model on LabeledPoints
where the label is a real value (and not 0 / 1). I'd like to know if
invoking setValidateData(false) would suffice or if one must override the
validator to achieve this.

3) I recall seeing an experimental method on the class (
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala)
that clears the threshold separating positive & negative predictions. Once
the model is trained on real valued labels, would clearing this flag
suffice to predict an outcome that is continuous in nature?

Thanks,
Bharath

P.S: I'm writing to dev@ and not user@ assuming that lib changes might be
necessary. Apologies if the mailing list is incorrect.


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Sean McNamara
Pulled down, compiled, and tested examples on OS X and ubuntu.
Deployed app we are building on spark and poured data through it.

+1

Sean


On May 26, 2014, at 8:39 AM, Tathagata Das  wrote:

> Please vote on releasing the following candidate as Apache Spark version 
> 1.0.0!
> 
> This has a few important bug fixes on top of rc10:
> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
> SPARK-1870: https://github.com/apache/spark/pull/848
> SPARK-1897: https://github.com/apache/spark/pull/849
> 
> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/tdas.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1019/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
> 
> Please vote on releasing this package as Apache Spark 1.0.0!
> 
> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 1.0.0
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see
> http://spark.apache.org/
> 
> == API Changes ==
> We welcome users to compile Spark applications against 1.0. There are
> a few API changes in this release. Here are links to the associated
> upgrade guides - user facing changes have been kept as small as
> possible.
> 
> Changes to ML vector specification:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
> 
> Changes to the Java API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> 
> Changes to the streaming API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
> 
> Changes to the GraphX API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
> 
> Other changes:
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
> 
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior



ContextCleaner, weak references, and serialization

2014-05-28 Thread Will Benton
Friends, 

For context (so to speak), I did some work in the 0.9 timeframe to fix 
SPARK-897 (provide immediate feedback when closures aren't serializable) and 
SPARK-729 (make sure that free variables in closures are captured when the RDD 
transformations are declared).

I currently have a branch addressing SPARK-897 that builds and tests out 
against 0.9, 1.0, and master last I checked 
(https://github.com/apache/spark/pull/143).  My branch addressing SPARK-729 
builds on my SPARK-897 branch, and passed the test suite in 0.9[1].  However, 
some things that changed or were added in 1.0 wound up depending on the old 
behavior.  I've been working on other things lately but would like to get these 
issues fixed after 1.0 goes final so I was hoping to get a bit of discussion on 
the best way to go forward with an issue that I haven't solved yet:

ContextCleaner uses weak references to track broadcast variables.  Because weak 
references obviously don't track cloned objects (or those that have been 
serialized and deserialized), capturing free variables in closures in the 
obvious way (i.e. by replacing the closure with a copy that has been serialized 
and deserialized) results in an undesirable situation:  we might have, e.g., 
live HTTP broadcast variable objects referring to filesystem resources that 
could be cleaned at any time because the objects that they were cloned from 
have become only weakly reachable.
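
A minimal, self-contained illustration of the underlying issue (plain JVM code,
not Spark internals; the names and the resource path are made up):

import java.io._
import java.lang.ref.WeakReference

case class Handle(resourcePath: String)            // stands in for a broadcast's on-disk files

object WeakRefCopyDemo {
  // Serialize and deserialize: the result is a distinct copy of the original object.
  def roundTrip[T <: Serializable](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    new ObjectOutputStream(buf).writeObject(obj)
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray)).readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    var original: Handle = Handle("/tmp/broadcast_0")   // hypothetical resource
    val tracker = new WeakReference(original)
    val copy = roundTrip(original)

    original = null                                 // only the copy stays strongly reachable
    System.gc()
    // tracker.get() may now return null, so a cleaner keyed on the weak reference
    // could delete /tmp/broadcast_0 even though `copy` still refers to that path.
    println(s"tracker sees: ${tracker.get()}, copy still needs: ${copy.resourcePath}")
  }
}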

To be clear, this isn't a problem now; it's only a problem for the way I'm 
proposing to fix SPARK-729.  With that said, I'm wondering if it would make 
more sense to fix this problem by adding a layer of indirection to reference 
count external and persisting resources rather than the objects that putatively 
own them, or if it would make more sense to take a more sophisticated (but also 
more potentially fragile) approach to ensuring variable capture.



thanks,
wb


[1] Serializing closures also created or uncovered a PySpark issue in 0.9 (and 
presumably in later versions as well) that requires further investigation, but 
my patch did include a workaround; here are the details: 
https://issues.apache.org/jira/browse/SPARK-1454


Re: Kryo serialization for closures: a workaround

2014-05-28 Thread Will Benton
This is an interesting approach, Nilesh!

Someone will correct me if I'm wrong, but I don't think this could go into 
ClosureCleaner as a default behavior (since Kryo apparently breaks on some 
classes that depend on custom Java serializers, as has come up on the list 
recently).  But it does seem like having a function in Spark that did this for 
closures more transparently (to be called explicitly by clients in problem 
cases) could be pretty useful.


best,
wb


- Original Message -
> From: "Nilesh" 
> To: d...@spark.incubator.apache.org
> Sent: Saturday, May 24, 2014 10:32:57 AM
> Subject: Kryo serialization for closures: a workaround
> 
> Suppose my mappers can be functions (def) that internally call other classes
> and create objects and do different things inside. (Or they can even be
> classes that extend (Foo) => Bar and do the processing in their apply method
> - but let's ignore this case for now)
> 
> Spark supports only Java Serialization for closures and forces all the
> classes inside to implement Serializable and coughs up errors when forced to
> use Kryo for closures. But one cannot expect all 3rd party libraries to have
> all classes extend Serializable!
> 
> Here's a workaround that I thought I'd share in case anyone comes across
> this problem:
> 
> You simply need to serialize the objects before passing through the closure,
> and de-serialize afterwards. This approach just works, even if your classes
> aren't Serializable, because it uses Kryo behind the scenes. All you need is
> some curry. ;) Here's an example of how I did it:
> 
> def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])(foo: Foo): Bar =
>   kryoWrapper.value.apply(foo)
>
> val mapper = genMapper(KryoSerializationWrapper(new Blah(abc)))
> _rdd.flatMap(mapper).collectAsMap()
>
> class Blah(abc: ABC) extends (Foo => Bar) {
>   def apply(foo: Foo): Bar = { /* This is the real function */ }
> }
>
> Feel free to make Blah as complicated as you want, class, companion object,
> nested classes, references to multiple 3rd party libs.
> 
> KryoSerializationWrapper refers to  this wrapper from amplab/shark
> 
> 
> Don't you think it's a good idea to have something like this inside the
> framework itself? :)
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Kryo-serialization-for-closures-a-workaround-tp6787.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Tom Graves
+1. Tested spark on yarn (cluster mode, client mode, pyspark, spark-shell) on 
hadoop 0.23 and 2.4. 

Tom


On Wednesday, May 28, 2014 3:07 PM, Sean McNamara wrote:


Pulled down, compiled, and tested examples on OS X and ubuntu.
Deployed app we are building on spark and poured data through it.

+1

Sean



On May 26, 2014, at 8:39 AM, Tathagata Das  wrote:

> Please vote on releasing the following candidate as Apache Spark version 
> 1.0.0!
> 
> This has a few important bug fixes on top of rc10:
> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
> SPARK-1870: https://github.com/apache/spark/pull/848
> SPARK-1897: https://github.com/apache/spark/pull/849
> 
> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/tdas.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1019/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
> 
> Please vote on releasing this package as Apache Spark 1.0.0!
> 
> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 1.0.0
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see
> http://spark.apache.org/
> 
> == API Changes ==
> We welcome users to compile Spark applications against 1.0. There are
> a few API changes in this release. Here are links to the associated
> upgrade guides - user facing changes have been kept as small as
> possible.
> 
> Changes to ML vector specification:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
> 
> Changes to the Java API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> 
> Changes to the streaming API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
> 
> Changes to the GraphX API:
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
> 
> Other changes:
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
> 
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior

Re: LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Xiangrui Meng
Please find my comments inline. -Xiangrui

On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar
 wrote:
> I'm looking to reuse the LogisticRegression model (with SGD) to predict a
> real-valued outcome variable. (I understand that logistic regression is
> generally applied to predict binary outcome, but for various reasons, this
> model suits our needs better than LinearRegression). Related to that I have
> the following questions:
>
> 1) Can the current LogisticRegression model be used as is to train based on
> binary input (i.e. explanatory) features, or is there an assumption that
> the explanatory features must be continuous?
>

Binary features should be okay.

> 2) I intend to reuse the current class to train a model on LabeledPoints
> where the label is a real value (and not 0 / 1). I'd like to know if
> invoking setValidateData(false) would suffice or if one must override the
> validator to achieve this.
>

I'm not sure whether the loss function makes sense with real valued
labels. We may use the assumption that the label is binary to simplify
the computation of loss. You can take a look at the code and see
whether the loss function fits your model.
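
(For reference, a hedged sketch of the standard binary logistic loss, not a
copy of the MLlib source: with margin m = w.dot(x) and label y in {0, 1},
loss = log(1 + exp(m)) - y * m. The expression still evaluates for a
real-valued y, but it is then no longer the negative log-likelihood of a
Bernoulli model.)

def logisticLoss(margin: Double, label: Double): Double =
  math.log1p(math.exp(margin)) - label * margin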

> 3) I recall seeing an experimental method on the class (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala)
> that clears the threshold separating positive & negative predictions. Once
> the model is trained on real valued labels, would clearing this flag
> suffice to predict an outcome that is continuous in nature?
>

If you clear the threshold, it outputs the raw scores from the
logistic function.
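
A minimal sketch of that usage (the data path and iteration count are
illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

val points = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // hypothetical path
val model = LogisticRegressionWithSGD.train(points, 100)
model.clearThreshold()                 // predict() now returns raw scores in (0, 1)
val scores = points.map(p => (p.label, model.predict(p.features)))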

> Thanks,
> Bharath
>
> P.S: I'm writing to dev@ and not user@ assuming that lib changes might be
> necessary. Apologies if the mailing list is incorrect.


Re: Standard preprocessing/scaling

2014-05-28 Thread Xiangrui Meng
RowMatrix has a method to compute column summary statistics. There is
a trade-off here because centering may densify the data. A utility
function that centers data would be useful for dense datasets.
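
A hedged sketch of what that could look like today with the existing API (the
helper name and the option to skip centering for sparse data are illustrative):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

def standardize(rows: RDD[Vector], center: Boolean): RDD[Vector] = {
  val summary = new RowMatrix(rows).computeColumnSummaryStatistics()
  val mean = summary.mean.toArray
  val std = summary.variance.toArray.map(math.sqrt)
  rows.map { v =>
    val arr = v.toArray                            // note: densifies sparse vectors
    val out = arr.indices.map { i =>
      val centered = if (center) arr(i) - mean(i) else arr(i)
      if (std(i) == 0.0) centered else centered / std(i)
    }
    Vectors.dense(out.toArray)
  }
}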
-Xiangrui

On Wed, May 28, 2014 at 5:03 AM, dataginjaninja
 wrote:
> I searched on this, but didn't find anything general so I apologize if this
> has been addressed.
>
> Many algorithms (SGD, SVM...) either will not converge or will run forever
> if the data is not scaled. Scikit-learn has preprocessing
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html)
> that will subtract the mean and divide by standard dev. Of course there are
> a few options with it as well.
>
> Is there something in the works for this?
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Standard-preprocessing-scaling-tp6826.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Xiangrui Meng
+1

Tested apps with standalone client mode and yarn cluster and client modes.

Xiangrui

On Wed, May 28, 2014 at 1:07 PM, Sean McNamara
 wrote:
> Pulled down, compiled, and tested examples on OS X and ubuntu.
> Deployed app we are building on spark and poured data through it.
>
> +1
>
> Sean
>
>
> On May 26, 2014, at 8:39 AM, Tathagata Das  
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.0.0!
>>
>> This has a few important bug fixes on top of rc10:
>> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
>> SPARK-1870: https://github.com/apache/spark/pull/848
>> SPARK-1897: https://github.com/apache/spark/pull/849
>>
>> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/tdas.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1019/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.0.0!
>>
>> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.0.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == API Changes ==
>> We welcome users to compile Spark applications against 1.0. There are
>> a few API changes in this release. Here are links to the associated
>> upgrade guides - user facing changes have been kept as small as
>> possible.
>>
>> Changes to ML vector specification:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
>>
>> Changes to the Java API:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>
>> Changes to the streaming API:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>>
>> Changes to the GraphX API:
>> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>>
>> Other changes:
>> coGroup and related functions now return Iterable[T] instead of Seq[T]
>> ==> Call toSeq on the result to restore the old behavior
>>
>> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>> ==> Call toSeq on the result to restore old behavior
>


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Andy Konwinski
+1
On May 28, 2014 7:05 PM, "Xiangrui Meng"  wrote:

> +1
>
> Tested apps with standalone client mode and yarn cluster and client modes.
>
> Xiangrui
>
> On Wed, May 28, 2014 at 1:07 PM, Sean McNamara
>  wrote:
> > Pulled down, compiled, and tested examples on OS X and ubuntu.
> > Deployed app we are building on spark and poured data through it.
> >
> > +1
> >
> > Sean
> >
> >
> > On May 26, 2014, at 8:39 AM, Tathagata Das 
> wrote:
> >
> >> Please vote on releasing the following candidate as Apache Spark
> version 1.0.0!
> >>
> >> This has a few important bug fixes on top of rc10:
> >> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
> >> SPARK-1870: https://github.com/apache/spark/pull/848
> >> SPARK-1897: https://github.com/apache/spark/pull/849
> >>
> >> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> http://people.apache.org/~tdas/spark-1.0.0-rc11/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/tdas.asc
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1019/
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
> >>
> >> Please vote on releasing this package as Apache Spark 1.0.0!
> >>
> >> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
> >> a majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 1.0.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see
> >> http://spark.apache.org/
> >>
> >> == API Changes ==
> >> We welcome users to compile Spark applications against 1.0. There are
> >> a few API changes in this release. Here are links to the associated
> >> upgrade guides - user facing changes have been kept as small as
> >> possible.
> >>
> >> Changes to ML vector specification:
> >>
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
> >>
> >> Changes to the Java API:
> >>
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> >>
> >> Changes to the streaming API:
> >>
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
> >>
> >> Changes to the GraphX API:
> >>
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
> >>
> >> Other changes:
> >> coGroup and related functions now return Iterable[T] instead of Seq[T]
> >> ==> Call toSeq on the result to restore the old behavior
> >>
> >> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> >> ==> Call toSeq on the result to restore old behavior
> >
>


Re: LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Christopher Nguyen
Bharath, (apologies if you're already familiar with the theory): the
proposed approach may or may not be appropriate depending on the overall
transfer function in your data. In general, a single logistic regressor
cannot approximate arbitrary non-linear functions (of linear combinations
of the inputs). You can review works by, e.g., Hornik and Cybenko in the
late 80's to see if you need something more, such as a simple, one
hidden-layer neural network.

This is a good summary:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.2647&rep=rep1&type=pdf

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar wrote:

> I'm looking to reuse the LogisticRegression model (with SGD) to predict a
> real-valued outcome variable. (I understand that logistic regression is
> generally applied to predict binary outcome, but for various reasons, this
> model suits our needs better than LinearRegression). Related to that I have
> the following questions:
>
> 1) Can the current LogisticRegression model be used as is to train based on
> binary input (i.e. explanatory) features, or is there an assumption that
> the explanatory features must be continuous?
>
> 2) I intend to reuse the current class to train a model on LabeledPoints
> where the label is a real value (and not 0 / 1). I'd like to know if
> invoking setValidateData(false) would suffice or if one must override the
> validator to achieve this.
>
> 3) I recall seeing an experimental method on the class (
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
> )
> that clears the threshold separating positive & negative predictions. Once
> the model is trained on real valued labels, would clearing this flag
> suffice to predict an outcome that is continuous in nature?
>
> Thanks,
> Bharath
>
> P.S: I'm writing to dev@ and not user@ assuming that lib changes might be
> necessary. Apologies if the mailing list is incorrect.
>


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Krishna Sankar
+1
Pulled & built on MacOS X, EC2 Amazon Linux
Ran test programs on OS X, 5 node c3.4xlarge cluster
Cheers



On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski wrote:

> +1
> On May 28, 2014 7:05 PM, "Xiangrui Meng"  wrote:
>
> > +1
> >
> > Tested apps with standalone client mode and yarn cluster and client
> modes.
> >
> > Xiangrui
> >
> > On Wed, May 28, 2014 at 1:07 PM, Sean McNamara
> >  wrote:
> > > Pulled down, compiled, and tested examples on OS X and ubuntu.
> > > Deployed app we are building on spark and poured data through it.
> > >
> > > +1
> > >
> > > Sean
> > >
> > >
> > > On May 26, 2014, at 8:39 AM, Tathagata Das <
> tathagata.das1...@gmail.com>
> > wrote:
> > >
> > >> Please vote on releasing the following candidate as Apache Spark
> > version 1.0.0!
> > >>
> > >> This has a few important bug fixes on top of rc10:
> > >> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
> > >> SPARK-1870: https://github.com/apache/spark/pull/848
> > >> SPARK-1897: https://github.com/apache/spark/pull/849
> > >>
> > >> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
> > >>
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
> > >>
> > >> The release files, including signatures, digests, etc. can be found
> at:
> > >> http://people.apache.org/~tdas/spark-1.0.0-rc11/
> > >>
> > >> Release artifacts are signed with the following key:
> > >> https://people.apache.org/keys/committer/tdas.asc
> > >>
> > >> The staging repository for this release can be found at:
> > >>
> https://repository.apache.org/content/repositories/orgapachespark-1019/
> > >>
> > >> The documentation corresponding to this release can be found at:
> > >> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
> > >>
> > >> Please vote on releasing this package as Apache Spark 1.0.0!
> > >>
> > >> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
> > >> a majority of at least 3 +1 PMC votes are cast.
> > >>
> > >> [ ] +1 Release this package as Apache Spark 1.0.0
> > >> [ ] -1 Do not release this package because ...
> > >>
> > >> To learn more about Apache Spark, please see
> > >> http://spark.apache.org/
> > >>
> > >> == API Changes ==
> > >> We welcome users to compile Spark applications against 1.0. There are
> > >> a few API changes in this release. Here are links to the associated
> > >> upgrade guides - user facing changes have been kept as small as
> > >> possible.
> > >>
> > >> Changes to ML vector specification:
> > >>
> >
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
> > >>
> > >> Changes to the Java API:
> > >>
> >
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> > >>
> > >> Changes to the streaming API:
> > >>
> >
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
> > >>
> > >> Changes to the GraphX API:
> > >>
> >
> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
> > >>
> > >> Other changes:
> > >> coGroup and related functions now return Iterable[T] instead of Seq[T]
> > >> ==> Call toSeq on the result to restore the old behavior
> > >>
> > >> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> > >> ==> Call toSeq on the result to restore old behavior
> > >
> >
>


Re: Standard preprocessing/scaling

2014-05-28 Thread DB Tsai
Sometimes in this case I will just standardize without centering, and I
still get good results.

Sent from my Google Nexus 5
On May 28, 2014 7:03 PM, "Xiangrui Meng"  wrote:

> RowMatrix has a method to compute column summary statistics. There is
> a trade-off here because centering may densify the data. A utility
> function that centers data would be useful for dense datasets.
> -Xiangrui
>
> On Wed, May 28, 2014 at 5:03 AM, dataginjaninja
>  wrote:
> > I searched on this, but didn't find anything general so I apologize if
> this
> > has been addressed.
> >
> > Many algorithms (SGD, SVM...) either will not converge or will run
> forever
> > if the data is not scaled. Sci-kit has  preprocessing
> > <
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> >
> > that will subtract the mean and divide by standard dev. Of course there
> are
> > a few options with it as well.
> >
> > Is there something in the works for this?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Standard-preprocessing-scaling-tp6826.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Kevin Markey

+1

Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0
Ran the current version of one of my applications on a 1-node pseudocluster
in yarn-cluster mode (sorry, unable to test on a full cluster).
Ran regression tests.

Thanks
Kevin

On 05/28/2014 09:55 PM, Krishna Sankar wrote:

+1
Pulled & built on MacOS X, EC2 Amazon Linux
Ran test programs on OS X, 5 node c3.4xlarge cluster
Cheers



On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski wrote:


+1
On May 28, 2014 7:05 PM, "Xiangrui Meng"  wrote:


+1

Tested apps with standalone client mode and yarn cluster and client modes.

Xiangrui

On Wed, May 28, 2014 at 1:07 PM, Sean McNamara
 wrote:

Pulled down, compiled, and tested examples on OS X and ubuntu.
Deployed app we are building on spark and poured data through it.

+1

Sean


On May 26, 2014, at 8:39 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This has a few important bug fixes on top of rc10:
SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
SPARK-1870: https://github.com/apache/spark/pull/848
SPARK-1897: https://github.com/apache/spark/pull/849

The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):


https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

The release files, including signatures, digests, etc. can be found at:

http://people.apache.org/~tdas/spark-1.0.0-rc11/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:


https://repository.apache.org/content/repositories/orgapachespark-1019/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Thursday, May 29, at 16:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:


http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

Changes to the Java API:


http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

Changes to the streaming API:


http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

Changes to the GraphX API:


http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

Other changes:
coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior




Suggestion: RDD cache depth

2014-05-28 Thread innowireless TaeYun Kim
It would be nice if the RDD cache() method incorporated depth information.

That is,

 

void test()
{
    JavaRDD<.> rdd = .;

    rdd.cache();    // to depth 1. Actual caching happens.
    rdd.cache();    // to depth 2. No-op as long as the storage level is the same; else, exception.
    .
    rdd.uncache();  // to depth 1. No-op.
    rdd.uncache();  // to depth 0. Actual unpersist happens.
}

 

This can be useful when writing code in a modular way.

When a function receives an rdd as an argument, it doesn't necessarily know
the cache status of the rdd.

But it could want to cache the rdd, since it will use the rdd multiple
times.

But with the current RDD API, it cannot determine whether it should
unpersist it or leave it alone (so that the caller can continue to use that
rdd without rebuilding it).
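
A hedged sketch of a user-level reference-counting wrapper along these lines
(not a proposed Spark API; the names are illustrative):

import org.apache.spark.rdd.RDD
import scala.collection.mutable

object RefCountedCache {
  private val counts = mutable.Map.empty[Int, Int]        // keyed by rdd.id

  def acquire[T](rdd: RDD[T]): RDD[T] = synchronized {
    val n = counts.getOrElse(rdd.id, 0)
    if (n == 0) rdd.cache()                                // first acquire really caches
    counts(rdd.id) = n + 1
    rdd
  }

  def release[T](rdd: RDD[T]): Unit = synchronized {
    counts.get(rdd.id).foreach { n =>
      if (n <= 1) { counts -= rdd.id; rdd.unpersist() }    // last release really unpersists
      else counts(rdd.id) = n - 1
    }
  }
}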

 

Thanks.

 



GraphX triplets on 5-node graph

2014-05-28 Thread Michael Malak
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I 
missing something fundamental?


val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, 
"N4"), (5L, "N5"))) 
val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), 
Edge(2L, 4L, "E3"), Edge(3L, 5L, "E4"))) 
Graph(nodes, edges).triplets.collect 
res1: Array[org.apache.spark.graphx.EdgeTriplet[String,String]] = 
Array(((1,N1),(3,N3),E2), ((1,N1),(3,N3),E2), ((3,N3),(5,N5),E4), 
((3,N3),(5,N5),E4))


Re: Suggestion: RDD cache depth

2014-05-28 Thread Matei Zaharia
This is a pretty cool idea — instead of cache depth I’d call it something like 
reference counting. Would you mind opening a JIRA issue about it?

The issue of really composing together libraries that use RDDs nicely isn’t 
fully explored, but this is certainly one thing that would help with it. I’d 
love to look at other ones too, e.g. how to allow libraries to share scans over 
the same dataset.

Unfortunately using multiple cache() calls for this is probably not feasible 
because it would change the current meaning of multiple calls. But we can add a 
new API, or a parameter to the method.

Matei

On May 28, 2014, at 11:46 PM, innowireless TaeYun Kim 
 wrote:

> It would be nice if the RDD cache() method incorporated depth information.
> 
> That is,
> 
> 
> 
> void test()
> {
> 
> JavaRDD<.> rdd = .;
> 
> 
> 
> rdd.cache();  // to depth 1. actual caching happens.
> 
> rdd.cache();  // to depth 2. Nop as long as the storage level is the same.
> Else, exception.
> 
> .
> 
> rdd.uncache();  // to depth 1. Nop.
> 
> rdd.uncache();  // to depth 0. Actual unpersist happens.
> 
> }
> 
> 
> 
> This can be useful when writing code in modular way.
> 
> When a function receives an rdd as an argument, it doesn't necessarily know
> the cache status of the rdd.
> 
> But it could want to cache the rdd, since it will use the rdd multiple
> times.
> 
> But with the current RDD API, it cannot determine whether it should
> unpersist it or leave it alone (so that caller can continue to use that rdd
> without rebuilding).
> 
> 
> 
> Thanks.
> 
> 
> 



Re: GraphX triplets on 5-node graph

2014-05-28 Thread Reynold Xin
Take a look at this one: https://issues.apache.org/jira/browse/SPARK-1188

It was an optimization that added user inconvenience. We got rid of that
now in Spark 1.0.
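
On 0.9.x, assuming the duplicates come from the triplet objects being reused
by that optimization, a hedged workaround is to copy the fields out before
collecting, e.g.:

Graph(nodes, edges).triplets.map(t => (t.srcId, t.srcAttr, t.dstId, t.dstAttr, t.attr)).collect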



On Wed, May 28, 2014 at 11:48 PM, Michael Malak wrote:

> Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or
> am I missing something fundamental?
>
>
> val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L,
> "N4"), (5L, "N5")))
> val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"),
> Edge(2L, 4L, "E3"), Edge(3L, 5L, "E4")))
> Graph(nodes, edges).triplets.collect
> res1: Array[org.apache.spark.graphx.EdgeTriplet[String,String]] =
> Array(((1,N1),(3,N3),E2), ((1,N1),(3,N3),E2), ((3,N3),(5,N5),E4),
> ((3,N3),(5,N5),E4))
>