[jira] [Commented] (SPARK-1222) Logistic Regression (+ regularized variants)
[ https://issues.apache.org/jira/browse/SPARK-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962007#comment-13962007 ]

Martin Jaggi commented on SPARK-1222:

is resolved, right?

Logistic Regression (+ regularized variants)
Key: SPARK-1222
URL: https://issues.apache.org/jira/browse/SPARK-1222
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Ameet Talwalkar
Assignee: Shivaram Venkataraman

Implement Logistic Regression using the SGD optimization primitives.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
[ https://issues.apache.org/jira/browse/SPARK-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-1433:

Description: HBase 0.14.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
(was: HBase 0.14.0 was released 6 months ago. Upgrade HBase dependency to 0.17.0)

Upgrade Mesos dependency to 0.17.0
Key: SPARK-1433
URL: https://issues.apache.org/jira/browse/SPARK-1433
Project: Spark
Issue Type: Task
Reporter: Sandeep Singh
Priority: Minor

HBase 0.14.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
[jira] [Updated] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
[ https://issues.apache.org/jira/browse/SPARK-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-1433:

Description: Mesos 0.14.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
(was: HBase 0.14.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0)

Upgrade Mesos dependency to 0.17.0
Key: SPARK-1433
URL: https://issues.apache.org/jira/browse/SPARK-1433
Project: Spark
Issue Type: Task
Reporter: Sandeep Singh
Priority: Minor

Mesos 0.14.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
[jira] [Updated] (SPARK-1434) Make labelParser Java friendly.
[ https://issues.apache.org/jira/browse/SPARK-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-1434:

Component/s: MLlib

Make labelParser Java friendly.
Key: SPARK-1434
URL: https://issues.apache.org/jira/browse/SPARK-1434
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
Fix For: 1.0.0

MLUtils#loadLibSVMData uses an anonymous function for the label parser, which is awkward to call from Java. So I made a trait for LabelParser and provided two implementations: binary and multiclass.
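A trait-based design along the lines the issue describes could look like the following sketch. The names follow the issue's wording (LabelParser, binary and multiclass implementations) but are illustrative, not necessarily MLlib's final API:

```scala
// A Java-friendly parser trait: Java callers can implement or reference a
// named type instead of a Scala anonymous function.
trait LabelParser extends Serializable {
  def parse(labelString: String): Double
}

// Binary labels: any positive value maps to 1.0, everything else to 0.0.
object BinaryLabelParser extends LabelParser {
  override def parse(labelString: String): Double =
    if (labelString.toDouble > 0) 1.0 else 0.0
}

// Multiclass labels: keep the numeric class index as-is.
object MulticlassLabelParser extends LabelParser {
  override def parse(labelString: String): Double = labelString.toDouble
}
```

A loader method can then accept a `LabelParser` parameter, which Java code can satisfy with either singleton.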
[jira] [Updated] (SPARK-1432) Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
[ https://issues.apache.org/jira/browse/SPARK-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1432:

Assignee: Davis Shepherd

Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
Key: SPARK-1432
URL: https://issues.apache.org/jira/browse/SPARK-1432
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 0.9.0
Reporter: Davis Shepherd
Assignee: Davis Shepherd
Fix For: 1.0.0, 0.9.2

JobProgressTracker continuously cleans up old metadata as per the spark.ui.retainedStages configuration parameter. It seems, however, that not all metadata maps are being cleaned; in particular, stageIdToExecutorSummaries could grow in an unbounded manner in a long-running application.
[jira] [Resolved] (SPARK-1432) Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
[ https://issues.apache.org/jira/browse/SPARK-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-1432.

Resolution: Fixed
Fix Version/s: 0.9.2, 1.0.0

Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
Key: SPARK-1432
URL: https://issues.apache.org/jira/browse/SPARK-1432
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 0.9.0
Reporter: Davis Shepherd
Assignee: Davis Shepherd
Fix For: 1.0.0, 0.9.2

JobProgressTracker continuously cleans up old metadata as per the spark.ui.retainedStages configuration parameter. It seems, however, that not all metadata maps are being cleaned; in particular, stageIdToExecutorSummaries could grow in an unbounded manner in a long-running application.
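The class of bug described here is easy to reproduce in miniature: when old stages are evicted under a retention limit, every per-stage map must be trimmed, and any map that is missed (here, stageIdToExecutorSummaries) grows without bound. A sketch with hypothetical names, not the actual JobProgressListener code:

```scala
import scala.collection.mutable

// Toy stage-metadata tracker with a retention limit, illustrating that the
// fix is to remove the evicted stage id from *every* map keyed by stage id.
class StageMetadata(retainedStages: Int) {
  val stageIdToTime = mutable.HashMap[Int, Long]()
  val stageIdToExecutorSummaries = mutable.HashMap[Int, String]()
  private val completed = mutable.Queue[Int]()

  def stageCompleted(stageId: Int): Unit = {
    completed.enqueue(stageId)
    if (completed.size > retainedStages) {
      val evicted = completed.dequeue()
      // The leak occurs if any of these remove() calls is forgotten.
      stageIdToTime.remove(evicted)
      stageIdToExecutorSummaries.remove(evicted)
    }
  }
}
```

After many stages complete, both maps stay bounded at retainedStages entries; dropping either remove() call reproduces the unbounded growth the report describes.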
[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962026#comment-13962026 ]

Matei Zaharia commented on SPARK-1021:

Note that if we do this, we'll need a similar fix in Python, which may be trickier.

sortByKey() launches a cluster job when it shouldn't
Key: SPARK-1021
URL: https://issues.apache.org/jira/browse/SPARK-1021
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.8.0, 0.9.0
Reporter: Andrew Ash
Labels: starter

The sortByKey() method is listed as a transformation, not an action, in the documentation, but it launches a cluster job regardless.
http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html

Some discussion on the mailing list suggested that this is a problem with the rdd.count() call inside Partitioner.scala's rangeBounds method:
https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102

Josh Rosen suggests that rangeBounds should be made into a lazy val:
{quote}
I wonder whether making RangePartitioner.rangeBounds into a lazy val would fix this (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds() is never called before an action is performed. This could be tricky because it's called in the RangePartitioner.equals() method. Maybe it's sufficient to just compare the number of partitions, the ids of the RDDs used to create the RangePartitioner, and the sort ordering. This still supports the case where I range-partition one RDD and pass the same partitioner to a different RDD. It breaks support for the case where two range partitioners created on different RDDs happen to have the same rangeBounds(), but it seems unlikely that this would really harm performance, since the range partitioners are rarely equal by chance.
{quote}
Can we please make this happen? I'll send a PR on GitHub to start the discussion and testing.
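The shape of Josh Rosen's suggestion can be sketched as follows. This is a hypothetical simplification, not the actual Spark patch: the sampling job is deferred behind a lazy val, and equals() compares cheap identity information instead of the bounds:

```scala
// Simplified stand-in for RangePartitioner: `sample` represents the
// expensive cluster job (rdd.count()/sampling in real Spark) that should
// only run when an action actually needs the bounds.
class LazyRangePartitioner[K: Ordering](val partitions: Int, sample: () => Array[K]) {
  // Deferred until first use, so constructing the partitioner (e.g. inside
  // sortByKey, a transformation) no longer launches a job.
  lazy val rangeBounds: Array[K] = sample()

  def getPartition(key: K): Int = {
    val ord = implicitly[Ordering[K]]
    var i = 0
    while (i < rangeBounds.length && ord.gt(key, rangeBounds(i))) i += 1
    i
  }

  // equals() deliberately avoids touching rangeBounds: comparing two
  // partitioners must not itself trigger the sampling job.
  override def equals(other: Any): Boolean = other match {
    case o: LazyRangePartitioner[_] => o.partitions == partitions
    case _ => false
  }
}
```

As the quote notes, comparing only identity information trades away equality of coincidentally identical bounds, which is considered an acceptable cost.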
[jira] [Updated] (SPARK-1403) Mesos on Spark does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1403:

Summary: Mesos on Spark does not set Thread's context class loader (was: java.lang.ClassNotFoundException - spark on mesos)

Mesos on Spark does not set Thread's context class loader
Key: SPARK-1403
URL: https://issues.apache.org/jira/browse/SPARK-1403
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker

I can run Spark 0.9.0 on Mesos but not Spark 1.0.0, because the Spark executor on the Mesos slave throws a java.lang.ClassNotFoundException for org.apache.spark.serializer.JavaSerializer. The lengthy discussion is here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513
[jira] [Updated] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1403:

Summary: Spark on Mesos does not set Thread's context class loader (was: Mesos on Spark does not set Thread's context class loader)

Spark on Mesos does not set Thread's context class loader
Key: SPARK-1403
URL: https://issues.apache.org/jira/browse/SPARK-1403
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker

I can run Spark 0.9.0 on Mesos but not Spark 1.0.0, because the Spark executor on the Mesos slave throws a java.lang.ClassNotFoundException for org.apache.spark.serializer.JavaSerializer. The lengthy discussion is here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513
[jira] [Commented] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962032#comment-13962032 ]

Patrick Wendell commented on SPARK-1403:

The underlying issue here is that we've made assumptions in various parts of the codebase that the context classloader is set on a thread. In general, we should relax these assumptions and just fall back to the classloader that loaded Spark. As a workaround, this patch manually sets the classloader to the system class loader: https://github.com/apache/spark/pull/322/files

Spark on Mesos does not set Thread's context class loader
Key: SPARK-1403
URL: https://issues.apache.org/jira/browse/SPARK-1403
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker

I can run Spark 0.9.0 on Mesos but not Spark 1.0.0, because the Spark executor on the Mesos slave throws a java.lang.ClassNotFoundException for org.apache.spark.serializer.JavaSerializer. The lengthy discussion is here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513
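The "fall back to the classloader that loaded Spark" idea from the comment above can be sketched as a small helper. The object and method names here are illustrative, not Spark's actual API:

```scala
// Mesos executor threads may start with no context classloader set, which
// makes Class.forName-style lookups fail with ClassNotFoundException.
// Falling back to the loader that loaded this class avoids the null case.
object ClassLoaderUtil {
  def effectiveClassLoader: ClassLoader = {
    val ctx = Thread.currentThread().getContextClassLoader
    if (ctx != null) ctx else getClass.getClassLoader
  }

  // Resolve a class name against whichever loader is available.
  def loadClass(name: String): Class[_] =
    Class.forName(name, true, effectiveClassLoader)
}
```

The linked PR takes the blunter workaround of setting the context classloader explicitly; the helper above shows the relaxed-assumption alternative the comment advocates.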
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLlib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962048#comment-13962048 ]

Xiangrui Meng commented on SPARK-1406:

I think we should support PMML import/export in MLlib. PMML also provides feature transformations, for which MLlib has very limited support at this time. The questions are 1) how we leverage existing PMML packages, and 2) how many people volunteer. Sean, it would be super helpful if you could share some experience with Oryx's PMML support, since I'm also not sure whether this is the right time to start.

PMML model evaluation support via MLlib
Key: SPARK-1406
URL: https://issues.apache.org/jira/browse/SPARK-1406
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Thomas Darimont

It would be useful if Spark provided support for evaluating PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with a statistical modeling tool like R, SAS, or SPSS to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects, like JPMML-Evaluator, do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator
[jira] [Resolved] (SPARK-1218) Minibatch SGD with random sampling
[ https://issues.apache.org/jira/browse/SPARK-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-1218.

Resolution: Fixed
Fix Version/s: 0.9.0

Fixed in 0.9.0 or an earlier version.

Minibatch SGD with random sampling
Key: SPARK-1218
URL: https://issues.apache.org/jira/browse/SPARK-1218
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Ameet Talwalkar
Assignee: Shivaram Venkataraman
Fix For: 0.9.0

Takes a gradient function as input. At each iteration, we run stochastic gradient descent locally on each worker with a fraction of the data points selected randomly and with replacement (i.e., sampled points may overlap across iterations).
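A minimal local illustration of the scheme described above: each iteration samples a fraction of the points with replacement and takes an averaged gradient step. The gradient here is for one-dimensional least squares, and all names are illustrative rather than MLlib's API:

```scala
import scala.util.Random

// Minibatch SGD on (x, y) pairs for the model y ~ w * x.
// `fraction` of the points is drawn with replacement each iteration,
// so sampled points may overlap across iterations, as the issue describes.
def minibatchSgd(data: Array[(Double, Double)],
                 fraction: Double,
                 stepSize: Double,
                 iters: Int,
                 seed: Long = 42L): Double = {
  val rnd = new Random(seed)
  val batchSize = math.max(1, (data.length * fraction).toInt)
  var w = 0.0
  for (_ <- 0 until iters) {
    // sampling with replacement: the same point may be drawn twice
    val batch = Array.fill(batchSize)(data(rnd.nextInt(data.length)))
    // average least-squares gradient over the minibatch
    val grad = batch.map { case (x, y) => (w * x - y) * x }.sum / batchSize
    w -= stepSize * grad
  }
  w
}
```

In MLlib itself the per-iteration sampling is done on the RDD and gradients are aggregated across workers; the loop above shows only the single-machine arithmetic.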
[jira] [Resolved] (SPARK-1217) Add proximal gradient updater.
[ https://issues.apache.org/jira/browse/SPARK-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-1217.

Resolution: Fixed
Fix Version/s: 0.9.0

Add proximal gradient updater.
Key: SPARK-1217
URL: https://issues.apache.org/jira/browse/SPARK-1217
Project: Spark
Issue Type: Bug
Components: MLlib
Reporter: Ameet Talwalkar
Fix For: 0.9.0

Add proximal gradient updater, in particular for L1 regularization.
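For L1 regularization, the proximal operator is soft-thresholding: prox(v) = sign(v) * max(|v| - t*lambda, 0), applied coordinate-wise after a plain gradient step. A sketch of that two-phase update, with illustrative names rather than MLlib's Updater interface:

```scala
// Soft-thresholding: shrinks each coordinate toward zero by `shrink` and
// clips to exactly zero once it crosses, which is what produces sparsity.
def softThreshold(w: Array[Double], shrink: Double): Array[Double] =
  w.map(v => math.signum(v) * math.max(math.abs(v) - shrink, 0.0))

// One proximal-gradient step: gradient descent on the smooth loss,
// then the L1 proximal operator with threshold stepSize * lambda.
def proximalStep(w: Array[Double], grad: Array[Double],
                 stepSize: Double, lambda: Double): Array[Double] = {
  val moved = w.zip(grad).map { case (wi, gi) => wi - stepSize * gi }
  softThreshold(moved, stepSize * lambda)
}
```

Unlike subgradient methods, this update sets small coordinates exactly to zero, which is the main practical benefit for L1.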
[jira] [Resolved] (SPARK-1219) Minibatch SGD with disjoint partitions
[ https://issues.apache.org/jira/browse/SPARK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-1219.

Resolution: Fixed

Implemented in 0.9.0 or an earlier version.

Minibatch SGD with disjoint partitions
Key: SPARK-1219
URL: https://issues.apache.org/jira/browse/SPARK-1219
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Ameet Talwalkar

Takes a gradient function as input. At each iteration, we run stochastic gradient descent locally on each worker with a fraction (alpha) of the data points selected randomly and disjointly (i.e., we ensure that we touch all datapoints after at most 1/alpha iterations).
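In contrast to SPARK-1218's sampling with replacement, the disjoint scheme can be pictured as shuffling the data once and cycling through fixed-size chunks, so every point is visited within at most ceil(1/alpha) iterations. A minimal local sketch, not the distributed implementation:

```scala
import scala.util.Random

// Split the data into disjoint minibatches of size alpha * n after one
// random shuffle; iterating over the result touches every point exactly
// once per epoch, unlike sampling with replacement.
def disjointBatches[A](data: Seq[A], alpha: Double, seed: Long = 7L): Iterator[Seq[A]] = {
  val batchSize = math.max(1, (data.length * alpha).toInt)
  new Random(seed).shuffle(data).grouped(batchSize)
}
```

Each batch would feed one SGD iteration; disjointness is what guarantees the "all datapoints after at most 1/alpha iterations" property in the description.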
[jira] [Resolved] (SPARK-1099) Allow inferring number of cores with local[*]
[ https://issues.apache.org/jira/browse/SPARK-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Davidson resolved SPARK-1099.

Resolution: Fixed

Allow inferring number of cores with local[*]
Key: SPARK-1099
URL: https://issues.apache.org/jira/browse/SPARK-1099
Project: Spark
Issue Type: Improvement
Components: Deploy
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Minor
Fix For: 1.0.0

It seems reasonable that the default number of cores used by Spark's local mode (when no value is specified) is drawn from the spark.cores.max configuration parameter (which, conveniently, is now settable as a command-line option in spark-shell). For the sake of consistency, this change would probably also entail making the default number of cores, when spark.cores.max is NOT specified, as many logical cores as are on the machine (which is what standalone mode does). This too seems reasonable, as Spark is inherently a distributed system and I think it's expected that it should use multiple cores by default. However, it is a behavioral change, and thus requires caution.
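The local[*] behavior comes down to parsing the master string and substituting the machine's logical core count. A minimal sketch of that dispatch (the real parsing lives in SparkContext's master-URL matching, and the function name here is hypothetical):

```scala
// Map a local-mode master string to a core count:
//   "local"    -> 1 core
//   "local[*]" -> all logical cores on the machine (the new behavior)
//   "local[N]" -> exactly N cores
def parseLocalCores(master: String): Int = {
  val LocalN = """local\[(\d+)\]""".r
  master match {
    case "local"    => 1
    case "local[*]" => Runtime.getRuntime.availableProcessors()
    case LocalN(n)  => n.toInt
    case _          => throw new IllegalArgumentException(s"not a local master: $master")
  }
}
```

Using availableProcessors() mirrors what standalone mode already does when no core limit is set, which is the consistency argument the issue makes.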