[GitHub] spark pull request: SPARK-1216. Add a OneHotEncoder for handling c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/304#issuecomment-40335789 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14109/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1216. Add a OneHotEncoder for handling c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/304#issuecomment-40335788 Merged build finished. All automated tests passed.
[GitHub] spark pull request: Decision Tree documentation for MLlib programm...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/402#issuecomment-40336644 Merged build finished. All automated tests passed.
[GitHub] spark pull request: Decision Tree documentation for MLlib programm...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/402#issuecomment-40336645 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14110/
[GitHub] spark pull request: [WIP] SPARK-1430: Support sparse data in Pytho...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/341#discussion_r11573232
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala ---
@@ -185,4 +193,39 @@ class SparseVector(
   }

   private[mllib] override def toBreeze: BV[Double] = new BSV[Double](indices, values, size)
+
+  override def apply(pos: Int): Double = {
+    // A more efficient apply() than creating a new Breeze vector
--- End diff --
Good point, I'll remove this and split() because they're no longer needed. They were needed when we passed vectors with the label included from Python instead of passing LabeledPoint.
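For readers without the full diff, the idea behind "a more efficient apply()" can be sketched as a binary search over the sorted index array, which avoids allocating a Breeze vector per lookup. This is a minimal sketch and not the code from the pull request: `SparseLookupSketch` and `Sparse` are hypothetical names, and the sketch assumes `indices` is strictly increasing.

```scala
import java.util.Arrays

object SparseLookupSketch {
  // Hypothetical stand-in for a sparse vector: `indices` is assumed to be
  // sorted in increasing order, with `values(k)` stored at `indices(k)`.
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) {
    def apply(pos: Int): Double = {
      require(pos >= 0 && pos < size, s"index $pos out of range [0, $size)")
      // binarySearch returns a non-negative slot when pos is stored explicitly,
      // and a negative insertion point otherwise.
      val k = Arrays.binarySearch(indices, pos)
      if (k >= 0) values(k) else 0.0 // entries that are not stored are zero
    }
  }
}
```

The lookup costs O(log nnz) per call, where nnz is the number of stored entries.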
[GitHub] spark pull request: misleading task number of groupByKey
GitHub user CrazyJvm opened a pull request: https://github.com/apache/spark/pull/403

misleading task number of groupByKey

The statement "By default, this uses only 8 parallel tasks to do the grouping." is quite misleading. Please refer to https://github.com/apache/spark/pull/389. The details are in the following code:

    def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
      val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
      for (r <- bySize if r.partitioner.isDefined) {
        return r.partitioner.get
      }
      if (rdd.context.conf.contains("spark.default.parallelism")) {
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        new HashPartitioner(bySize.head.partitions.size)
      }
    }

You can merge this pull request into a Git repository by running: $ git pull https://github.com/CrazyJvm/spark patch-4
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/403.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #403

commit 156833643d9ea1479222e9033164e92a9846351c
Author: Chen Chao crazy...@gmail.com
Date: 2014-04-14T07:39:50Z
misleading task number of groupByKey
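The selection order in the `defaultPartitioner` snippet quoted in pull request #403 can be modelled without Spark. The sketch below is an illustration only: `FakeRdd` and `choosePartitions` are hypothetical names, an existing partitioner is reduced to its partition count, and `defaultParallelism` stands in for the `spark.default.parallelism` setting.

```scala
object PartitionerSketch {
  // Hypothetical stand-in for an RDD: its partition count, plus the partition
  // count of an existing partitioner if it has one.
  final case class FakeRdd(numPartitions: Int, partitioner: Option[Int] = None)

  def choosePartitions(
      defaultParallelism: Option[Int], // models spark.default.parallelism being set or not
      rdd: FakeRdd,
      others: FakeRdd*): Int = {
    // Largest inputs first, mirroring sortBy(_.partitions.size).reverse.
    val bySize = (Seq(rdd) ++ others).sortBy(_.numPartitions).reverse
    bySize
      // 1. Reuse the partitioner of the largest input that already has one.
      .collectFirst { case FakeRdd(_, Some(p)) => p }
      // 2. Otherwise use the configured default parallelism, if it is set.
      .orElse(defaultParallelism)
      // 3. Otherwise fall back to the partition count of the largest input.
      .getOrElse(bySize.head.numPartitions)
  }
}
```

Under this model no fixed task count appears anywhere: with no existing partitioner and no configured default, the result simply follows the largest input.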
[GitHub] spark pull request: misleading task number of groupByKey
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/403#issuecomment-40340359 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-1477: Add the lifecycle interface
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/379#issuecomment-40344734 We are currently a little swamped with Spark 1.0 stuff, we will definitely take a look soon.
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40362749 Merged build triggered.
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40362755 Merged build started.
[GitHub] spark pull request: SPARK-1488. Resolve scalac feature warnings du...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/404

SPARK-1488. Resolve scalac feature warnings during build

For your consideration: scalac currently notes a number of feature warnings during compilation:

```
[warn] there were 65 feature warning(s); re-run with -feature for details
```

Warnings are like:

```
[warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scala docs for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn] implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
[warn]^
```

scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used. This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-1488
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/404.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #404

commit 39bc83115d5a55527e4f893fd480039896b6a63f
Author: Sean Owen so...@cloudera.com
Date: 2014-04-08T11:24:28Z
Enable -feature in scalac to emit language feature warnings

commit 859898002573f24c53d458db3e61b91b3c9da841
Author: Sean Owen so...@cloudera.com
Date: 2014-04-08T12:09:45Z
Quiet scalac warnings about language features by explicitly importing language features.
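As an illustration of the kind of change described in pull request #404, the sketch below shows an implicit conversion compiled with the feature import in place. It is not taken from the Spark sources: `FeatureImportSketch`, `RichCount` and `intToRichCount` are hypothetical names chosen for the example.

```scala
// Importing the feature where it is used is what scalac's warning asks for.
import scala.language.implicitConversions

object FeatureImportSketch {
  // Hypothetical wrapper type, present only so there is something to convert to.
  final class RichCount(val n: Int) {
    def doubled: Int = n * 2
  }

  // Defining an implicit conversion method like this one is what draws the
  // feature warning under -feature when the import above is missing.
  implicit def intToRichCount(n: Int): RichCount = new RichCount(n)

  def demo(): Int = {
    val n = 21
    n.doubled // resolved through the implicit conversion to RichCount
  }
}
```

The alternative mentioned in the warning text, the compiler option `-language:implicitConversions`, enables the feature for the whole build rather than per file.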
[GitHub] spark pull request: SPARK-1488. Resolve scalac feature warnings du...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/404#issuecomment-40364966 Merged build triggered.
[GitHub] spark pull request: SPARK-1488. Resolve scalac feature warnings du...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/404#issuecomment-40364977 Merged build started.
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40366369 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-40366371 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14111/
[GitHub] spark pull request: SPARK-1465: Spark compilation is broken with t...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/396#issuecomment-40369872 Spark shouldn't be using it directly since it got marked as private in the Hadoop 2.2 release. I believe Spark was using that API before the 2.2 release, so it was easy to miss. Also, when it was changed to private, MapReduce was not updated to stop using it, so Hadoop is breaking its own API rules. These functions are utility functions and could be used by many types of applications, so ideally a new public class with these functions is created in YARN. I think we should commit this PR (after review), since Spark on YARN can't be run against the 2.4 release now, and then if a new YARN utility class is created we can look at using that.
[GitHub] spark pull request: SPARK-1465: Spark compilation is broken with t...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/396#issuecomment-40369924 Also note I filed https://issues.apache.org/jira/browse/SPARK-1472 to go through the rest of the YARN apis.
[GitHub] spark pull request: SPARK-1488. Resolve scalac feature warnings du...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/404#issuecomment-40369322 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14112/
[GitHub] spark pull request: SPARK-1488. Resolve scalac feature warnings du...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/404#issuecomment-40369320 Merged build finished. All automated tests passed.
[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...
Github user andy327 commented on the pull request: https://github.com/apache/spark/pull/10#issuecomment-40374419 Alternatively, it can be done without the added algebird dependency, if that's desired.
[GitHub] spark pull request: SPARK-1408 Modify Spark on Yarn to point to th...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/362#issuecomment-40386021 Merged build triggered.
[GitHub] spark pull request: SPARK-1408 Modify Spark on Yarn to point to th...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/362#issuecomment-40386037 Merged build started.
[GitHub] spark pull request: SPARK-1310: Start adding k-fold cross validati...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/18#issuecomment-40388669 @pwendell Could you help merge this PR into both master and branch-1.0? Thanks!
[GitHub] spark pull request: SPARK-1408 Modify Spark on Yarn to point to th...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/362#issuecomment-40390542 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1408 Modify Spark on Yarn to point to th...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/362#issuecomment-40390543 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14113/
[GitHub] spark pull request: SPARK-1310: Start adding k-fold cross validati...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/18#issuecomment-40392687 @mengxr @holdenk this does not merge cleanly at the moment - there are some conflicts in MLUtils.
[GitHub] spark pull request: SPARK-1478
GitHub user tmalaska opened a pull request: https://github.com/apache/spark/pull/405

SPARK-1478

Initial Version

You can merge this pull request into a Git repository by running: $ git pull https://github.com/tmalaska/spark master
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/405.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #405

commit c433827db5dfda6f5b1b6aa11e45447525b4aac4
Author: tmalaska ted.mala...@cloudera.com
Date: 2014-04-14T17:37:01Z
SPARK-1478
[GitHub] spark pull request: SPARK-1478
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/405#issuecomment-40395599 Can one of the admins verify this patch?
[GitHub] spark pull request: [BUGFIX] In-memory columnar storage bug fixes
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/374#issuecomment-40403308 Merged build started.
[GitHub] spark pull request: SPARK-1235: manage the DAGScheduler EventProce...
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-40410340 I'll look at it some more tomorrow, but this needs to be rebased to current master -- e.g.,

diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index e637ddc..9657cbf 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -482,12 +482,19 @@ class DAGScheduler(
   private[scheduler] def doCancelAllJobs() {
     // Cancel all running jobs.
-    runningStages.map(_.jobId).foreach(handleJobCancellation)
+    runningStages.map(_.jobId).foreach(handleJobCancellation(_, "as part of cancellation of all jobs"))
     activeJobs.clear() // These should already be empty by this point,
     jobIdToActiveJob.clear() // but just in case we lost track of some jobs...
   }

   /**
+   * Cancel all jobs associated with a running or scheduled stage.
+   */
+  def cancelStage(stageId: Int) {
+    eventProcessActor ! StageCancelled(stageId)
+  }
+
+  /**
    * Resubmit any failed stages. Ordinarily called after a small amount of time has passed since
    * the last fetch failure.
    */
@@ -849,11 +856,23 @@ class DAGScheduler(
     }
   }

-  private[scheduler] def handleJobCancellation(jobId: Int) {
+  private[scheduler] def handleStageCancellation(stageId: Int) {
+    if (stageIdToJobIds.contains(stageId)) {
+      val jobsThatUseStage: Array[Int] = stageIdToJobIds(stageId).toArray
+      jobsThatUseStage.foreach(jobId => {
+        handleJobCancellation(jobId, "because Stage %s was cancelled".format(stageId))
+      })
+    } else {
+      logInfo("No active jobs to kill for Stage " + stageId)
+    }
+  }
+
+  private[scheduler] def handleJobCancellation(jobId: Int, reason: String = "") {
     if (!jobIdToStageIds.contains(jobId)) {
       logDebug("Trying to cancel unregistered job " + jobId)
     } else {
-      failJobAndIndependentStages(jobIdToActiveJob(jobId), s"Job $jobId cancelled", None)
+      failJobAndIndependentStages(jobIdToActiveJob(jobId),
+        s"Job $jobId cancelled $reason", None)
     }
   }

@@ -1060,6 +1079,9 @@ private[scheduler] class DAGSchedulerEventProcessActor(dagScheduler: DAGSchedule
       dagScheduler.submitStage(finalStage)
     }

+    case StageCancelled(stageId) =>
+      dagScheduler.handleStageCancellation(stageId)
+
     case JobCancelled(jobId) =>
       dagScheduler.handleJobCancellation(jobId)

@@ -1069,7 +1091,7 @@ private[scheduler] class DAGSchedulerEventProcessActor(dagScheduler: DAGSchedule
       val activeInGroup = dagScheduler.activeJobs.filter(activeJob =>
         groupId == activeJob.properties.get(SparkContext.SPARK_JOB_GROUP_ID))
       val jobIds = activeInGroup.map(_.jobId)
-      jobIds.foreach(dagScheduler.handleJobCancellation)
+      jobIds.foreach(dagScheduler.handleJobCancellation(_, s"as part of cancelled job group %groupId"))

     case AllJobsCancelled =>
       dagScheduler.doCancelAllJobs()
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/353#discussion_r11605070
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, axpy}
+import breeze.optimize.{CachedDiffFunction, DiffFunction}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+
+/**
+ * Class used to solve an optimization problem using Limited-memory BFGS.
+ * Reference: [[http://en.wikipedia.org/wiki/Limited-memory_BFGS]]
+ * @param gradient Gradient function to be used.
+ * @param updater Updater to be used to update weights after every iteration.
+ */
+class LBFGS(private var gradient: Gradient, private var updater: Updater)
+  extends Optimizer with Logging {
+
+  private var numCorrections = 10
+  private var convergenceTol = 1E-4
+  private var maxNumIterations = 100
+  private var regParam = 0.0
+  private var miniBatchFraction = 1.0
+
+  /**
+   * Set the number of corrections used in the LBFGS update. Default 10.
+   * Values of numCorrections less than 3 are not recommended; large values
+   * of numCorrections will result in excessive computing time.
+   * 3 < numCorrections < 10 is recommended.
+   * Restriction: numCorrections > 0
+   */
+  def setNumCorrections(corrections: Int): this.type = {
+    assert(corrections > 0)
+    this.numCorrections = corrections
+    this
+  }
+
+  /**
+   * Set fraction of data to be used for each L-BFGS iteration. Default 1.0.
+   */
+  def setMiniBatchFraction(fraction: Double): this.type = {
+    this.miniBatchFraction = fraction
+    this
+  }
+
+  /**
+   * Set the convergence tolerance of iterations for L-BFGS. Default 1E-4.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   */
+  def setConvergenceTol(tolerance: Int): this.type = {
+    this.convergenceTol = tolerance
+    this
+  }
+
+  /**
+   * Set the maximal number of iterations for L-BFGS. Default 100.
+   */
+  def setMaxNumIterations(iters: Int): this.type = {
+    this.maxNumIterations = iters
+    this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+    this.regParam = regParam
+    this
+  }
+
+  /**
+   * Set the gradient function (of the loss function of one single data example)
+   * to be used for L-BFGS.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+    this.gradient = gradient
+    this
+  }
+
+  /**
+   * Set the updater function to actually perform a gradient step in a given direction.
+   * The updater is responsible to perform the update from the regularization term as well,
+   * and therefore determines what kind or regularization is used, if any.
+   */
+  def setUpdater(updater: Updater): this.type = {
+    this.updater = updater
+    this
+  }
+
+  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
+    val (weights, _) = LBFGS.runMiniBatchLBFGS(
+      data,
+      gradient,
+      updater,
+      numCorrections,
+      convergenceTol,
+      maxNumIterations,
+      regParam,
+      miniBatchFraction,
+      initialWeights)
+    weights
+  }
+
+}
+
+/**
+ * Top-level method to run LBFGS.
+ */
+object LBFGS extends Logging {
+  /**
+   * Run Limited-memory BFGS (L-BFGS) in parallel using mini batches.
+   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
+   * in order to
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40414083 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14117/
[GitHub] spark pull request: SPARK-1235: manage the DAGScheduler EventProce...
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/186#issuecomment-40414779 Failing RAT checks not related to this PR. This PR runs and passes all the tests for me locally, but I want to take another close look at it tomorrow -- and with any luck, someone will have made Jenkins happy by then.

On Mon, Apr 14, 2014 at 1:26 PM, Nan Zhu notificati...@github.com wrote:
> Eh...just rebased, but Jenkins is not happy...
> Reply to this email directly or view it on GitHub: https://github.com/apache/spark/pull/186#issuecomment-40413762
[GitHub] spark pull request: SPARK-1478
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/405#issuecomment-40416735 Merged build triggered.
[GitHub] spark pull request: SPARK-1478
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/405#issuecomment-40416619 Jenkins, test this please. @tmalaska mind updating the title of the PR to include the title of the JIRA? It makes it easier when scanning the (long) list of active pull requests.
[GitHub] spark pull request: SPARK-1478
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/405#issuecomment-40416751 Merged build started.
[GitHub] spark pull request: SPARK-1478
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/405#issuecomment-40416779 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14118/
[GitHub] spark pull request: support leftsemijoin for sparkSQL
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/395#issuecomment-40420386 ok to test
[GitHub] spark pull request: SPARK-1478
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/405#issuecomment-40420675 @tmalaska I did a cursory pass, and this looks good. I will do a more detailed pass soon. However, there is something you should know. I am in the middle of a PR ( #300 ) that tweaks the receiver API a little bit for greater stability, so a bit of your code will have to change a little. This should go in pretty soon (a couple of days, max). The PR has the changes necessary for the current FlumeReceiver.
[GitHub] spark pull request: SPARK-1474: Spark on yarn assembly doesn't inc...
GitHub user tgravescs opened a pull request: https://github.com/apache/spark/pull/406 SPARK-1474: Spark on yarn assembly doesn't include AmIpFilter We use org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter in Spark on YARN but do not include it in the assembly jar. I tested this on a YARN cluster by removing the YARN jars from the classpath, and Spark runs fine now. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgravescs/spark SPARK-1474 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/406.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #406 commit 1548bf955a1d2ca410af0b447ad1bcf4840b326e Author: Thomas Graves tgra...@apache.org Date: 2014-04-14T17:52:20Z SPARK-1474: Spark on yarn assembly doesn't include org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
[GitHub] spark pull request: SPARK-1478
Github user tmalaska commented on the pull request: https://github.com/apache/spark/pull/405#issuecomment-40421425 Yeah, no problem. Thanks for taking the time to review my code. This is my first time committing with Scala :) Just let me know when ( #300 ) is done and I will check out again. Also, when you have time, I would love to know how else I could help. I was thinking of adding: - encryption to the Flume Stream, as in Flume 1.4.0. - Failure recovery support for when a Flume Stream host goes down and Spark starts up the Flume Stream on another node.
[GitHub] spark pull request: SPARK-1474: Spark on yarn assembly doesn't inc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/406#issuecomment-40421597 Merged build triggered.
[GitHub] spark pull request: SPARK-1474: Spark on yarn assembly doesn't inc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/406#issuecomment-40421710 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14119/
[GitHub] spark pull request: SPARK-1474: Spark on yarn assembly doesn't inc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/406#issuecomment-40421708 Merged build finished.
[GitHub] spark pull request: [SPARK-1281] Improve partitioning in ALS
GitHub user tmyklebu opened a pull request: https://github.com/apache/spark/pull/407 [SPARK-1281] Improve partitioning in ALS ALS was using HashPartitioner and explicit uses of `%` together. Further, the naked use of `%` meant that, if the number of partitions corresponded with the stride of arithmetic progressions appearing in user and product ids, users and products could be mapped into buckets in an unfair or unwise way. This pull request: 1) Makes the Partitioner an instance variable of ALS. 2) Replaces the direct uses of `%` with calls to a Partitioner. 3) Defines an anonymous Partitioner that scrambles the bits of the object's hashCode before reducing to the number of present buckets. This pull request does not make the partitioner user-configurable. I'm not all that happy about the way I did (1). It introduces an icky lifetime issue and dances around it by nulling something. However, I don't know a better way to make the partitioner visible everywhere it needs to be visible. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tmyklebu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/407.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #407 commit c774d7d4bff91c9387d059d1189799fa0ff1f4b0 Author: Tor Myklebust tmykl...@gmail.com Date: 2014-04-14T22:01:18Z Make the partitioner a member variable and use it instead of modding directly. commit c90b6d8e91f86cf89adf28de6f9185647c87e5c8 Author: Tor Myklebust tmykl...@gmail.com Date: 2014-04-14T22:10:30Z Scramble user and product ids before bucketing.
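The stride problem described in the [SPARK-1281] pull request above can be illustrated with a small, Spark-free sketch. It is not the code from the PR: the object name, the helper names, and the choice of bit mixer (the 32-bit finalizer from MurmurHash3) are all assumptions made for illustration. It only shows why a naked `%` collapses ids with a stride equal to the bucket count into one bucket, and how scrambling the hash first can spread them out.

```scala
// Illustrative sketch only; names and the mixing function are assumptions,
// not the partitioner defined in the pull request.
object ScrambledBucketing {
  // Bit mixer: the 32-bit finalizer from MurmurHash3 (an assumed choice).
  def scramble(h: Int): Int = {
    var x = h
    x ^= x >>> 16
    x *= 0x85ebca6b
    x ^= x >>> 13
    x *= 0xc2b2ae35
    x ^= x >>> 16
    x
  }

  // Non-negative remainder, so negative hashes still map to a valid bucket.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val r = x % mod
    if (r < 0) r + mod else r
  }

  // Bucketing by a naked modulo of the id.
  def nakedBucket(id: Int, numBuckets: Int): Int =
    nonNegativeMod(id, numBuckets)

  // Bucketing after scrambling the bits of the id's hashCode.
  def scrambledBucket(id: Int, numBuckets: Int): Int =
    nonNegativeMod(scramble(id.hashCode), numBuckets)

  def main(args: Array[String]): Unit = {
    val numBuckets = 10
    // Ids form an arithmetic progression whose stride equals the bucket count.
    val ids = (0 until 1000).map(_ * numBuckets)
    val nakedUsed = ids.map(nakedBucket(_, numBuckets)).distinct.size
    val scrambledUsed = ids.map(scrambledBucket(_, numBuckets)).distinct.size
    println(s"buckets used with naked %: $nakedUsed")
    println(s"buckets used after scrambling: $scrambledUsed")
  }
}
```

With the naked modulo every id in this progression lands in bucket 0; how evenly the scrambled version spreads them depends on the mixer chosen.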
[GitHub] spark pull request: [BUGFIX] In-memory columnar storage bug fixes
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/374#issuecomment-40425590 Thanks, merged into master and 1.0.
[GitHub] spark pull request: Clean up and simplify Spark configuration
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/299#issuecomment-40425587 Merged build triggered.
[GitHub] spark pull request: [SPARK-1281] Improve partitioning in ALS
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/407#issuecomment-40425686 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14120/
[GitHub] spark pull request: [SPARK-1281] Improve partitioning in ALS
Github user tmyklebu commented on the pull request: https://github.com/apache/spark/pull/407#issuecomment-40425865 Build failure. Looks like a config issue in Jenkins?
[GitHub] spark pull request: Clean up and simplify Spark configuration
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/299#issuecomment-40428500 Merged build finished.
[GitHub] spark pull request: Clean up and simplify Spark configuration
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/299#issuecomment-40428502 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14121/
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40429267 Merged build triggered.
[GitHub] spark pull request: [BUGFIX] In-memory columnar storage bug fixes
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/374
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40432076 Merged build started.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40432158 Merged build finished.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40432159 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14123/
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user ahirreddy commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40432281 This is a MIMA checker issue because we now include Hive in the assembly jar when building on Jenkins. See Jira SPARK-1494 for more information. https://issues.apache.org/jira/browse/SPARK-1494
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40433881 Merged build started.
[GitHub] spark pull request: Make spark logo link refer to /.
Github user ash211 commented on the pull request: https://github.com/apache/spark/pull/408#issuecomment-40434113 +1 from me -- I've done the URL editing that Marcelo described before. On Tue, Apr 15, 2014 at 12:54 AM, Patrick Wendell notificati...@github.com wrote: This seems like a decent idea - @andrewor14 ?
[GitHub] spark pull request: support leftsemijoin for sparkSQL
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/395#issuecomment-40434377 Besides the BroadcastNestedLoopJoin, I think the left semi join may also need to be implemented in the HashJoin.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/353
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40434555 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
GitHub user dbtsai reopened a pull request: https://github.com/apache/spark/pull/353 [SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation. This PR uses Breeze's L-BFGS implementation; the Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr ! When used with a regularized updater, we need to compute regVal and regGradient (the gradient of the regularization part of the cost function), and in the current updater design we can compute those two values in the following way. Let's review how the updater works when returning newWeights given the input parameters: w' = w - thisIterStepSize * (gradient + regGradient(w)). Note that regGradient is a function of w! If we set gradient = 0 and thisIterStepSize = 1, then regGradient(w) = w - w'. As a result, regVal can be computed by val regVal = updater.compute( weights, new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2 and regGradient can be obtained by val regGradient = weights.sub( updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1) The PR includes tests that compare the results with SGD, with and without regularization. We did a comparison between LBFGS and SGD, and we often saw 10x fewer steps with LBFGS while the cost per step is the same (just computing the gradient). The following paper by Prof. Ng at Stanford compares different optimizers, including LBFGS and SGD. They use them in the context of deep learning, but it is worth reading as a reference. http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf You can merge this pull request into a Git repository by running: $ git pull https://github.com/dbtsai/spark dbtsai-LBFGS Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/353.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #353 commit 984b18e21396eae84656e15da3539ff3b5f3bf4a Author: DB Tsai dbt...@alpinenow.com Date: 2014-04-05T00:06:50Z L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation issue in GradientDescent optimizer.
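The trick described in the L-BFGS pull request above (recovering regVal and regGradient from an updater by passing a zero gradient) can be checked with a small, Spark-free sketch. The `compute` function below is a hypothetical stand-in for an L2-regularized updater, written over plain arrays; it is an assumption for illustration and not MLlib's actual Updater class. It follows the update rule quoted in the PR, w' = w - thisIterStepSize * (gradient + regGradient(w)), with regGradient(w) = regParam * w.

```scala
// Illustrative sketch only: `compute` is a hypothetical stand-in for an
// L2-regularized updater, not MLlib's Updater API.
object RegFromUpdaterSketch {
  // Returns (newWeights, regVal evaluated at newWeights).
  def compute(
      w: Array[Double],
      gradient: Array[Double],
      stepSize: Double,
      iter: Int,
      regParam: Double): (Array[Double], Double) = {
    val thisIterStepSize = stepSize / math.sqrt(iter.toDouble)
    // w' = w - thisIterStepSize * (gradient + regParam * w)
    val newW = w.indices.map { i =>
      w(i) - thisIterStepSize * (gradient(i) + regParam * w(i))
    }.toArray
    val regVal = 0.5 * regParam * newW.map(x => x * x).sum
    (newW, regVal)
  }

  // Step size 0 leaves the weights unchanged, so the second tuple element
  // is the regularization value at w itself.
  def regVal(w: Array[Double], regParam: Double): Double =
    compute(w, Array.fill(w.length)(0.0), 0.0, 1, regParam)._2

  // Zero gradient, step size 1, iteration 1: regGradient(w) = w - w'.
  def regGradient(w: Array[Double], regParam: Double): Array[Double] = {
    val newW = compute(w, Array.fill(w.length)(0.0), 1.0, 1, regParam)._1
    w.zip(newW).map { case (a, b) => a - b }
  }

  def main(args: Array[String]): Unit = {
    val w = Array(1.0, -2.0)
    println(s"regVal = ${regVal(w, 0.5)}")
    println(s"regGradient = ${regGradient(w, 0.5).mkString(", ")}")
  }
}
```

For this stand-in, the recovered gradient should equal regParam * w, which gives a quick sanity check on the w - w' identity.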
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40434626 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40434691 The latest Jenkins run timed out. It seems that CI is not stable now.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40434890 Merged build triggered.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40434895 Merged build started.
[GitHub] spark pull request: support leftsemijoin for sparkSQL
Github user adrian-wang commented on the pull request: https://github.com/apache/spark/pull/395#issuecomment-40436922 I'll create a JIRA soon.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40437823 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1488. Resolve scalac feature warnings du...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/404#issuecomment-40437961 Aha, finally! LGTM and thanks for working on this.
[GitHub] spark pull request: [SPARK-1281] Improve partitioning in ALS
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/407#issuecomment-40438216 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-1281] Improve partitioning in ALS
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/407#discussion_r11617460 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala --- @@ -96,6 +97,7 @@ class ALS private ( private var lambda: Double, private var implicitPrefs: Boolean, private var alpha: Double, +private var partitioner: Partitioner = null, --- End diff -- Do not put the partitioner in the constructor args. Use setters and make the HashPartitioner the default. Also, we should separate userPartitioner/numUserBlocks and productPartitioner/numProductBlocks.
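The review suggestion above (setters instead of a constructor argument, a hash partitioner as the default, and separate user and product partitioners) could look roughly like the following sketch. Everything here is hypothetical: `Partitioner` and `HashPartitioner` are minimal stand-ins defined locally so the example runs without Spark, and `AlsConfig` is an invented holder class, not the real ALS class or the change that was eventually merged.

```scala
// Illustrative sketch only; all types are local stand-ins, not Spark classes.
object AlsSetterSketch {
  // Minimal stand-in for a partitioner interface.
  trait Partitioner {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Default: non-negative modulo of the key's hashCode.
  class HashPartitioner(val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = {
      val r = key.hashCode % numPartitions
      if (r < 0) r + numPartitions else r
    }
  }

  // Hypothetical configuration holder with separate user/product partitioners,
  // each defaulting to a hash partitioner and replaceable through a setter.
  class AlsConfig(numUserBlocks: Int, numProductBlocks: Int) {
    private var userPartitioner: Partitioner = new HashPartitioner(numUserBlocks)
    private var productPartitioner: Partitioner = new HashPartitioner(numProductBlocks)

    def setUserPartitioner(p: Partitioner): this.type = { userPartitioner = p; this }
    def setProductPartitioner(p: Partitioner): this.type = { productPartitioner = p; this }

    def userBlock(userId: Int): Int = userPartitioner.getPartition(userId)
    def productBlock(productId: Int): Int = productPartitioner.getPartition(productId)
  }

  def main(args: Array[String]): Unit = {
    val config = new AlsConfig(numUserBlocks = 4, numProductBlocks = 3)
    println(s"user 10 -> block ${config.userBlock(10)}")
    println(s"product 10 -> block ${config.productBlock(10)}")
  }
}
```

Keeping the defaults inside the class means callers who never touch the setters get the hash-based behaviour, and nothing needs to be nulled out.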
[GitHub] spark pull request: [SPARK-1281] Improve partitioning in ALS
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/407#issuecomment-40438453 Merged build started.
[GitHub] spark pull request: [SPARK-1281] Improve partitioning in ALS
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/407#issuecomment-40438446 Merged build triggered.
[GitHub] spark pull request: [SPARK-1281] Improve partitioning in ALS
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/407#issuecomment-40438518 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14128/
[GitHub] spark pull request: [SPARK-1281] Improve partitioning in ALS
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/407#issuecomment-40438517 Merged build finished.
[GitHub] spark pull request: [SQL] SPARK-1424 Generalize insertIntoTable fu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/354#issuecomment-40439180 Merged build started.
[GitHub] spark pull request: [SQL] SPARK-1424 Generalize insertIntoTable fu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/354#issuecomment-40439169 Merged build triggered.
[GitHub] spark pull request: [SQL] SPARK-1424 Generalize insertIntoTable fu...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/354#issuecomment-40439248 Okay, I updated the API based on a conversation with @mateiz. I also added the relevant function to the Java API. We can do Python in a follow-up PR once this is merged. Once Jenkins passes, I think this is ready to merge.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40439428 Merged build triggered.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40439431 Merged build started.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40439427 Regarding the longer test time, we should make sure that we aren't just comparing to times when the Hive tests weren't running at all. We should definitely look into the increased verbosity of the logs (even though that might not have been caused by this PR, but by turning the Hive tests back on). It is possible that we should just add more packages to `sql/hive/src/main/resources/log4j.properties`.
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40439479 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14126/
[GitHub] spark pull request: [SPARK-1157][MLlib] L-BFGS Optimizer based on ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/353#issuecomment-40439478 Merged build finished.
[GitHub] spark pull request: [SQL] SPARK-1424 Generalize insertIntoTable fu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/354#issuecomment-40439900 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14127/
[GitHub] spark pull request: [SQL] SPARK-1424 Generalize insertIntoTable fu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/354#issuecomment-40439899 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40440452 @marmbrus I see- the duration issue was just that we had stopped running hive tests for a bit after Aaron's build change.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40440628 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14130/
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40440638 I manually cancelled this build since we'll need to retest.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40440626 Merged build finished.
[GitHub] spark pull request: [SQL] SPARK-1424 Generalize insertIntoTable fu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/354#issuecomment-40440870 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14129/
[GitHub] spark pull request: [SQL] SPARK-1424 Generalize insertIntoTable fu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/354#issuecomment-40440869 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1488. Resolve scalac feature warnings du...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/404#issuecomment-40440919 Thanks - I've merged this. Good call.
[GitHub] spark pull request: Include stack trace for exceptions thrown by u...
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/409 Include stack trace for exceptions thrown by user code. It is very confusing when your code throws an exception, but the only stack trace shown is in the DAGScheduler. This is a simple patch to include the stack trace for the actual failure in the error message. Suggestions on formatting welcome.

Before:
```
scala> sc.parallelize(1 :: Nil).map(_ => sys.error("Ahh!")).collect()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times (most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
...
```
After:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times, most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!
scala.sys.package$.error(package.scala:27)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:110)
org.apache.spark.scheduler.Task.run(Task.scala:50)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:46)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1037)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:614)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:143)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
```
[GitHub] spark pull request: Include stack trace for exceptions thrown by u...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/409#issuecomment-40441434 Merged build triggered.
[GitHub] spark pull request: SPARK-1488. Resolve scalac feature warnings du...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/404
[GitHub] spark pull request: Make spark logo link refer to /.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/408#issuecomment-40444236 Jenkins, test this please.
[GitHub] spark pull request: Make spark logo link refer to /.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/408#issuecomment-40444359 Merged build triggered.
[GitHub] spark pull request: SPARK-1374: PySpark API for SparkSQL
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/363#issuecomment-40444360 Merged build triggered.
[GitHub] spark pull request: Make spark logo link refer to /.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/408#issuecomment-40444365 Merged build started.