[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/4934#issuecomment-78192837 Hi, replying to some of the statements above: It seems @staple has already implemented backtracking (because he has results in the JIRA), but kept them out of this PR to keep it simple, so we can tackle that afterwards. I wrote a backtracking implementation (and checked that it performs the same as the TFOCS implementation). Currently it is just a port of the TFOCS version. I'd need a little time to make it Scala/Spark idiomatic, but the turnaround would be pretty fast. For example, if we add a line search option, what is the semantic of agd.setStepSize(1.0).useLineSearch()? TFOCS supports a suggested initial Lipschitz value (a variable named 'L'), which is just a starting point for line search, so a corresponding behavior would be to use the step size as an initial suggestion only when line search is enabled. It may be desirable to use a parameter name like 'L' instead of 'stepSize' to make the meaning clearer. In TFOCS you can disable backtracking line search by setting several parameters (L, Lexact, alpha, and beta) which individually control different aspects of the backtracking implementation. For Spark it may make sense to provide backtracking modes that are configured explicitly, for example a fixed Lipschitz bound (no backtracking), backtracking line search based on the TFOCS implementation, or possibly an alternative line search implementation that is more conservative about performing round trip aggregations. Then there could be a setBacktrackingMode() setter to configure which mode is used. Moving forward there may be a need to support additional acceleration algorithms in addition to Auslender and Teboulle. These might be configurable via a setAlgorithm() function. Btw, I don't think we need to stick to the current GradientDescent API. 
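As background on the backtracking rule under discussion, here is a minimal plain-Python sketch of estimating the Lipschitz constant L by testing the standard quadratic upper bound. The names `f`, `grad`, `L0`, and `beta` are illustrative only, not the proposed Spark API, and this is a scalar/NumPy toy, not the TFOCS or PR code:

```python
import numpy as np

def backtracking_step(f, grad, x, L0=1.0, beta=2.0, max_tries=50):
    """One gradient step with a backtracking estimate of the Lipschitz
    constant L: increase L (shrinking the step 1/L) until the quadratic
    upper bound f(y) <= f(x) + <g, y-x> + (L/2)||y-x||^2 holds."""
    L = L0
    g = grad(x)
    for _ in range(max_tries):
        y = x - g / L  # candidate step with step size 1/L
        if f(y) <= f(x) + g @ (y - x) + 0.5 * L * np.sum((y - x) ** 2):
            return y, L  # bound holds: accept the candidate
        L *= beta        # bound failed: increase L and retry
    return y, L

# Example: f(x) = ||x||^2 / 2 has true Lipschitz constant 1, so starting
# from a too-small L0 the loop grows L up to 1.0.
f = lambda v: 0.5 * float(v @ v)
grad = lambda v: v
x, L = backtracking_step(f, grad, np.array([3.0, -4.0]), L0=0.25)
```

Each failed bound test in the distributed setting corresponds to an extra round-trip aggregation, which is why a "more conservative" line search variant is attractive for Spark.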
The accelerated gradient takes a smooth convex function which provides the gradient and optionally the Lipschitz constant. The implementation of Nesterov's method doesn't need to know RDDs. This is good to know. I had been assuming we would stick with the existing GradientDescent API, including the Gradient and Updater delegates. Currently the applySmooth and applyProjector functions (named the same as the corresponding TFOCS functions) serve as a bridge between the acceleration implementation (relatively unaware of RDDs) and Spark specific RDD aggregations. This seems like a good time to mention that the backtracking implementation in TFOCS uses a system of caching the (expensive to compute) linear operator component of the objective function, which significantly reduces the cost of backtracking. A similar implementation is possible in Spark, though the performance benefit may not be as significant because two round trips would still be required per iteration. (See p. 3 of my design doc linked in the JIRA for some more detail.) One reason I suggested not implementing linear operator caching in the design doc is that it's incompatible with the existing Gradient interface. If we are considering an alternative interface it may be worth revisiting this issue. The objective function "interface" used by TFOCS involves the functions applyLinear (the linear operator component of the objective), applySmooth (the smooth portion of the objective), and applyProjector (the nonsmooth portion of the objective). In addition there are a number of numeric and categorical parameters. Theoretically we could adopt a similar interface (with or without applyLinear, depending) where RDD specific operations are encapsulated within the various apply* functions. Finally, I wanted to mention that I'm in the bay area and am happy to meet in person to discuss this project if that would be helpful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
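To make the linear-operator caching idea concrete, here is a hedged Python sketch of the interface shape being described. The method names `apply_linear` and `apply_smooth` mirror the TFOCS functions, but the class, the caching scheme, and the least-squares objective are illustrative simplifications, not the TFOCS or proposed Spark design:

```python
import numpy as np

class SmoothQuadratic:
    """TFOCS-style split of f(x) = ||A x - b||^2 / 2 into a linear operator
    (apply_linear) and a smooth function evaluated at A x (apply_smooth).
    Caching A x lets a backtracking loop evaluate the smooth part at the
    same point repeatedly without recomputing the expensive product; in
    Spark the analogous saved cost is a round-trip RDD aggregation."""

    def __init__(self, A, b):
        self.A, self.b = A, b
        self._cache = {}     # x (as bytes) -> A x
        self.linear_ops = 0  # number of expensive A @ x evaluations

    def apply_linear(self, x):
        key = x.tobytes()
        if key not in self._cache:
            self.linear_ops += 1
            self._cache[key] = self.A @ x
        return self._cache[key]

    def apply_smooth(self, x):
        r = self.apply_linear(x) - self.b  # reuses the cached A x when available
        return 0.5 * float(r @ r)

obj = SmoothQuadratic(np.eye(2), np.array([1.0, 2.0]))
x = np.array([1.0, 0.0])
v1 = obj.apply_smooth(x)
v2 = obj.apply_smooth(x)  # second evaluation hits the cache
```

The point of the design is visible in the counter: repeated smooth evaluations at one point cost only one application of the linear operator.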
[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/4934#discussion_r26186113 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/AcceleratedGradientDescent.scala --- @@ -0,0 +1,237 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.optimization + +import scala.collection.mutable.ArrayBuffer + +import breeze.linalg.{DenseVector => BDV, norm} + +import org.apache.spark.Logging +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +/** + * :: DeveloperApi :: + * This class optimizes a vector of weights via accelerated (proximal) gradient descent. + * The implementation is based on TFOCS [[http://cvxr.com/tfocs]], described in Becker, Candes, and + * Grant 2010. + * @param gradient Delegate that computes the loss function value and gradient for a vector of + * weights. + * @param updater Delegate that updates weights in the direction of a gradient. 
+ */ +@DeveloperApi +class AcceleratedGradientDescent (private var gradient: Gradient, private var updater: Updater) + extends Optimizer { + + private var stepSize: Double = 1.0 + private var convergenceTol: Double = 1e-4 + private var numIterations: Int = 100 + private var regParam: Double = 0.0 + + /** + * Set the initial step size, used for the first step. Default 1.0. + * On subsequent steps, the step size will be adjusted by the acceleration algorithm. + */ + def setStepSize(step: Double): this.type = { +this.stepSize = step +this + } + + /** + * Set the optimization convergence tolerance. Default 1e-4. + * Smaller values will increase accuracy but require additional iterations. + */ + def setConvergenceTol(tol: Double): this.type = { +this.convergenceTol = tol +this + } + + /** + * Set the maximum number of iterations. Default 100. + */ + def setNumIterations(iters: Int): this.type = { +this.numIterations = iters +this + } + + /** + * Set the regularization parameter. Default 0.0. + */ + def setRegParam(regParam: Double): this.type = { +this.regParam = regParam +this + } + + /** + * Set a Gradient delegate for computing the loss function value and gradient. + */ + def setGradient(gradient: Gradient): this.type = { +this.gradient = gradient +this + } + + /** + * Set an Updater delegate for updating weights in the direction of a gradient. + * If regularization is used, the Updater will implement the regularization term's proximity + * operator. Thus the type of regularization penalty is configured by providing a corresponding + * Updater implementation. + */ + def setUpdater(updater: Updater): this.type = { +this.updater = updater +this + } + + /** + * Run accelerated gradient descent on the provided training data. 
+ * @param data training data + * @param initialWeights initial weights + * @return solution vector + */ + def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = { +val (weights, _) = AcceleratedGradientDescent.run( + data, + gradient, + updater, + stepSize, + convergenceTol, + numIterations, + regParam, + initialWeights) +weights + } +} + +/** + * :: DeveloperApi :: + * Top-level method to run accelerated (proximal) gradient descent. + */ +@DeveloperApi +object AcceleratedGradientDescent extends Logging { + /** + * Run accelerated proximal gradient descent. + * The implementation is based on TFOCS [[http://cvxr.com/tfocs]], described in Becker, Candes, + * and Grant 2010
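As background on the algorithm family the docstrings above describe, here is a hedged NumPy sketch of accelerated gradient descent with a fixed step size. It uses the well-known Nesterov/FISTA-style momentum schedule with an identity prox (no regularization) as an illustration of the technique; it is not the PR's Scala code, which follows the Auslender and Teboulle variant from TFOCS:

```python
import numpy as np

def accelerated_gd(grad, x0, step, iters=100, tol=1e-8):
    """Nesterov/FISTA-style accelerated gradient descent, fixed step size."""
    x = y = x0.astype(float)
    t = 1.0
    for _ in range(iters):
        x_new = y - step * grad(y)  # gradient step from the extrapolated point
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)  # momentum extrapolation
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x_new)):
            return x_new  # relative-change convergence test
        x, t = x_new, t_new
    return x

# Minimize f(x) = ||x - c||^2 / 2; the Lipschitz constant is 1, so a fixed
# step of 1.0 (i.e. 1/L) is safe.
c = np.array([1.0, -2.0])
sol = accelerated_gd(lambda v: v - c, np.zeros(2), step=1.0)
```

The extrapolation point `y` is what distinguishes this from plain gradient descent and gives the faster asymptotic convergence rate.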
[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/4934#discussion_r25986928 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/AcceleratedGradientDescent.scala --- @@ -0,0 +1,237 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.optimization + +import scala.collection.mutable.ArrayBuffer + +import breeze.linalg.{DenseVector => BDV, norm} + +import org.apache.spark.Logging +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +/** + * :: DeveloperApi :: + * This class optimizes a vector of weights via accelerated (proximal) gradient descent. + * The implementation is based on TFOCS [[http://cvxr.com/tfocs]], described in Becker, Candes, and + * Grant 2010. + * @param gradient Delegate that computes the loss function value and gradient for a vector of + * weights. + * @param updater Delegate that updates weights in the direction of a gradient. 
+ */ +@DeveloperApi +class AcceleratedGradientDescent (private var gradient: Gradient, private var updater: Updater) + extends Optimizer { + + private var stepSize: Double = 1.0 + private var convergenceTol: Double = 1e-4 + private var numIterations: Int = 100 + private var regParam: Double = 0.0 + + /** + * Set the initial step size, used for the first step. Default 1.0. + * On subsequent steps, the step size will be adjusted by the acceleration algorithm. + */ + def setStepSize(step: Double): this.type = { --- End diff -- @mengxr Thanks for taking a look. I was advised by Reza Zadeh to implement a version without line search, at least for the initial implementation. Please see the discussion here: https://issues.apache.org/jira/browse/SPARK-1503?focusedCommentId=14225295, and in the following comments. I also attached some optimization benchmarks to the JIRA, which include performance of both backtracking line search and non-line-search implementations. Regarding your point that it's hard to choose a proper stepSize: I can attest that, anecdotally, acceleration seems somewhat more prone to diverging at a nominal stepSize than the existing gradient descent.
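The step-size sensitivity mentioned above has a simple analogue even for plain gradient descent, sketched here as an illustrative NumPy-free toy (not the PR's code): for f(x) = x²/2 with Lipschitz constant L = 1, a fixed step above 2/L makes the iterates grow geometrically, while a smaller step contracts them.

```python
def iterate(step, x0=1.0, n=20):
    # Plain gradient descent on f(x) = x^2 / 2; the gradient is x, so each
    # update is x <- (1 - step) * x and the iterates scale by |1 - step|.
    x = x0
    for _ in range(n):
        x = x - step * x
    return x

small = iterate(0.5)  # |1 - 0.5| = 0.5 < 1: converges toward 0
large = iterate(2.5)  # |1 - 2.5| = 1.5 > 1: diverges
```

Accelerated methods have an analogous stability threshold tied to 1/L, which is what makes an automatic Lipschitz estimate (line search) attractive.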
[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/4934 [SPARK-1503][MLLIB] Initial AcceleratedGradientDescent implementation. An implementation of accelerated gradient descent, a first order optimization algorithm with faster asymptotic convergence than standard gradient descent. Design discussion and benchmark results at https://issues.apache.org/jira/browse/SPARK-1503 You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-1503 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4934.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4934 commit a121bd0f6e5f2387bd502976940c08f6a4a2e4b1 Author: Aaron Staple aaron.sta...@gmail.com Date: 2015-02-13T22:24:07Z [SPARK-1503][MLLIB] Initial AcceleratedGradientDescent implementation.
[GitHub] spark pull request: [SPARK-1655][MLLIB] WIP Add option for distrib...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-76874870 @srowen I think this stalled because I was anticipating some additional feedback on the scala implementation before adding python compatibility. But looking things over I think I should just go ahead and add the rest of the implementation to move from WIP to formal PR. And I will have time to do that in the near future, so let's keep this PR open for now please. Thanks for the ping!
[GitHub] spark pull request: [SPARK-1655][MLLIB] WIP Add option for distrib...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-58237889 @davies, sure, changed the title
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57919277 @mengxr Ok, updated to address your suggestions.
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57920855 Again, python tests failed because the python interface is disabled in order to focus on the scala implementation first.
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57824188 @mengxr - I'm sure you have a lot on your plate right now, but I wanted to check in on this PR. Overall, how do you feel about this approach for SPARK-1655? I'm happy to implement it differently if you'd prefer.
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57855256 @mengxr Sorry about that, in the future I'll follow the best practice you've outlined. Here are the take-aways from my perspective: - Investigate use of sparse storage for the conditional distribution. I believe the existing implementation in master uses dense conditional distribution matrices, but sparse is obviously possible. - Remove grouping of conditional probabilities, as it adds complexity and you mentioned you aren't sure if it will help performance. - Add support for predictValues with consistent partitioning. I'll look into all of these. Thanks for your feedback!
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57867834 @mengxr Sorry I misunderstood your comment on that first point. I'll just do the 2nd and 3rd.
[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2412#issuecomment-56865408 @davies It looks like in your #2378 you already disabled caching for NaiveBayes and DecisionTree. The only difference from this patch is that I disabled caching for ALS as well. We discussed this a bit here: https://github.com/apache/spark/pull/2378#discussion_r17686208. I filed this ticket as a follow up of the work on uncached input warnings (https://github.com/apache/spark/pull/2347). The warnings are only supposed to be printed if the input data is accessed repeatedly on many iterations during learning. That's not the case with ALS, so a warning shouldn't be printed there. But I can see there's a case for caching because the input data is accessed twice when constructing an intermediate representation of the data. I don't have a strong preference on whether we should or should not cache in python for the ALS learner. If you are fine with continuing to cache in python for ALS, then there's no more work to be done for this ticket, SPARK-3550.
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-56876698 Hi, I addressed the recent review comments and merged.
[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2412#issuecomment-56887646 @davies, sure will do
[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...
Github user staple closed the pull request at: https://github.com/apache/spark/pull/2412
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
Github user staple closed the pull request at: https://github.com/apache/spark/pull/2362
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-56903523 Great, thanks. My username is 'staple', looks like you already assigned to me.
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2491#discussion_r17866654 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala --- @@ -141,7 +214,41 @@ class NaiveBayes private (private var lambda: Double) extends Serializable with i += 1 } -new NaiveBayesModel(labels, pi, theta) +new LocalNaiveBayesModel(labels, pi, theta) + } + + private def trainDistModel(labelAggregates: RDD[(Double, (Long, BDV[Double]))]) = { +case class LabelAggregate(label: Double, numDocuments: Long, sumFeatures: BDV[Double]) +val aggregated = labelAggregates.map(x => LabelAggregate(x._1, x._2._1, x._2._2)) + +// Compute the model's prior (pi) vector and conditional (theta) matrix for each batch of +// labels. +// NOTE In contrast to the local trainer, the piLogDenom normalization term is omitted here. +// Computing this term requires an additional aggregation on 'aggregated', and because the +// term is an additive constant it does not affect maximum a posteriori model prediction. +val modelBlocks = aggregated.mapPartitions(p => p.grouped(100).map { batch => --- End diff -- The batch size of 100 here is just an arbitrary value right now.
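For reference, the `grouped(100)` call in the diff batches a partition's iterator into fixed-size chunks; the equivalent behavior in plain Python looks like this (illustrative sketch, not Spark or Scala library code):

```python
def grouped(it, size):
    """Pure-Python analogue of Scala's Iterator.grouped: yield successive
    batches of at most `size` elements; the final batch may be shorter.
    (The 100 used in the diff is an arbitrary batch size.)"""
    batch = []
    for item in it:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the trailing partial batch
        yield batch

sizes = [len(b) for b in grouped(range(250), 100)]
```

Batching like this trades per-block overhead (one pi vector and theta matrix per batch) against the parallelism of having many small blocks.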
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-56414916 Hi - Does this seem like a reasonable approach for SPARK-1655?
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2491#discussion_r17866616 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -232,11 +232,11 @@ class PythonMLLibAPI extends Serializable { def trainNaiveBayes( data: JavaRDD[LabeledPoint], lambda: Double): java.util.List[java.lang.Object] = { -val model = NaiveBayes.train(data.rdd, lambda) +// val model = NaiveBayes.train(data.rdd, lambda, "local") --- End diff -- I disabled the python interface for now. Let's figure out the scala implementation first.
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/2491 [SPARK-1655][MLLIB] Add option for distributed naive bayes model. Adds an option to store a naive Bayes model in distributed form. The default behavior, in which the whole model is stored on the driver node, remains unchanged. NaiveBayes.train's new distMode parameter can be used to request that a model be distributed. When distributed, the model is stored as an RDD of model blocks. Each block contains the labels and prior and conditional probabilities for a set of label classes, allowing fast computation of the maximum a posteriori prediction for each block and straightforward aggregation of these MAP predictions across blocks. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-1655 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2491.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2491 commit 4594761dd035d2d01b91fb36a9029bda9f34c4a1 Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-09-22T05:02:28Z [SPARK-1655][MLLIB] Add option for distributed naive bayes model.
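The per-block MAP computation and cross-block aggregation described above can be sketched in toy Python (NumPy, log-probabilities; the function name and block layout are illustrative, not the PR's API):

```python
import numpy as np

def predict(blocks, features):
    """Each block holds (labels, log_pi, log_theta) for a subset of classes.
    Take the best-scoring class within each block, then the max across
    blocks -- the same shape as a per-partition MAP step followed by a
    reduce over an RDD of model blocks."""
    best_score, best_label = -np.inf, None
    for labels, log_pi, log_theta in blocks:
        scores = log_pi + log_theta @ features  # one posterior score per class
        i = int(np.argmax(scores))
        if scores[i] > best_score:
            best_score, best_label = float(scores[i]), labels[i]
    return best_label

# Two toy blocks covering classes {0, 1} and {2}.
blocks = [
    ([0, 1], np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])),
    ([2],    np.array([0.5]),      np.array([[0.5, 0.5]])),
]
label = predict(blocks, np.array([2.0, 0.0]))
```

Because the argmax distributes over the partition of classes into blocks, omitting an additive constant like piLogDenom from every score leaves the prediction unchanged, as the diff comment in the thread notes.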
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-56425754 Hi - The QA tests failed in python because I disabled the naive bayes python api in order to focus on approval of the scala implementation first. (I mentioned this in a comment above as well.)
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17686208 --- Diff: python/pyspark/mllib/recommendation.py --- @@ -54,34 +64,51 @@ def __del__(self): def predict(self, user, product): return self._java_model.predict(user, product) -def predictAll(self, usersProducts): -usersProductsJRDD = _get_unmangled_rdd(usersProducts, _serialize_tuple) -return RDD(self._java_model.predict(usersProductsJRDD._jrdd), - self._context, RatingDeserializer()) +def predictAll(self, user_product): +assert isinstance(user_product, RDD), "user_product should be RDD of (user, product)" +sc = self._context +tuplerdd = sc._jvm.SerDe.asTupleRDD(user_product._to_java_object_rdd().rdd()) +jresult = self._java_model.predict(tuplerdd).toJavaRDD() +return RDD(sc._jvm.PythonRDD.javaToPython(jresult), sc, + AutoBatchedSerializer(PickleSerializer())) class ALS(object): @classmethod +def _prepare(cls, ratings): +assert isinstance(ratings, RDD), "ratings should be RDD" +first = ratings.first() +if not isinstance(first, Rating): +if isinstance(first, (tuple, list)): +ratings = ratings.map(lambda x: Rating(*x)) +else: +raise ValueError("rating should be RDD of Rating or tuple/list") +# serialize them by AutoBatchedSerializer before cache to reduce the +# objects overhead in JVM +cached = ratings._reserialize(AutoBatchedSerializer(PickleSerializer())).cache() --- End diff -- Hi, just wanted to check on your decision to cache for ALS. It looks like in ALS the makeLinkRDDs calls handle persistence for a transformation of the input data. Though there are two calls to makeLinkRDDs, so perhaps two reads of the input data. Are those two reads the reason for caching here?
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17687682

--- Diff: python/pyspark/mllib/recommendation.py ---
@@ -54,34 +64,51 @@ def __del__(self):

     def predict(self, user, product):
         return self._java_model.predict(user, product)

-    def predictAll(self, usersProducts):
-        usersProductsJRDD = _get_unmangled_rdd(usersProducts, _serialize_tuple)
-        return RDD(self._java_model.predict(usersProductsJRDD._jrdd),
-                   self._context, RatingDeserializer())
+    def predictAll(self, user_product):
+        assert isinstance(user_product, RDD), "user_product should be RDD of (user, product)"
+        sc = self._context
+        tuplerdd = sc._jvm.SerDe.asTupleRDD(user_product._to_java_object_rdd().rdd())
+        jresult = self._java_model.predict(tuplerdd).toJavaRDD()
+        return RDD(sc._jvm.PythonRDD.javaToPython(jresult), sc,
+                   AutoBatchedSerializer(PickleSerializer()))

 class ALS(object):

     @classmethod
+    def _prepare(cls, ratings):
+        assert isinstance(ratings, RDD), "ratings should be RDD"
+        first = ratings.first()
+        if not isinstance(first, Rating):
+            if isinstance(first, (tuple, list)):
+                ratings = ratings.map(lambda x: Rating(*x))
+            else:
+                raise ValueError("rating should be RDD of Rating or tuple/list")
+        # serialize them by AutoBatchedSerializer before cache to reduce the
+        # objects overhead in JVM
+        cached = ratings._reserialize(AutoBatchedSerializer(PickleSerializer())).cache()
--- End diff --

FWIW, it looks like the DecisionTree learner can do multiple reads (of varying types) of its input. In DecisionTree.train, DecisionTreeMetadata.buildMetadata does a count() on the input, DecisionTree.findSplitsBins can do a sampled read, and then TreePoint.convertToTreeRDD will do a mapped read that gets persisted. I'm not knowledgeable enough to know how expensive that initial count() will be without further investigation. But I think for DecisionTree the suggestion was not to cache before learning.
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-55747592 Hi, per the discussion in https://github.com/apache/spark/pull/2362 the plan is to continue caching before deserialization from python rather than after, in order to minimize the cached rdd memory footprint. This means that, without further work, warning messages will be logged for every python mllib regression and kmeans run. I added a patch that suppresses these warning messages during python runs in a way that I think is fairly unobtrusive. Please let me know what you think.
[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/2412 [SPARK-3550][MLLIB] Disable automatic rdd caching for relevant learners. The NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning iterations. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-3550 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2412.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2412 commit c8ff120945da4c1aa2d0c9ba81fbed79de6cab66 Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-09-15T20:22:27Z [SPARK-3550][MLLIB] Disable automatic rdd caching for relevant learners.
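The internal-persistence claim above can be illustrated with a toy sketch (plain Python; `ToyRDD` and `train` are invented stand-ins, not Spark APIs): when the learner persists its own transformation of the input, the raw input is evaluated only once no matter how many iterations run, so an external cache() on it adds nothing.

```python
# Toy model of RDD evaluation (assumption: invented stand-ins, not Spark code).
# `evaluations` counts how many times the underlying data is recomputed.

class ToyRDD:
    def __init__(self, compute):
        self.compute = compute      # recomputation models lazy re-evaluation
        self.evaluations = 0
        self._cached = None

    def collect(self):
        if self._cached is not None:
            return self._cached     # served from the "persisted" copy
        self.evaluations += 1
        return self.compute()

    def cache(self):
        self.evaluations += 1
        self._cached = self.compute()
        return self

def train(raw, iterations=5):
    # Like ALS / DecisionTree: persist a *transformation* of the input
    # internally, then iterate only over the persisted data.
    prepared = ToyRDD(lambda: [x * 2 for x in raw.collect()]).cache()
    model = 0.0
    for _ in range(iterations):
        model += sum(prepared.collect())
    return model

raw = ToyRDD(lambda: [1, 2, 3])
model = train(raw)
# The raw input was evaluated exactly once, with no caller-side cache().
assert raw.evaluations == 1
```

The same reasoning is why NaiveBayes needs no caching at all in this scheme: a learner that reads its input a single time never re-triggers the input's lineage.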
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55762293 @davies I created a separate PR for disabling automatic caching for some learners: https://github.com/apache/spark/pull/2412
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55793986 Great! Thanks
[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2385#discussion_r17528982

--- Diff: python/pyspark/traceback_utils.py ---
@@ -0,0 +1,80 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from collections import namedtuple
+import os
+import traceback
+
+
+__all__ = ["extract_concise_traceback", "SparkContext"]
--- End diff --

Looks like I also need to put JavaStackTrace here instead of SparkContext.
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55636095 @davies understood, thanks for the feedback. It sounds like for now the preference is to continue caching the python serialized version because the reduced memory footprint is currently worth the cpu cost of repeated deserialization. Would it make sense to preserve the portions of this patch that drop caching for the NaiveBayes, ALS, and DecisionTree learners, which I do not believe require external caching to prevent repeated RDD re-evaluation during learning? NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs.
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55681425 @mengxr I ran for 100 iterations. Loaded data from disk using python's SparkContext.pickleFile() (disk is ssd). I did not do any manual caching. For more details, you can also see the test script I included in my description above. I also saved the logs from my test runs if those are helpful to see. During the 10m record run I saw many log messages about 'CacheManager: Not enough space to cache partition' which I interpreted as indicating lack of caching due to memory exhaustion. But I haven't diagnosed the slowdown beyond that.
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55681746 Just to make sure I understand, are you saying you don't expect to see the measured performance hit with this PR? If so I can investigate further.
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55682979 For the PR code, it looks like on each training iteration there are messages about not being able to cache partitions rdd_4_5 - rdd_4_27. For the master code, it looks like on each training iteration there are messages about not being able to cache partitions rdd_3_13, rdd_3_15 - rdd_3_27. It looks to me like a greater proportion of the data can be cached in master, and the set of cached partitions seems consistent across all training iterations.
[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2385#issuecomment-55687608 Great! Thanks to all the reviewers.
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55693317 Ok, just merged with master.
[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2385#issuecomment-55526943 Hi, the above failure in NetworkReceiverSuite.scala seems like it may be unrelated to this patch. The test also passed when I ran it locally.
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55539838 I ran a simple logistic regression performance test on my local machine (ubuntu desktop w/ 8gb ram). I used two data sizes: 2m records, which was not memory constrained, and 10m records which was memory constrained (generating log messages such as `CacheManager: Not enough space to cache partition`). I tested without this patch, with this patch, and with a modified version of this patch using `MEMORY_ONLY_SER` to persist the deserialized objects. Here are the results (each reported runtime is the mean of 3 runs):

2m records:
  master             | 47.9099563758
  w/ patch           | 32.1143682798
  w/ MEMORY_ONLY_SER | 79.4589416981

10m records:
  master             | 2130.3178509871
  w/ patch           | 3232.856136322
  w/ MEMORY_ONLY_SER | 2772.3923886617

It looks like, running in memory, this patch provides a 33% speed improvement, while the `MEMORY_ONLY_SER` version is 66% slower than master. In the test case with insufficient memory to keep all the `cache()`-ed training rdd partitions cached at once, this patch is 52% slower while `MEMORY_ONLY_SER` is 30% slower. I'm not that familiar with the typical mllib memory profile. Do you think the in-memory result here would be similar to a real world run? Finally, here is the test script. Let me know if it seems reasonable. The data generation was roughly inspired by your mllib perf test in spark-perf.
Data generation:

import random
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

class NormalGenerator:
    def __init__(self):
        self.mu = random.random()
        self.sigma = random.random()

    def __call__(self, rnd):
        return rnd.normalvariate(self.mu, self.sigma)

class PointGenerator:
    def __init__(self):
        self.generators = [[NormalGenerator() for _ in range(5)] for _ in range(2)]

    def __call__(self, rnd):
        label = rnd.choice([0, 1])
        return LabeledPoint(float(label), [g(rnd) for g in self.generators[label]])

pointGenerator = PointGenerator()
sc = SparkContext()

def generatePoints(n):
    def generateData(index):
        rnd = random.Random(hash(str(index)))
        for _ in range(n / 10):
            yield pointGenerator(rnd)
    points = sc.parallelize(range(10), 10).flatMap(generateData)
    print points.count()
    points.saveAsPickleFile('logistic%.0e' % n)

generatePoints(int(2e6))
generatePoints(int(1e7))

Test:

import time
import sys
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext()
points = sc.pickleFile(sys.argv[1])
start = time.time()
model = LogisticRegressionWithSGD.train(points, 100)
print 'Runtime: ' + `(time.time() - start)`
print model.weights
[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/2385 [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. Also made minor cleanups to JavaStackTrace. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-1087 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2385.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2385 commit 10ba6ec834dd8bb0f5a6630cd9315e9f9efb Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-09-14T04:13:22Z [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55516200 Thanks for taking a look, guys. Hmm, it looks like the duplicate __all__ variables result from a recent merge. I went ahead and created a separate PR for SPARK-1087, to put the traceback code in its own file: https://github.com/apache/spark/pull/2385. Once that's merged to master I'll circle back to finish up this PR, making the include changes you requested.
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55303441 Hi, I implemented this per discussion here https://github.com/apache/spark/pull/2347#issuecomment-55181535, assuming I understood the comment correctly. The context is that we are supposed to log a warning when running an iterative learning algorithm on an uncached rdd. What originally led me to identify SPARK-3488 is that if the deserialized python rdds are always uncached, a warning will always be logged. Obviously a meaningful performance difference would trump the implementation of this warning message, and I haven't measured performance - just discussed options in the above referenced pull request. But by way of comparison, is there any significant difference in memory pressure between caching a LabeledPoint rdd deserialized from python and caching a LabeledPoint rdd created natively in scala (which is the typical use case with a scala rather than python client)? If I should do some performance testing, are there any examples of tests and infrastructure you'd suggest as a starting point? 'none' means the rdd is not cached within the python to scala mllib interface, where previously it was cached. The learning algorithms for which rdds are no longer cached implement their own caching internally.
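On the memory-pressure question, here is a rough, self-contained illustration (plain Python and stdlib pickle only, standing in for pyspark's serializers; absolute numbers vary by platform) of why caching the python-serialized batches tends to have a smaller footprint than caching one deserialized Python object per row:

```python
# Rough footprint comparison (assumption: stdlib pickle as a stand-in for
# pyspark's PickleSerializer; sizes are illustrative, not MLlib measurements).
import pickle
import sys

# 1000 toy rows shaped like (label, features).
rows = [(float(i), [float(i), float(i) + 1.0]) for i in range(1000)]

# "Deserialized" cache: one Python object per row. getsizeof here counts only
# the outer tuples, so this actually underestimates the real footprint.
deserialized_bytes = sum(sys.getsizeof(r) for r in rows)

# "Serialized" cache: a single pickled batch amortizes per-object overhead.
serialized_bytes = sys.getsizeof(pickle.dumps(rows))

assert serialized_bytes < deserialized_bytes
```

Caching the serialized form trades this smaller footprint for the repeated per-iteration deserialization cost discussed in this thread.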
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55307274 @JoshRosen Great, thanks for your help!
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55227656 Thanks! I'm not getting any more NPEs now. I went ahead and merged.
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55283908 Hi, the failing test was CheckpointSuite / 'recovery with file input stream'. This test passed when I ran the tests locally, and it sometimes fails spuriously according to this ticket: 'flaky test case in streaming.CheckpointSuite' https://issues.apache.org/jira/browse/SPARK-1600
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55283980 I'm assuming I don't have permission to ask jenkins to run a test myself, right?
[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/2362 [SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners. When running an iterative learning algorithm, it makes sense that the input RDD be cached for improved performance. When learning is applied to a python RDD, previously the python RDD was always cached, then in scala that cached RDD was mapped to an uncached deserialized RDD, and the uncached RDD was passed to the learning algorithm. Since the RDD with deserialized data was uncached, learning algorithms would implicitly deserialize the same data repeatedly, on every iteration. This patch moves RDD caching after deserialization for learning algorithms that should be called with a cached RDD. For algorithms that implement their own caching internally, the input RDD is no longer cached. Below I've listed the different learning routines accessible from python, the location where caching was previously enabled, and the location (if any) where caching is now enabled by this patch.
LogisticRegressionWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

SVMWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

NaiveBayes:
  was: python (in _get_unmangled_labeled_point_rdd)
  now: none

KMeans:
  was: python (in _get_unmangled_double_vector_rdd)
  now: jvm (trainKMeansModel)

ALS:
  was: python (in _get_unmangled_rdd)
  now: none

LinearRegressionWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

LassoWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

RidgeRegressionWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

DecisionTree:
  was: python (in _get_unmangled_labeled_point_rdd)
  now: none

You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-3488 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2362.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2362 commit 7042ebc2214f13c7d5d5acd28fcfa0478c1ddf2c Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-09-11T05:11:11Z [SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners.
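The before/after difference described in this PR boils down to where the deserialization cost is paid. A self-contained sketch (plain Python with pickle standing in for the python RDD serialization; `train` and `deserialize` are invented stand-ins, not Spark APIs):

```python
# Sketch (assumption: plain-Python stand-ins, not Spark code) of caching
# before vs after deserialization. Caching the serialized rows means each
# training iteration deserializes every row again; caching afterward pays
# the deserialization cost once.
import pickle

serialized = [pickle.dumps((float(i), [float(i)])) for i in range(100)]
deserializations = 0

def deserialize(row):
    global deserializations
    deserializations += 1
    return pickle.loads(row)

def train(get_points, iterations=10):
    model = 0.0
    for _ in range(iterations):                  # iterative learner
        model += sum(label for label, _ in get_points())
    return model

# Previously: the serialized rows are what is cached, so every iteration
# re-deserializes all 100 rows.
train(lambda: (deserialize(r) for r in serialized))
assert deserializations == 100 * 10

# With the patch: deserialize once, cache the deserialized points.
deserializations = 0
cached = [deserialize(r) for r in serialized]
train(lambda: cached)
assert deserializations == 100
```

The "now: none" entries avoid even the second cost, because those learners persist a transformation of the input internally.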
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2347#discussion_r17430431

--- Diff: docs/mllib-linear-methods.md ---
@@ -470,7 +471,7 @@ public class LinearRegression {
       }
     }
   );
-  JavaRDD<Object> MSE = new JavaDoubleRDD(valuesAndPreds.map(
+  double MSE = new JavaDoubleRDD(valuesAndPreds.map(
--- End diff --

:) Thanks
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-55288257 Hi, I made the requested comment changes. I also filed a separate PR for the caching changes: #2362
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/2347 [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data. Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm's current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix's computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally so do not require a warning. I added a warning to GeneralizedLinearAlgorithm rather than inside its optimizers themselves, where the iteration actually occurs, because internally GeneralizedLinearAlgorithm maps its input data to an uncached RDD before passing it to an optimizer. (In other words, the warning would be printed for every GeneralizedLinearAlgorithm run, regardless of whether its input is cached, if the warning were in GradientDescent or another optimizer.) I assume that use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and that the mapping there (adding label, intercepts and scaling) is a lightweight operation. Arguably a user calling an optimizer such as GradientDescent will be knowledgeable enough to cache their data without needing a log warning, so lack of a warning in the optimizers may be ok. This patch causes all calls to GeneralizedLinearAlgorithm from Python to print a warning, because the implementation in PythonMLLibAPI.trainRegressionModel deserializes the data from python using map(SerDe.deserializeLabeledPoint) to create a deserialized RDD without caching this new RDD. This means that deserialization must occur on every training iteration for RDDs originating in Python.
Perhaps the python cache() call from _regression_train_wrapper / _get_unmangled_labeled_point_rdd should be moved to after deserialization instead of before serialization. There is a similar issue in KMeans. Some of the documentation examples making use of these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-1484 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2347.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2347 commit 7b31102b3ad68e821a21a31ab3e49fe069c98e9e Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-09-10T14:18:17Z Minor doc example fixes. commit bc90b68094c32678aa41fd65756105f9d3dd414b Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-09-10T14:19:58Z [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
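The proposed check can be sketched in a few lines. This is a hedged, pure-Python illustration only: the StorageLevel and FakeRDD classes below are hypothetical stand-ins, not the actual MLlib or PySpark classes, and the warning text is paraphrased.

```python
# Illustrative stand-ins (hypothetical, not the real Spark classes).
class StorageLevel:
    NONE = "NONE"
    MEMORY_ONLY = "MEMORY_ONLY"

class FakeRDD:
    def __init__(self, level=StorageLevel.NONE):
        self._level = level
    def get_storage_level(self):
        return self._level

warnings_log = []

def warn_if_uncached(rdd, algorithm):
    """Model of the proposed check: warn once, before iterating,
    when the training input has no storage level set."""
    if rdd.get_storage_level() == StorageLevel.NONE:
        warnings_log.append(
            algorithm + ": the input data is not cached, which may hurt "
            "performance for an iterative algorithm.")

warn_if_uncached(FakeRDD(), "KMeans")                             # warns
warn_if_uncached(FakeRDD(StorageLevel.MEMORY_ONLY), "KMeans")     # silent
print(len(warnings_log))  # 1
```

Placing this check in GeneralizedLinearAlgorithm itself, before its internal re-mapping of the input, is what lets it reflect the caller's actual caching decision.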
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-55138304 See above where I describe how, for python RDDs, the input data is automatically cached and then deserialized via a map to an uncached RDD, requiring deserialization of every row for every training iteration. Would it make sense to change this to cache after deserializing instead of before? If so I can file a new ticket and PR.
[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1706#issuecomment-55158265 Ok, I fixed the test failure (I forgot to update a commit after a merge) and addressed the review comments.
[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-55162937 Sure, I changed the warning message text as you suggested. Do you think the deserialization mapping in the python RDDs I described is ok (a lightweight operation)? If so, I imagine it would be a problem for the warning message to always be printed when Python is used.
[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1706#issuecomment-55176205 Hi, it looks like the above failure came from upstream, not this patch, fixed here: [HOTFIX] Fix scala style issue introduced by #2276 https://github.com/apache/spark/commit/26503fdf20f4181a2b390c88b83f364e6a4ccc21
[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1706#issuecomment-55197092 I fixed a compilation error that arose from a merge issue with an upstream commit from earlier today, one that did not surface as a conflict.
[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1706#issuecomment-55003039 The primary master change resulting in a merge conflict with this patch was the addition of a call to the new ExtractPythonUdfs rule in HiveContext. For now I decided to leave the explicit call to the ExtractPythonUdfs rule in place, rather than incorporate it into the Analyzer along with the CreateTables and PreInsertionCasts rules I moved there. I am not aware that ExtractPythonUdfs can affect query plan resolution, and I haven't checked what the consequences would be if the ExtractPythonUdfs rule were called multiple times (on multiple PythonUDFs), which could happen if the rule were added to an Analyzer batch.
[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1706#issuecomment-55003190 For #1846, either before or after works for me.
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55003590 Great! I'll go ahead and merge once #2323 is in.
[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/1706#discussion_r17340821 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -40,7 +40,12 @@ class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Bool // TODO: pass this in as a parameter. val fixedPoint = FixedPoint(100) - val batches: Seq[Batch] = Seq( + /** + * Override to provide additional rules for the Resolution batch. + */ + val extendedRules: List[Rule[LogicalPlan]] = Nil --- End diff -- I'd made it a List in order to use the triple colon concat operator in Analyzer for consistency with the existing code that uses double colon, but sure I can change to Seq instead since that's preferred.
[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1706#issuecomment-55066004 Ok, sure I'll move ExtractPythonUdfs as you suggest. Thanks for taking a look!
[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1395#issuecomment-54856229 Sure, I fixed the new python style issue.
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-51014567 Hi folks, I've merged with the most recent code (pushed to my branch), but for some reason with the most recent merge I am getting NPEs in Kryo for schemas containing array data type fields in the sql.py tests. I'm away from home, with no real dev system and spotty internet access until Thursday, so unfortunately I think it's impractical to diagnose the problem until then. Sorry for the delay.
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50981508 Sorry, I'm away from home and had limited time / access to try and do the merge last night - which I didn't finish, and as you mentioned messed up the included commits. I'll post an explicit comment here when the merge is ready.
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1362#issuecomment-50981523 Great, thanks!
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50722658 Sure, no problem.
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1362#issuecomment-50831557 I fixed the recent merge conflicts.
[GitHub] spark pull request: [SPARK-2781] Check resolution of LogicalPlans ...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/1706 [SPARK-2781] Check resolution of LogicalPlans in Analyzer. LogicalPlan contains a "resolved" attribute indicating that all of its execution requirements have been resolved. This attribute is not checked before query execution. The analyzer contains a step to check that all Expressions are resolved, but this is not equivalent to checking all LogicalPlans. In particular, the Union plan's implementation of "resolved" verifies that the types of its children's columns are compatible. Because the analyzer does not check that a Union plan is resolved, it is possible to execute a Union plan that outputs different types in the same column. See SPARK-XXX for an example. This patch adds two checks to the analyzer's CheckResolution rule. First, each logical plan is checked to see if it is not resolved despite its children being resolved. This allows the "problem" unresolved plan to be included in the TreeNodeException for reporting. Then, as a backstop, the root plan is checked to see if it is resolved, which recursively checks that the entire plan tree is resolved. Note that the resolved attribute is implemented recursively, and this patch also explicitly checks the resolved attribute on each logical plan in the tree. I assume the query plan trees will not be large enough for this redundant checking to meaningfully impact performance. Because this patch starts validating that LogicalPlans are resolved before execution, I had to fix some cases where unresolved plans were passing through the analyzer as part of the implementation of the hive query system. In particular, HiveContext applies the CreateTables and PreInsertionCasts rules manually after the optimizer runs, meaning query plans are not resolved until after the optimizer. I moved these rules to the analyzer stage (for hive queries only), in the process completing a code TODO indicating the rules should be moved to the analyzer.
It's worth noting that moving the CreateTables rule means introducing an analyzer rule with a significant side effect - in this case the side effect is creating a hive table. The rule will only attempt to create a table once even if its batch is executed multiple times, because it converts the InsertIntoCreatedTable plan it matches against into an InsertIntoTable. Additionally, these hive rules must be added to the Resolution batch rather than as a separate batch because hive rules may be needed to resolve non-root nodes, leaving the root to be resolved on a subsequent batch iteration. For example, the hive compatibility test auto_smb_mapjoin_14, and others, make use of a query plan where the root is a Union and its children are each a hive InsertIntoTable. Mixing the custom hive rules with standard analyzer rules initially resulted in an additional failure because of different policies between spark sql and hive when casting a boolean to a string. Hive casts booleans to strings as "true" / "false" while spark sql casts booleans to strings as "1" / "0" (causing the cast1.q test to fail). This behavior is a result of the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at the implementation of BooleanCasts I think converting to "1" / "0" is potentially a programming mistake. (If the BooleanCasts rule is disabled, casting produces "true" / "false" instead.) I believe "true" / "false" should be the behavior for spark sql - I changed the behavior so bools are converted to "true" / "false" to be consistent with hive, and none of the existing spark tests failed. Finally, in some initial testing with hive it appears that an implicit type coercion of boolean to string results in a lowercase string, e.g. CONCAT( TRUE, '' ) -> "true", while an explicit cast produces an all caps string, e.g. CAST( TRUE AS STRING ) -> "TRUE". The change I've made just converts to lowercase strings in all cases.
I believe it is at least more correct than the existing spark sql implementation, where all Cast expressions become "1" / "0". You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-2781 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1706.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1706 commit 80a1136d58d719f9f029858a42e068553be87eb0 Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-31T20:42:32Z [SPARK-2781] Check resolution of LogicalPlans in Analyzer.
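The two-part check described above (report the lowest plan that is unresolved even though all of its children are resolved, then verify the root recursively) can be sketched with a toy plan tree. This is a hedged, pure-Python model: the Plan class is a hypothetical stand-in for Catalyst's LogicalPlan, not the actual Scala implementation.

```python
class Plan:
    """Hypothetical stand-in for Catalyst's LogicalPlan."""
    def __init__(self, name, children=(), self_resolved=True):
        self.name = name
        self.children = list(children)
        self._self_resolved = self_resolved

    @property
    def resolved(self):
        # 'resolved' is recursive: a plan is resolved only when its own
        # requirements are met and all of its children are resolved.
        return self._self_resolved and all(c.resolved for c in self.children)

def check_resolution(plan):
    """Raise on the lowest plan that is unresolved even though all of
    its children are resolved -- the 'problem' node for error reporting."""
    for child in plan.children:
        check_resolution(child)
    if not plan.resolved and all(c.resolved for c in plan.children):
        raise ValueError("unresolved plan: " + plan.name)

check_resolution(Plan("Project", [Plan("Relation")]))  # fully resolved: passes

# A Union whose children output incompatible column types reports itself
# as unresolved even though both children are resolved.
try:
    check_resolution(Plan("Union", [Plan("a"), Plan("b")], self_resolved=False))
    error = None
except ValueError as e:
    error = str(e)
print(error)  # unresolved plan: Union
```

The final backstop check on the root (`root.resolved`) then recursively covers any case the per-node check missed.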
[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1706#issuecomment-50849916 Sure, fixed it.
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50577338 Sure, I'm fine with reworking based on other changes (it seems that some merge conflicts have already cropped up in master since I submitted my PR last week). I think my change set is a little simpler than the one you linked to, so would it make sense for me to wait until that one goes in? I also thought I'd add a couple of notes on what I had in mind with this patch: 1) I added a new Row serialization pathway between python and java, based on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn't overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to python using the existing serialization code. To me it made sense to ship the Array[Row] over to python directly instead of converting it back to an RDD[Row] just for the purpose of sending the Rows to python using the existing serialization code. But let me know if you have any thoughts about this. 2) I moved JavaStackTrace from rdd.py to context.py. This made sense to me since JavaStackTrace is all about configuring a context attribute, and the _extract_concise_traceback function it depends on was already being called separately from context.py (as a "private" function of rdd.py).
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50577874 Sure, added the notes.
[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/1395#discussion_r15609908 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala --- @@ -659,6 +659,52 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, V)])( } /** + * Return a new DStream by applying 'full outer join' between RDDs of `this` DStream and + * `other` DStream. Hash partitioning is used to generate the RDDs with Spark's default + * number of partitions. + */ + def fullOuterJoin[W](other: JavaPairDStream[K, W]) + : JavaPairDStream[K, (Optional[V], Optional[W])] = { --- End diff -- It's slightly over the limit at 103 characters
[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1395#issuecomment-50675424 Thanks for taking a look. The example provided in the descriptive comment for rdd.py's fullOuterJoin implementation will be executed and checked as part of the python test suite. Any other concerns?
[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/1395#discussion_r15614024 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala --- @@ -659,6 +659,52 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, V)])( } /** + * Return a new DStream by applying 'full outer join' between RDDs of `this` DStream and + * `other` DStream. Hash partitioning is used to generate the RDDs with Spark's default + * number of partitions. + */ + def fullOuterJoin[W](other: JavaPairDStream[K, W]) + : JavaPairDStream[K, (Optional[V], Optional[W])] = { --- End diff -- No worries :)
[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1395#issuecomment-50684019 Cool, thanks!
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50705813 Hi, I've updated the patch to work with the new code in master.
[GitHub] spark pull request: Minor indentation and comment typo fixes.
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/1630 Minor indentation and comment typo fixes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark minor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1630.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1630 commit 8566467a4230fed02705c2a6a5cc4a68863ebc47 Author: Aaron Staple asta...@gmail.com Date: 2014-07-29T04:05:27Z Fix off by one column indentation in SqlParser. commit 6f295a2dc5f8e83991985cfbe960bfc119661099 Author: Aaron Staple asta...@gmail.com Date: 2014-07-29T04:08:18Z Fix typos in comment about ExprId.
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/1362#discussion_r15122197 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1107,7 +1106,6 @@ class DAGScheduler( case shufDep: ShuffleDependency[_, _, _] => val mapStage = getShuffleMapStage(shufDep, stage.jobId) if (!mapStage.isAvailable) { -visitedStages += mapStage --- End diff -- Hi, I filed SPARK-2581 for follow-up work here. https://issues.apache.org/jira/browse/SPARK-2581
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1362#issuecomment-49457211 I added some patches to address the above comments and introduce a timed test. My test uses an RDD with no preferred locations in the entire dependency graph. The reason for this is that some RDD types employ mitigations for the exponential blowup problem by overriding getPreferredLocations to return the preferred locations of their parents. The easiest way to circumvent these mitigations and implement a failing test was to create a dependency graph with no preferred locations.
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/1362#discussion_r15040670 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1107,7 +1106,6 @@ class DAGScheduler( case shufDep: ShuffleDependency[_, _, _] => val mapStage = getShuffleMapStage(shufDep, stage.jobId) if (!mapStage.isAvailable) { -visitedStages += mapStage --- End diff -- Maybe this is an opportunity for me to get some scala education. I can see that, in this stageDependsOn function, the visitedRdds val is used to prevent re-visiting an RDD, but I only see two uses of visitedStages: it is constructed, and then stages are added to it. I'm not seeing how visitedStages is ever read. Is there some sort of callback when a stage is added to this HashSet? I'm an admitted scala novice, so please let me know if there's some magic going on or I'm being dense here :) Because I didn't see how visitedStages was being used, I removed this variable in my "Remove unused variable." commit. I'll go ahead and withdraw that commit for now. If the variable is in fact in use, I'd love to understand that better.
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/1362#discussion_r15040672 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1128,6 +1126,23 @@ class DAGScheduler( */ private[spark] def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = synchronized { +getPreferredLocsInternal(rdd, partition, new HashMap) + } + + /** Recursive implementation for getPreferredLocs. */ + private + def getPreferredLocsInternal( + rdd: RDD[_], + partition: Int, + visited: HashMap[RDD[_], HashSet[Int]]) --- End diff -- Sure, will do.
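The fix discussed in this diff (threading a `visited` collection through the recursive preferred-location lookup) can be modeled without Spark. This is a hedged, pure-Python sketch on a toy lineage graph, not the DAGScheduler code: the RDD class and `visits` counter below are illustrative stand-ins.

```python
class RDD:
    """Toy lineage node: a list of parent RDDs and optional preferred locations."""
    def __init__(self, parents=(), locs=()):
        self.parents = list(parents)
        self.locs = list(locs)
        self.visits = 0  # instrumentation to observe traversal cost

def preferred_locs(rdd, partition, visited=None):
    """Return preferred locations, memoizing visited (rdd, partition) pairs
    so that shared lineage is walked once instead of exponentially."""
    if visited is None:
        visited = set()
    key = (id(rdd), partition)
    if key in visited:
        return []
    visited.add(key)
    rdd.visits += 1
    if rdd.locs:
        return list(rdd.locs)
    locs = []
    for parent in rdd.parents:
        locs.extend(preferred_locs(parent, partition, visited))
    return locs

# Diamond-shaped lineage with no preferred locations anywhere, like the
# failing test described above: every level has two edges back to the same
# parent, so a naive walk would take O(2^depth) steps.
base = RDD()
node = base
for _ in range(30):
    node = RDD(parents=[node, node])
preferred_locs(node, 0)
print(base.visits)  # 1 -- with memoization each RDD is visited once
```

Without the `visited` set, the same traversal on this 30-level diamond would make roughly a billion recursive calls, which is the exponential blowup the patch avoids.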
[GitHub] spark pull request: [SPARK-2314] Override collect and take in Java...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/1421 [SPARK-2314] Override collect and take in JavaSchemaRDD, forwarding to SchemaRDD implementations. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-2314 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1421.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1421 commit 73e04dc69a5f1a7a12c861f8ead3d915b2eb4142 Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-15T17:35:10Z [SPARK-2314] Override collect and take in JavaSchemaRDD, forwarding to SchemaRDD implementations. ---
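The forwarding pattern this PR applies is that a language-facing wrapper class overrides methods to delegate to the underlying implementation's specialized versions, rather than relying on a slower inherited default. A minimal sketch with hypothetical class names (not the actual JavaSchemaRDD/SchemaRDD code):

```scala
// Hypothetical core class with specialized implementations.
class CoreSeq(private val data: Seq[Int]) {
  def collect(): Array[Int] = data.toArray
  def take(n: Int): Array[Int] = data.take(n).toArray
}

// Hypothetical language-facing wrapper: collect and take forward to the
// underlying specialized implementations instead of an inherited default.
class JavaWrapper(val underlying: CoreSeq) {
  def collect(): Array[Int] = underlying.collect()
  def take(n: Int): Array[Int] = underlying.take(n)
}
```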
[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/1395 [SPARK-546] Add full outer join to RDD and DStream. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-546 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1395.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1395 commit 0575e2f45ec26eddf7470cc9fc67f52fc53120ff Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-13T18:19:03Z Fix left outer join documentation comments. commit 217614dbe4540bb8155e3fbb6eae56f70795b28d Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-13T18:19:03Z In JavaPairDStream, make class tag specification in rightOuterJoin consistent with other functions. commit 084b2d5edda993975ae2c1794ef8bf4dea2ae11b Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-13T18:19:03Z [SPARK-546] Add full outer join to RDD and DStream. ---
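For readers unfamiliar with the operation this PR adds: in a full outer join, every key from either side appears in the result, with Option marking the missing side. A minimal plain-Scala sketch of the semantics over pair collections (assuming one value per key for simplicity; the PR's version operates on RDDs and DStreams and handles multiple values per key):

```scala
// Full-outer-join semantics: keys present on only one side are paired
// with None for the other side, rather than being dropped.
def fullOuterJoin[K, V, W](
    left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Option[V], Option[W])] = {
  val l = left.toMap
  val r = right.toMap
  (l.keySet ++ r.keySet).map(k => k -> (l.get(k), r.get(k))).toMap
}
```

By contrast, an inner join would keep only keys present on both sides, and a left/right outer join would keep all keys from one side only.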
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/1362 [SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited partitions. What do you think of a method like this for avoiding exponential path exploration in DAGScheduler's getPreferredLocs, per SPARK-695? Some minor cleanups are also included. The comment I removed in Dependency.scala seemed incorrect to me, per my commit comment: Remove apparently incorrect comment describing NarrowDependency. Each of a CartesianRDD's dependencies, for example, is a NarrowDependency, but child partitions of these dependencies may depend on shared parent partitions. For example, the cartesian product of RDDs a and b containing partitions [ a_0 ] and [ b_0, b_1 ] respectively will be partitioned as [ ( a_0, b_0 ), ( a_0, b_1 ) ]. Each child partition depends on parent partition a_0. I'm new to spark, scala (and even java) so a careful review may be in order. Thanks, Aaron You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-695 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1362.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1362 commit 16d08271b30be6d33866f4b0411623e53e1149aa Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-10T06:07:37Z [SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited partitions. commit 2f30ba15994b02713a1d52ada954dd596cf6732a Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-10T06:07:37Z Remove apparently incorrect comment describing NarrowDependency. Each of a CartesianRDD's dependencies, for example, is a NarrowDependency, but child partitions of these dependencies may depend on shared parent partitions. 
For example, the cartesian product of RDDs a and b containing partitions [ a_0 ] and [ b_0, b_1 ] respectively will be partitioned as [ ( a_0, b_0 ), ( a_0, b_1 ) ]. Each child partition depends on parent partition a_0. commit 60f348fe51344dbff6f7d26d8b8bb6db8bf29596 Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-10T06:07:37Z Clarify comment. commit 6da1786e29a250ee2e98e2d34e0306693a75b8bc Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-10T06:07:37Z Remove unused variable. commit ada3026b100824218f7c3ff332738411ef6387ff Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-10T06:07:37Z Fix indentation. ---
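The cartesian example in the commit message above can be sketched directly, showing that both child partitions depend on the single shared parent partition a_0 (which is why the removed comment's claim about NarrowDependency was incorrect):

```scala
// Partitions of the two hypothetical parent RDDs.
val aParts = Seq("a_0")
val bParts = Seq("b_0", "b_1")

// Cartesian product: one child partition per (a, b) pair.
val childParts = for (a <- aParts; b <- bParts) yield (a, b)

// The 'a'-side parent of each child partition: both children share a_0,
// so parent partitions are not exclusive to a single child.
val aParents = childParts.map(_._1)
```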