[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...

2015-03-10 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/4934#issuecomment-78192837
  
Hi, replying to some of the statements above:

> It seems @staple has already implemented backtracking (because he has 
results in the JIRA), but kept them out of this PR to keep it simple, so we can 
tackle that afterwards.

I wrote a backtracking implementation (and checked that it performs the 
same as the TFOCS implementation). Currently it is just a port of the TFOCS 
version. I'd need a little time to make it Scala / Spark idiomatic, but the 
turnaround would be pretty fast.

> For example, if we add line search option, what is the semantic of 
agd.setStepSize(1.0).useLineSearch()

TFOCS supports a suggested initial Lipschitz value (a variable named 
'L'), which is just a starting point for line search, so a corresponding 
behavior would be to use the step size as an initial suggestion only when line 
search is enabled. It may be desirable to use a parameter name like 'L' 
instead of 'stepSize' to make the meaning clearer.

In TFOCS you can disable backtracking line search by setting several 
parameters (L, Lexact, alpha, and beta), which individually control different 
aspects of the backtracking implementation. 
For Spark it may make sense to provide backtracking modes that are 
configured explicitly: for example a fixed Lipschitz bound (no backtracking), 
backtracking line search based on the TFOCS implementation, or possibly an 
alternative line search implementation that is more conservative about 
performing round trip aggregations. Then there could be a setBacktrackingMode() 
setter to configure which mode is used.
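To make that concrete, here is a minimal sketch of what an explicit mode setting could look like; all names here (BacktrackingMode, AcceleratedGradientDescentConfig, and the individual modes) are hypothetical illustrations, not part of this PR:

    sealed trait BacktrackingMode
    case object FixedLipschitz extends BacktrackingMode          // no backtracking; use the configured step size as-is
    case object TfocsLineSearch extends BacktrackingMode         // backtracking line search ported from TFOCS
    case object ConservativeLineSearch extends BacktrackingMode  // variant performing fewer round trip aggregations

    class AcceleratedGradientDescentConfig {
      // When a line search mode is selected, the configured 'L' / stepSize would serve
      // only as an initial Lipschitz estimate rather than a fixed step size.
      private var backtrackingMode: BacktrackingMode = FixedLipschitz

      def setBacktrackingMode(mode: BacktrackingMode): this.type = {
        backtrackingMode = mode
        this
      }
    }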

Moving forward there may be a need to support additional acceleration 
algorithms in addition to Auslender and Teboulle. These might be configurable 
via a setAlgorithm() function.

> Btw, I don't think we need to stick to the current GradientDescent API. 
The accelerated gradient takes a smooth convex function which provides gradient 
and optionally the Lipschitz constant. The implementation of Nesterov's method 
doesn't need to know RDDs.

This is good to know. I had been assuming we would stick with the existing 
GradientDescent API, including the Gradient and Updater delegates. Currently the 
applySmooth and applyProjector functions (named the same as the corresponding TFOCS 
functions) serve as a bridge between the acceleration implementation 
(relatively unaware of RDDs) and Spark-specific RDD aggregations.

This seems like a good time to mention that the backtracking implementation 
in TFOCS uses a system of caching the (expensive to compute) linear operator 
component of the objective function, which significantly reduces the cost of 
backtracking. A similar implementation is possible in Spark, though the 
performance benefit may not be as significant because two round trips would 
still be required per iteration. (See p. 3 of my design doc linked in the JIRA 
for more detail.) One reason I suggested not implementing linear operator 
caching in the design doc is that it's incompatible with the existing 
Gradient interface. If we are considering an alternative interface it may be 
worth revisiting this issue.
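As a rough illustration of the caching idea only (the helper names and signatures below are assumptions with a simple least squares smooth term, not the actual TFOCS port): the linear operator output is materialized and cached in one distributed pass, so later evaluations of the smooth term during backtracking avoid recomputing the dot products against the data matrix.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.storage.StorageLevel

    object LinearCachingSketch {
      /** One distributed pass: compute and cache (label, a_i dot w) for every row. */
      def applyLinear(data: RDD[(Double, Vector)], w: Vector): RDD[(Double, Double)] =
        data.map { case (label, features) =>
          val dot = features.toArray.zip(w.toArray).map { case (a, b) => a * b }.sum
          (label, dot)
        }.persist(StorageLevel.MEMORY_AND_DISK)

      /** Evaluate a least squares loss from the cached linear output; this pass does
        * not touch the original data matrix again. */
      def applySmooth(ax: RDD[(Double, Double)]): Double =
        ax.map { case (label, dot) => 0.5 * (dot - label) * (dot - label) }.sum()
    }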

The objective function "interface" used by TFOCS involves the functions 
applyLinear (the linear operator component of the objective), applySmooth (the smooth 
portion of the objective), and applyProjector (the nonsmooth portion of the objective). In 
addition, there are a number of numeric and categorical parameters. 
Theoretically we could adopt a similar interface (with or without applyLinear, 
depending) where RDD-specific operations are encapsulated within the various 
apply* functions.
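For reference, a rough sketch of what such an interface might look like in Scala; the trait name, type parameter, and signatures are assumptions for illustration, not an agreed design:

    import org.apache.spark.mllib.linalg.Vector

    /** Sketch of a TFOCS-style objective function interface; LinearOutput abstracts
      * over whatever representation the linear operator produces (e.g. a cached RDD). */
    trait ObjectiveFunction[LinearOutput] {
      /** Linear operator component of the objective, applied to the weight vector. */
      def applyLinear(w: Vector): LinearOutput

      /** Smooth component: loss value and gradient, evaluated from the linear output. */
      def applySmooth(ax: LinearOutput): (Double, Vector)

      /** Nonsmooth component: the proximity operator applied to w with the given step size. */
      def applyProjector(w: Vector, stepSize: Double): Vector
    }

RDD aggregations would then live entirely inside concrete implementations of these methods, keeping the acceleration loop itself RDD-agnostic.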

Finally, I wanted to mention that I'm in the Bay Area and am happy to 
meet in person to discuss this project if that would be helpful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...

2015-03-10 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/4934#discussion_r26186113
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/AcceleratedGradientDescent.scala
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, norm}
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * This class optimizes a vector of weights via accelerated (proximal) 
gradient descent.
+ * The implementation is based on TFOCS [[http://cvxr.com/tfocs]], 
described in Becker, Candes, and
+ * Grant 2010.
+ * @param gradient Delegate that computes the loss function value and 
gradient for a vector of
+ * weights.
+ * @param updater Delegate that updates weights in the direction of a 
gradient.
+ */
+@DeveloperApi
+class AcceleratedGradientDescent (private var gradient: Gradient, private 
var updater: Updater)
+  extends Optimizer {
+
+  private var stepSize: Double = 1.0
+  private var convergenceTol: Double = 1e-4
+  private var numIterations: Int = 100
+  private var regParam: Double = 0.0
+
+  /**
+   * Set the initial step size, used for the first step. Default 1.0.
+   * On subsequent steps, the step size will be adjusted by the 
acceleration algorithm.
+   */
+  def setStepSize(step: Double): this.type = {
+    this.stepSize = step
+    this
+  }
+
+  /**
+   * Set the optimization convergence tolerance. Default 1e-4.
+   * Smaller values will increase accuracy but require additional 
iterations.
+   */
+  def setConvergenceTol(tol: Double): this.type = {
+    this.convergenceTol = tol
+    this
+  }
+
+  /**
+   * Set the maximum number of iterations. Default 100.
+   */
+  def setNumIterations(iters: Int): this.type = {
+    this.numIterations = iters
+    this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+    this.regParam = regParam
+    this
+  }
+
+  /**
+   * Set a Gradient delegate for computing the loss function value and 
gradient.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+    this.gradient = gradient
+    this
+  }
+
+  /**
+   * Set an Updater delegate for updating weights in the direction of a 
gradient.
+   * If regularization is used, the Updater will implement the 
regularization term's proximity
+   * operator. Thus the type of regularization penalty is configured by 
providing a corresponding
+   * Updater implementation.
+   */
+  def setUpdater(updater: Updater): this.type = {
+    this.updater = updater
+    this
+  }
+
+  /**
+   * Run accelerated gradient descent on the provided training data.
+   * @param data training data
+   * @param initialWeights initial weights
+   * @return solution vector
+   */
+  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
+    val (weights, _) = AcceleratedGradientDescent.run(
+      data,
+      gradient,
+      updater,
+      stepSize,
+      convergenceTol,
+      numIterations,
+      regParam,
+      initialWeights)
+    weights
+  }
+}
+
+/**
+ * :: DeveloperApi ::
+ * Top-level method to run accelerated (proximal) gradient descent.
+ */
+@DeveloperApi
+object AcceleratedGradientDescent extends Logging {
+  /**
+   * Run accelerated proximal gradient descent.
+   * The implementation is based on TFOCS [[http://cvxr.com/tfocs]], 
described in Becker, Candes,
+   * and Grant 2010

[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...

2015-03-06 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/4934#discussion_r25986928
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/AcceleratedGradientDescent.scala
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, norm}
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * This class optimizes a vector of weights via accelerated (proximal) 
gradient descent.
+ * The implementation is based on TFOCS [[http://cvxr.com/tfocs]], 
described in Becker, Candes, and
+ * Grant 2010.
+ * @param gradient Delegate that computes the loss function value and 
gradient for a vector of
+ * weights.
+ * @param updater Delegate that updates weights in the direction of a 
gradient.
+ */
+@DeveloperApi
+class AcceleratedGradientDescent (private var gradient: Gradient, private 
var updater: Updater)
+  extends Optimizer {
+
+  private var stepSize: Double = 1.0
+  private var convergenceTol: Double = 1e-4
+  private var numIterations: Int = 100
+  private var regParam: Double = 0.0
+
+  /**
+   * Set the initial step size, used for the first step. Default 1.0.
+   * On subsequent steps, the step size will be adjusted by the 
acceleration algorithm.
+   */
+  def setStepSize(step: Double): this.type = {
--- End diff --

@mengxr Thanks for taking a look. I was advised by Reza Zadeh to implement 
a version without line search, at least for the initial implementation.

Please see discussion here: 
https://issues.apache.org/jira/browse/SPARK-1503?focusedCommentId=14225295, and 
in the following comments. I also attached some optimization benchmarks to the 
JIRA, which include performance of both backtracking line search and non line 
search implementations. Regarding your point that it's hard to choose a proper 
stepSize: I can attest that, anecdotally, acceleration seems somewhat more 
prone to diverging at a nominal stepSize than the existing gradient descent.





[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...

2015-03-06 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/4934

[SPARK-1503][MLLIB] Initial AcceleratedGradientDescent implementation.

An implementation of accelerated gradient descent, a first-order 
optimization algorithm with faster asymptotic convergence than standard 
gradient descent.

Design discussion and benchmark results at
https://issues.apache.org/jira/browse/SPARK-1503


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-1503

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4934.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4934


commit a121bd0f6e5f2387bd502976940c08f6a4a2e4b1
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2015-02-13T22:24:07Z

[SPARK-1503][MLLIB] Initial AcceleratedGradientDescent implementation.







[GitHub] spark pull request: [SPARK-1655][MLLIB] WIP Add option for distrib...

2015-03-02 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-76874870
  
@srowen I think this stalled because I was anticipating some additional 
feedback on the Scala implementation before adding Python compatibility. But 
looking things over, I think I should just go ahead and add the rest of the 
implementation to move this from a WIP to a formal PR. I will have time to do that 
in the near future, so let's keep this PR open for now please. Thanks for the 
ping!





[GitHub] spark pull request: [SPARK-1655][MLLIB] WIP Add option for distrib...

2014-10-07 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-58237889
  
@davies, sure changed the title





[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-04 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57919277
  
@mengxr Ok, updated to address your suggestions.





[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-04 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57920855
  
Again, python tests failed because the python interface is disabled in 
order to focus on the scala implementation first.





[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-03 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57824188
  
@mengxr - I’m sure you have a lot on your plate right now, but I wanted 
to check in on this PR. Overall, how do you feel about this approach for 
SPARK-1655? I’m happy to implement it differently if you’d prefer.





[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-03 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57855256
  
@mengxr Sorry about that, in the future I’ll follow the best practice 
you’ve outlined.

Here are the take-aways from my perspective:
- Investigate use of sparse storage for the conditional distribution. I 
believe the existing implementation in master uses dense conditional 
distribution matrices, but sparse is obviously possible.
- Remove grouping of conditional probabilities, as it adds complexity and 
you mentioned you aren’t sure if it will help performance.
- Add support for predictValues with consistent partitioning.

I’ll look into all these. Thanks for your feedback!





[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-03 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57867834
  
@mengxr Sorry I misunderstood your comment on that first point. I'll just 
do the 2nd and 3rd.





[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...

2014-09-25 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2412#issuecomment-56865408
  
@davies It looks like in your #2378 you already disabled caching for 
NaiveBayes and DecisionTree. The only difference from this patch is that I 
disabled caching for ALS as well.

We discussed this a bit here: 
https://github.com/apache/spark/pull/2378#discussion_r17686208. I filed this 
ticket as a follow-up to the work on uncached input warnings 
(https://github.com/apache/spark/pull/2347). The warnings are only supposed to 
be printed if the input data is accessed repeatedly across many iterations during 
learning. That's not the case with ALS, so a warning shouldn't be printed 
there. But I can see there's a case for caching because the input data is 
accessed twice when constructing an intermediate representation of the data. I 
don't have a strong preference on whether we should or should not cache in 
Python for the ALS learner.

If you are fine with continuing to cache in Python for ALS, then there's no 
more work to be done for this ticket, SPARK-3550.





[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-25 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2347#issuecomment-56876698
  
Hi, I addressed the recent review comments and merged.





[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...

2014-09-25 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2412#issuecomment-56887646
  
@davies, sure will do





[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...

2014-09-25 Thread staple
Github user staple closed the pull request at:

https://github.com/apache/spark/pull/2412





[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-25 Thread staple
Github user staple closed the pull request at:

https://github.com/apache/spark/pull/2362





[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-25 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2347#issuecomment-56903523
  
Great, thanks. My username is 'staple'; it looks like you already assigned it to 
me.





[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/2491#discussion_r17866654
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -141,7 +214,41 @@ class NaiveBayes private (private var lambda: Double) 
extends Serializable with
   i += 1
 }
 
-    new NaiveBayesModel(labels, pi, theta)
+    new LocalNaiveBayesModel(labels, pi, theta)
+  }
+
+  private def trainDistModel(labelAggregates: RDD[(Double, (Long, BDV[Double]))]) = {
+    case class LabelAggregate(label: Double, numDocuments: Long, sumFeatures: BDV[Double])
+    val aggregated = labelAggregates.map(x => LabelAggregate(x._1, x._2._1, x._2._2))
+
+    // Compute the model's prior (pi) vector and conditional (theta) matrix for each batch of
+    // labels.
+    // NOTE In contrast to the local trainer, the piLogDenom normalization term is omitted here.
+    // Computing this term requires an additional aggregation on 'aggregated', and because the
+    // term is an additive constant it does not affect maximum a posteriori model prediction.
+    val modelBlocks = aggregated.mapPartitions(p => p.grouped(100).map { batch =>
--- End diff --

The batch size of 100 here is just an arbitrary value right now.





[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-56414916
  
Hi - Does this seem like a reasonable approach for SPARK-1655?





[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/2491#discussion_r17866616
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -232,11 +232,11 @@ class PythonMLLibAPI extends Serializable {
   def trainNaiveBayes(
       data: JavaRDD[LabeledPoint],
       lambda: Double): java.util.List[java.lang.Object] = {
-    val model = NaiveBayes.train(data.rdd, lambda)
+    // val model = NaiveBayes.train(data.rdd, lambda, "local")
--- End diff --

I disabled the python interface for now. Let’s figure out the scala 
implementation first.





[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/2491

[SPARK-1655][MLLIB] Add option for distributed naive bayes model.

Adds an option to store a naive Bayes model in a distributed fashion. The default 
behavior, in which the whole model is stored on the driver node, remains 
unchanged. NaiveBayes.train's new distMode parameter can be used to request 
that a model be distributed.

When distributed, the model is stored as an RDD of model blocks. Each block 
contains the labels and prior and conditional probabilities for a set of label 
classes, allowing fast computation of the maximum a posteriori prediction for 
each block and straightforward aggregation of these MAP predictions across 
blocks.
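For illustration, here is a minimal local sketch of that per-block MAP reduction; the ModelBlock type and field layout are assumptions, and the blocks are shown as a local Seq, whereas in the PR they live in an RDD and the same argmax reduction runs distributed:

    import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

    case class ModelBlock(labels: Array[Double], pi: BDV[Double], theta: BDM[Double])

    object DistNaiveBayesSketch {
      def predict(blocks: Seq[ModelBlock], features: BDV[Double]): Double =
        blocks.map { block =>
          // Unnormalized log posterior for each label class handled by this block.
          val scores = block.pi + block.theta * features
          val best = (0 until scores.length).maxBy(scores(_))
          (block.labels(best), scores(best)) // this block's MAP candidate
        }.maxBy(_._2)._1                     // global MAP across all blocks
    }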

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-1655

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2491.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2491


commit 4594761dd035d2d01b91fb36a9029bda9f34c4a1
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-09-22T05:02:28Z

[SPARK-1655][MLLIB] Add option for distributed naive bayes model.







[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-56425754
  
Hi - The QA tests failed in Python because I disabled the naive Bayes 
Python API in order to focus on approval of the Scala implementation first. (I 
mentioned this in a comment above as well.)





[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17686208
  
--- Diff: python/pyspark/mllib/recommendation.py ---
@@ -54,34 +64,51 @@ def __del__(self):
     def predict(self, user, product):
         return self._java_model.predict(user, product)
 
-    def predictAll(self, usersProducts):
-        usersProductsJRDD = _get_unmangled_rdd(usersProducts, _serialize_tuple)
-        return RDD(self._java_model.predict(usersProductsJRDD._jrdd),
-                   self._context, RatingDeserializer())
+    def predictAll(self, user_product):
+        assert isinstance(user_product, RDD), "user_product should be RDD of (user, product)"
+        sc = self._context
+        tuplerdd = sc._jvm.SerDe.asTupleRDD(user_product._to_java_object_rdd().rdd())
+        jresult = self._java_model.predict(tuplerdd).toJavaRDD()
+        return RDD(sc._jvm.PythonRDD.javaToPython(jresult), sc,
+                   AutoBatchedSerializer(PickleSerializer()))
 
 
 class ALS(object):
 
     @classmethod
+    def _prepare(cls, ratings):
+        assert isinstance(ratings, RDD), "ratings should be RDD"
+        first = ratings.first()
+        if not isinstance(first, Rating):
+            if isinstance(first, (tuple, list)):
+                ratings = ratings.map(lambda x: Rating(*x))
+            else:
+                raise ValueError("rating should be RDD of Rating or tuple/list")
+        # serialize them by AutoBatchedSerializer before cache to reduce the
+        # objects overhead in JVM
+        cached = ratings._reserialize(AutoBatchedSerializer(PickleSerializer())).cache()
--- End diff --

Hi, just wanted to check on your decision to cache for ALS. It looks like 
in ALS the makeLinkRDDs calls handle persistence for a transformation of the 
input data. Though there are two calls to makeLinkRDDs, so perhaps two reads of 
the input data. Are those two reads the reason for caching here?





[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17687682
  
--- Diff: python/pyspark/mllib/recommendation.py ---
@@ -54,34 +64,51 @@ def __del__(self):
     def predict(self, user, product):
         return self._java_model.predict(user, product)
 
-    def predictAll(self, usersProducts):
-        usersProductsJRDD = _get_unmangled_rdd(usersProducts, _serialize_tuple)
-        return RDD(self._java_model.predict(usersProductsJRDD._jrdd),
-                   self._context, RatingDeserializer())
+    def predictAll(self, user_product):
+        assert isinstance(user_product, RDD), "user_product should be RDD of (user, product)"
+        sc = self._context
+        tuplerdd = sc._jvm.SerDe.asTupleRDD(user_product._to_java_object_rdd().rdd())
+        jresult = self._java_model.predict(tuplerdd).toJavaRDD()
+        return RDD(sc._jvm.PythonRDD.javaToPython(jresult), sc,
+                   AutoBatchedSerializer(PickleSerializer()))
 
 
 class ALS(object):
 
     @classmethod
+    def _prepare(cls, ratings):
+        assert isinstance(ratings, RDD), "ratings should be RDD"
+        first = ratings.first()
+        if not isinstance(first, Rating):
+            if isinstance(first, (tuple, list)):
+                ratings = ratings.map(lambda x: Rating(*x))
+            else:
+                raise ValueError("rating should be RDD of Rating or tuple/list")
+        # serialize them by AutoBatchedSerializer before cache to reduce the
+        # objects overhead in JVM
+        cached = ratings._reserialize(AutoBatchedSerializer(PickleSerializer())).cache()
--- End diff --

FWIW it looks like the DecisionTree learner can do multiple reads (of 
varying types) of its input. In DecisionTree.train, 
DecisionTreeMetadata.buildMetadata does a count() on the input, 
DecisionTree.findSplitsBins can do a sampled read, and then 
TreePoint.convertToTreeRDD will do a mapped read that gets persisted. I'm not 
knowledgeable enough to know how expensive that initial count() will be without 
further investigation. But I think for DecisionTree the suggestion was not to 
cache before learning.





[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-16 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2347#issuecomment-55747592
  
Hi, per the discussion in https://github.com/apache/spark/pull/2362 the 
plan is to continue caching before deserialization from python rather than 
after, in order to minimize the cached rdd memory footprint.

This means that, without further work, warning messages will be logged for 
every python mllib regression and kmeans run. I added a patch that suppresses 
these warning messages during python runs in a way that I think is fairly 
unobtrusive. Please let me know what you think.





[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...

2014-09-16 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/2412

[SPARK-3550][MLLIB] Disable automatic rdd caching for relevant learners.

The NaiveBayes, ALS, and DecisionTree learners do not require external 
caching to prevent repeated RDD re-evaluation during learning iterations. 
NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree 
internally persist transformations of their input RDDs.
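As a sketch of the pattern being relied on here (illustrative only, not the actual learner code): the learner persists its own internal transformation of the input, so callers gain nothing by caching the raw input RDD themselves.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.mllib.regression.LabeledPoint

    object InternalCachingSketch {
      def train(input: RDD[LabeledPoint]): Unit = {
        // The representation that is iterated over repeatedly is cached internally...
        val prepared = input.map(p => (p.label, p.features)).persist(StorageLevel.MEMORY_AND_DISK)
        try {
          // ...iterative passes over `prepared` would happen here...
        } finally {
          prepared.unpersist()
        }
      }
    }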

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-3550

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2412.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2412


commit c8ff120945da4c1aa2d0c9ba81fbed79de6cab66
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-09-15T20:22:27Z

[SPARK-3550][MLLIB] Disable automatic rdd caching for relevant learners.







[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-16 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2362#issuecomment-55762293
  
@davies I created a separate PR for disabling automatic caching for some 
learners: https://github.com/apache/spark/pull/2412





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-16 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-55793986
  
Great! Thanks





[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...

2014-09-15 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/2385#discussion_r17528982
  
--- Diff: python/pyspark/traceback_utils.py ---
@@ -0,0 +1,80 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from collections import namedtuple
+import os
+import traceback
+
+
+__all__ = ["extract_concise_traceback", "SparkContext"]
--- End diff --

Looks like I also need to put JavaStackTrace here instead of SparkContext.





[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-15 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2362#issuecomment-55636095
  
@davies understood, thanks for the feedback. It sounds like for now the 
preference is to continue caching the Python-serialized version because the 
reduced memory footprint is currently worth the CPU cost of repeated 
deserialization.

Would it make sense to preserve the portions of this patch that drop 
caching for the NaiveBayes, ALS, and DecisionTree learners, which I do not 
believe require external caching to prevent repeated RDD re-evaluation during 
learning? NaiveBayes only evaluates its input RDD once, while ALS and 
DecisionTree internally persist transformations of their input RDDs.





[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-15 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2362#issuecomment-55681425
  
@mengxr I ran for 100 iterations. Loaded data from disk using Python's 
SparkContext.pickleFile() (disk is an SSD). I did not do any manual caching. For 
more details, you can also see the test script I included in my description 
above.

I also saved the logs from my test runs if those are helpful to see. During 
the 10m record run I saw many log messages about 'CacheManager: Not enough 
space to cache partition', which I interpreted as indicating a lack of caching due 
to memory exhaustion. But I haven't diagnosed the slowdown beyond that.





[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-15 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2362#issuecomment-55681746
  
Just to make sure I understand, are you saying you don't expect to see the 
measured performance hit with this PR? If so I can investigate further.





[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-15 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2362#issuecomment-55682979
  
For the PR code, it looks like on each training iteration there are 
messages about not being able to cache partitions rdd_4_5 - rdd_4_27

For the master code, it looks like on each training iteration there are 
messages about not being able to cache partitions rdd_3_13, rdd_3_15 - rdd_3_27

It looks to me like a greater proportion of the data can be cached in 
master, and the set of cached partitions seems consistent across all training 
iterations.





[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...

2014-09-15 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2385#issuecomment-55687608
  
Great! Thanks to all the reviewers.





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-15 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-55693317
  
Ok, just merged with master.





[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...

2014-09-14 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2385#issuecomment-55526943
  
Hi, the above failure in NetworkReceiverSuite.scala seems like it may be 
unrelated to this patch. The test also passed when I ran it locally.





[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-14 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2362#issuecomment-55539838
  
I ran a simple logistic regression performance test on my local machine 
(an Ubuntu desktop with 8 GB RAM). I used two data sizes: 2m records, which was not 
memory constrained, and 10m records, which was memory constrained (generating 
log messages such as `CacheManager: Not enough space to cache partition`). I 
tested without this patch, with this patch, and with a modified version of this 
patch using `MEMORY_ONLY_SER` to persist the deserialized objects. Here are the 
results (each reported runtime, in seconds, is the mean of 3 runs):

2m records:
master | 47.9099563758
w/ patch | 32.1143682798
w/ MEMORY_ONLY_SER | 79.4589416981

10m records:
master | 2130.3178509871
w/ patch | 3232.856136322
w/ MEMORY_ONLY_SER | 2772.3923886617

It looks like, running in memory, this patch provides a 33% speed 
improvement, while the `MEMORY_ONLY_SER` version is 66% slower than master. In 
the test case with insufficient memory to keep all the `cache()`-ed training 
rdd partitions cached at once, this patch is 52% slower while `MEMORY_ONLY_SER` 
is 30% slower.

I’m not that familiar with the typical mllib memory profile. Do you think 
the in-memory result here would be similar to a real world run?

Finally, here is the test script. Let me know if it seems reasonable. The 
data generation was roughly inspired by your mllib perf test in spark-perf.

Data generation:

import random

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

class NormalGenerator:
    def __init__(self):
        self.mu = random.random()
        self.sigma = random.random()

    def __call__(self, rnd):
        return rnd.normalvariate(self.mu, self.sigma)

class PointGenerator:
    def __init__(self):
        self.generators = [[NormalGenerator() for _ in range(5)] for _ in range(2)]

    def __call__(self, rnd):
        label = rnd.choice([0, 1])
        return LabeledPoint(float(label), [g(rnd) for g in self.generators[label]])

pointGenerator = PointGenerator()
sc = SparkContext()

def generatePoints(n):
    def generateData(index):
        rnd = random.Random(hash(str(index)))
        for _ in range(n / 10):
            yield pointGenerator(rnd)

    points = sc.parallelize(range(10), 10).flatMap(generateData)
    print points.count()
    points.saveAsPickleFile('logistic%.0e' % n)

generatePoints(int(2e6))
generatePoints(int(1e7))

Test:

import time
import sys

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext()
points = sc.pickleFile(sys.argv[1])
start = time.time()
model = LogisticRegressionWithSGD.train(points, 100)
print 'Runtime: ' + `(time.time() - start)`
print model.weights






[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...

2014-09-13 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/2385

[SPARK-1087] Move python traceback utilities into new traceback_utils.py 
file.

Also made minor cleanups to JavaStackTrace.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-1087

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2385.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2385


commit 10ba6ec834dd8bb0f5a6630cd9315e9f9efb
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-09-14T04:13:22Z

[SPARK-1087] Move python traceback utilities into new traceback_utils.py 
file.







[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-13 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-55516200
  
Thanks for taking a look, guys.

Hmm, it looks like the duplicate __all__ variables result from a recent 
merge. I went ahead and created a separate PR for SPARK-1087, to put the 
traceback code in its own file: https://github.com/apache/spark/pull/2385. Once 
that’s merged to master I’ll circle back to finish up this PR, making the 
include changes you requested.





[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-11 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2362#issuecomment-55303441
  
Hi, I implemented this per discussion here 
https://github.com/apache/spark/pull/2347#issuecomment-55181535, assuming I 
understood the comment correctly. The context is that we are supposed to log a 
warning when running an iterative learning algorithm on an uncached rdd. What 
originally led me to identify SPARK-3488 is that if the deserialized python 
rdds are always uncached, a warning will always be logged.
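For reference, the check in question amounts to something like the following: a simplified sketch of the uncached-input warning from SPARK-1484, with a hypothetical wrapper function rather than the actual MLlib code.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    object UncachedInputSketch {
      /** Warn if an iterative learner is handed an input RDD with no storage level set. */
      def warnIfUncached(input: RDD[_], logWarning: String => Unit): Unit = {
        if (input.getStorageLevel == StorageLevel.NONE) {
          logWarning("The input data is not directly cached, which may hurt performance if"
            + " its parent RDDs are also uncached.")
        }
      }
    }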

Obviously a meaningful performance difference would trump the 
implementation of this warning message, and I haven't measured performance - 
just discussed options in the above referenced pull request. But by way of 
comparison, is there any significant difference in memory pressure between 
caching a LabeledPoint RDD deserialized from Python and caching a LabeledPoint 
RDD created natively in Scala (which is the typical use case with a Scala rather 
than Python client)?

If I should do some performance testing, are there any examples of tests 
and infrastructure you'd suggest as a starting point?

'none' means the RDD is not cached within the Python-to-Scala MLlib 
interface, where previously it was cached. The learning algorithms for which 
RDDs are no longer cached implement their own caching internally.





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-11 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-55307274
  
@JoshRosen Great, thanks for your help!





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-11 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-55227656
  
Thanks! I'm not getting any more NPEs now. I went ahead and merged.





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-11 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-55283908
  
Hi, the failing test was CheckpointSuite / 'recovery with file input 
stream'. This test passed when I ran the tests locally, and it sometimes fails 
spuriously according to this ticket:

'flaky test case in streaming.CheckpointSuite'
https://issues.apache.org/jira/browse/SPARK-1600





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-11 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-55283980
  
I'm assuming I don't have permission to ask jenkins to run a test myself, 
right?





[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-11 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/2362

[SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant 
iterative learners.

When running an iterative learning algorithm, it makes sense that the input 
RDD be cached for improved performance. When learning is applied to a python 
RDD, previously the python RDD was always cached, then in scala that cached RDD 
was mapped to an uncached deserialized RDD, and the uncached RDD was passed to 
the learning algorithm. Since the RDD with deserialized data was uncached, 
learning algorithms would implicitly deserialize the same data repeatedly, on 
every iteration.

This patch moves RDD caching after deserialization for learning algorithms 
that should be called with a cached RDD. For algorithms that implement their 
own caching internally, the input RDD is no longer cached. Below I’ve listed 
the different learning routines accessible from python, the location where 
caching was previously enabled, and the location (if any) where caching is now 
enabled by this patch.

LogisticRegressionWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)

SVMWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)

NaiveBayes:
was: python (in _get_unmangled_labeled_point_rdd)
now: none

KMeans:
was: python (in _get_unmangled_double_vector_rdd)
now: jvm (trainKMeansModel)

ALS:
was: python (in _get_unmangled_rdd)
now: none

LinearRegressionWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)

LassoWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)

RidgeRegressionWithSGD:
was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
now: jvm (trainRegressionModel)

DecisionTree:
was: python (in _get_unmangled_labeled_point_rdd)
now: none
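
For illustration, a minimal sketch of the JVM-side pattern this patch applies 
(the method and helper names below are hypothetical, not the exact ones in 
PythonMLLibAPI): deserialize the python rows first, cache the resulting RDD, and 
only then hand it to the iterative learner.

    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    object CacheAfterDeserializationSketch {
      // Hypothetical sketch only: cache *after* deserialization so every training
      // iteration reuses deserialized LabeledPoints instead of re-deserializing
      // the python byte arrays.
      def trainWithCachedInput(
          dataBytesJRDD: JavaRDD[Array[Byte]],
          deserialize: Array[Byte] => LabeledPoint,
          train: RDD[LabeledPoint] => Unit): Unit = {
        val points = dataBytesJRDD.rdd
          .map(deserialize)                       // expensive per-row work, done once
          .persist(StorageLevel.MEMORY_AND_DISK)
        try {
          train(points)
        } finally {
          points.unpersist()
        }
      }
    }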

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-3488

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2362.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2362


commit 7042ebc2214f13c7d5d5acd28fcfa0478c1ddf2c
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-09-11T05:11:11Z

[SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant 
iterative learners.







[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-11 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/2347#discussion_r17430431
  
--- Diff: docs/mllib-linear-methods.md ---
@@ -470,7 +471,7 @@ public class LinearRegression {
 }
   }
 );
-JavaRDD<Object> MSE = new JavaDoubleRDD(valuesAndPreds.map(
+double MSE = new JavaDoubleRDD(valuesAndPreds.map(
--- End diff --

:) Thanks





[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-11 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2347#issuecomment-55288257
  
Hi, I made the requested comment changes. I also filed a separate PR for 
the caching changes: #2362





[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-10 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/2347

[SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached 
data.

Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when 
called with input data that is not cached. KMeans is implemented iteratively, 
and I believe that GeneralizedLinearAlgorithm’s current optimizers are 
iterative and its future optimizers are also likely to be iterative. 
RowMatrix’s computeSVD is iterative against an RDD when run in DistARPACK 
mode. ALS and DecisionTree are iterative as well, but they implement RDD 
caching internally so do not require a warning.

I added a warning to GeneralizedLinearAlgorithm rather than inside its 
optimizers themselves, where the iteration actually occurs, because internally 
GeneralizedLinearAlgorithm maps its input data to an uncached RDD before 
passing it to an optimizer. (In other words, the warning would be printed for 
every GeneralizedLinearAlgorithm run, regardless of whether its input is 
cached, if the warning were in GradientDescent or other optimizer.) I assume 
that use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and 
that the mapping there (adding label, intercepts and scaling) is a lightweight 
operation. Arguably a user calling an optimizer such as GradientDescent will be 
knowledgeable enough to cache their data without needing a log warning, so lack 
of a warning in the optimizers may be ok.
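
As a rough sketch of the check itself (the actual patch uses logWarning from 
Spark's Logging trait and its own message text and placement; the names below 
are illustrative):

    import org.apache.spark.Logging
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    object UncachedInputCheck extends Logging {
      // Warn once, up front, if the caller did not persist the training input.
      def warnIfUncached(input: RDD[_], algorithm: String): Unit = {
        if (input.getStorageLevel == StorageLevel.NONE) {
          logWarning(s"$algorithm is iterative but the input RDD is not cached; " +
            "this may hurt performance.")
        }
      }
    }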

This patch causes all calls to GeneralizedLinearAlgorithm from Python to 
print a warning, because the implementation in 
PythonMLLibAPI.trainRegressionModel deserializes the data from python using 
map(SerDe.deserializeLabeledPoint) to create a deserialized RDD without caching 
this new RDD. This means that deserialization must occur on every training 
iteration for RDDs originating in Python. Perhaps the python cache() call from 
_regression_train_wrapper / _get_unmangled_labeled_point_rdd should be moved to 
be after deserialization instead of before serialization. There is a similar 
issue in KMeans.

Some of the documentation examples making use of these iterative algorithms 
did not cache their training RDDs (while others did). I updated the examples to 
always cache. I also fixed some (unrelated) minor errors in the documentation 
examples.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-1484

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2347.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2347


commit 7b31102b3ad68e821a21a31ab3e49fe069c98e9e
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-09-10T14:18:17Z

Minor doc example fixes.

commit bc90b68094c32678aa41fd65756105f9d3dd414b
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-09-10T14:19:58Z

[SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached 
data.







[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-10 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2347#issuecomment-55138304
  
See above where I describe how, for python RDDs, the input data is 
automatically cached and then deserialized via a map to an uncached RDD, 
requiring deserialization of every row for every training iteration. Would it 
make sense to change this to cache after deserializing instead of before? If so 
I can file a new ticket and PR.





[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...

2014-09-10 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1706#issuecomment-55158265
  
Ok, I fixed the test failure (I forgot to update a commit after a merge) 
and addressed the review comments.





[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-10 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2347#issuecomment-55162937
  
Sure, I changed the warning message text as you suggested.

Do you think the deserialization mapping in the python RDDs I described is 
ok (a lightweight operation)? If so, I imagine it would be a problem for the 
warning message to always be printed when Python is used.





[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...

2014-09-10 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1706#issuecomment-55176205
  
Hi, it looks like the above failure came from upstream, not this patch; it was 
fixed here:
[HOTFIX] Fix scala style issue introduced by #2276

https://github.com/apache/spark/commit/26503fdf20f4181a2b390c88b83f364e6a4ccc21





[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...

2014-09-10 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1706#issuecomment-55197092
  
I fixed a compilation error that arose from a non-conflicting merge with an 
upstream commit from earlier today.





[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...

2014-09-09 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1706#issuecomment-55003039
  
The primary master change resulting in a merge conflict with this patch was 
the addition of a call to the new ExtractPythonUdfs rule in HiveContext. For 
now I decided to leave the explicit call to the ExtractPythonUdfs rule in 
place, rather than incorporate it into the Analyzer along with the CreateTables 
and PreInsertionCasts rules I moved there. I am not aware that 
ExtractPythonUdfs can affect query plan resolution, and I haven’t checked 
what the consequences would be if the ExtractPythonUdfs rule were called 
multiple times (on multiple PythonUDFs) which could happen if the rule were 
added to an Analyzer batch.





[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...

2014-09-09 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1706#issuecomment-55003190
  
For #1846, either before or after works for me.





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-09 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-55003590
  
Great! I'll go ahead and merge once #2323 is in.





[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...

2014-09-09 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/1706#discussion_r17340821
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -40,7 +40,12 @@ class Analyzer(catalog: Catalog, registry: 
FunctionRegistry, caseSensitive: Bool
   // TODO: pass this in as a parameter.
   val fixedPoint = FixedPoint(100)
 
-  val batches: Seq[Batch] = Seq(
+  /**
+   * Override to provide additional rules for the Resolution batch.
+   */
+  val extendedRules: List[Rule[LogicalPlan]] = Nil
--- End diff --

I'd made it a List in order to use the triple colon concat operator in 
Analyzer for consistency with the existing code that uses double colon, but 
sure I can change to Seq instead since that's preferred.
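
For context, a minimal sketch of what the Seq-based hook could look like from a 
subclass's point of view (the class and rule names here are illustrative, and 
the concatenation in Analyzer would switch from ::: to ++):

    import org.apache.spark.sql.catalyst.analysis.{Analyzer, Catalog, FunctionRegistry}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Placeholder rule standing in for e.g. a hive-specific resolution rule.
    object MyExtraResolutionRule extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    class ExtendedAnalyzer(catalog: Catalog, registry: FunctionRegistry)
      extends Analyzer(catalog, registry, caseSensitive = true) {
      // Extra rules contributed to the Resolution batch.
      override val extendedRules: Seq[Rule[LogicalPlan]] = Seq(MyExtraResolutionRule)
    }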





[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...

2014-09-09 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1706#issuecomment-55066004
  
Ok, sure I'll move ExtractPythonUdfs as you suggest. Thanks for taking a 
look!





[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-09-08 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1395#issuecomment-54856229
  
Sure, I fixed the new python style issue.





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-08-03 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-51014567
  
Hi folks, I’ve merged with the most recent code (pushed to my branch), but 
for some reason with the most recent merge I am getting NPEs in Kryo for 
schemas containing array data type fields in the sql.py tests. I’m away from 
home, with no real dev system and spotty internet access until Thursday, so 
unfortunately I think it’s impractical to diagnose the problem until then. 
Sorry for the delay.





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-08-02 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50981508
  
Sorry, I'm away from home and had limited time and access to attempt the 
merge last night - which I didn't finish and which, as you mentioned, messed up 
the included commits. I'll post an explicit comment here when the merge is ready.





[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-08-02 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1362#issuecomment-50981523
  
Great, thanks!





[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-07-31 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50722658
  
Sure, no problem.




[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-07-31 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1362#issuecomment-50831557
  
I fixed the recent merge conflicts.




[GitHub] spark pull request: [SPARK-2781] Check resolution of LogicalPlans ...

2014-07-31 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/1706

[SPARK-2781] Check resolution of LogicalPlans in Analyzer.

LogicalPlan contains a ‘resolved’ attribute indicating that all of its 
execution requirements have been resolved. This attribute is not checked before 
query execution. The analyzer contains a step to check that all Expressions are 
resolved, but this is not equivalent to checking all LogicalPlans. In 
particular, the Union plan’s implementation of ‘resolved’ verifies that 
the types of its children’s columns are compatible. Because the analyzer does 
not check that a Union plan is resolved, it is possible to execute a Union plan 
that outputs different types in the same column.  See SPARK-XXX for an example.

This patch adds two checks to the analyzer’s CheckResolution rule. First, 
each logical plan is checked to see if it is not resolved despite its children 
being resolved. This allows the ‘problem’ unresolved plan to be included in 
the TreeNodeException for reporting. Then as a backstop the root plan is 
checked to see if it is resolved, which recursively checks that the entire plan 
tree is resolved. Note that the resolved attribute is implemented recursively, 
and this patch also explicitly checks the resolved attribute on each logical 
plan in the tree. I assume the query plan trees will not be large enough for 
this redundant checking to meaningfully impact performance.

Because this patch starts validating that LogicalPlans are resolved before 
execution, I had to fix some cases where unresolved plans were passing through 
the analyzer as part of the implementation of the hive query system. In 
particular, HiveContext applies the CreateTables and PreInsertionCasts rules 
manually after the optimizer runs, meaning query plans are not resolved until 
after the optimizer. I moved these rules to the analyzer stage (for hive 
queries only), in the process completing a code TODO indicating the rules 
should be moved to the analyzer.

It’s worth noting that moving the CreateTables rule means introducing an 
analyzer rule with a significant side effect - in this case the side effect is 
creating a hive table. The rule will only attempt to create a table once even 
if its batch is executed multiple times, because it converts the 
InsertIntoCreatedTable plan it matches against into an InsertIntoTable. 
Additionally, these hive rules must be added to the Resolution batch rather 
than as a separate batch because hive rules may be needed to resolve 
non-root nodes, leaving the root to be resolved on a subsequent batch 
iteration. For example, the hive compatibility test auto_smb_mapjoin_14, and 
others, make use of a query plan where the root is a Union and its children are 
each a hive InsertIntoTable.

Mixing the custom hive rules with standard analyzer rules initially 
resulted in an additional failure because of different policies between spark 
sql and hive when casting a boolean to a string. Hive casts booleans to strings 
as “true” / “false” while spark sql casts booleans to strings as 
“1” / “0” (causing the cast1.q test to fail). This behavior is a result 
of the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at the 
implementation of BooleanCasts I think converting to “1”/“0” is 
potentially a programming mistake. (If the BooleanCasts rule is disabled, 
casting produces “true”/“false” instead.) I believe “true” / 
“false” should be the behavior for spark sql - I changed the behavior so 
bools are converted to “true”/“false” to be consistent with hive, and 
none of the existing spark tests failed.

Finally, in some initial testing with hive it appears that an implicit type 
coercion of boolean to string results in a lowercase string, e.g. CONCAT( TRUE, 
“” ) -> “true”, while an explicit cast produces an all caps string, e.g. 
CAST( TRUE AS STRING ) -> “TRUE”.  The change I’ve made just converts to 
lowercase strings in all cases.  I believe it is at least more correct than the 
existing spark sql implementation where all Cast expressions become “1” / 
“0”.
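
A condensed sketch of the two checks described above (using a plain 
RuntimeException where the patch uses Catalyst's TreeNodeException, and leaving 
out how the rule is wired into a batch):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    object CheckResolutionSketch extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = {
        // First check: report the specific node that is still unresolved even
        // though all of its children resolved.
        plan.foreach { p =>
          if (!p.resolved && p.children.forall(_.resolved)) {
            throw new RuntimeException(s"Unresolved plan found: $p")
          }
        }
        // Backstop: 'resolved' is defined recursively, so checking the root
        // covers the entire tree.
        if (!plan.resolved) {
          throw new RuntimeException(s"Unresolved plan tree: $plan")
        }
        plan
      }
    }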

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-2781

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1706.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1706


commit 80a1136d58d719f9f029858a42e068553be87eb0
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-31T20:42:32Z

[SPARK-2781] Check resolution of LogicalPlans in Analyzer.





[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...

2014-07-31 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1706#issuecomment-50849916
  
Sure, fixed it.




[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-07-30 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50577338
  
Sure, I’m fine with reworking based on other changes (it seems that some 
merge conflicts have already cropped up in master since I submitted my PR last 
week). I think my change set is a little simpler than the one you linked to, so 
would it make sense for me to wait until that one goes in?

I also thought I’d add a couple of notes on what I had in mind with this 
patch:

1) I added a new Row serialization pathway between python and java, based 
on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn’t 
overjoyed about doing this, but I noticed that some QueryPlans implement 
optimizations in executeCollect(), which outputs an Array[Row] rather than the 
typical RDD[Row] that can be shipped to python using the existing serialization 
code. To me it made sense to ship the Array[Row] over to python directly 
instead of converting it back to an RDD[Row] just for the purpose of sending 
the Rows to python using the existing serialization code. But let me know if 
you have any thoughts about this.

2) I moved JavaStackTrace from rdd.py to context.py. This made sense to me 
since JavaStackTrace is all about configuring a context attribute, and the 
_extract_concise_traceback function it depends on was already being called 
separately from context.py (as a ‘private’ function of rdd.py).




[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-07-30 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50577874
  
Sure, added the notes.




[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-07-30 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/1395#discussion_r15609908
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala
 ---
@@ -659,6 +659,52 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, 
V)])(
   }
 
   /**
+   * Return a new DStream by applying 'full outer join' between RDDs of 
`this` DStream and
+   * `other` DStream. Hash partitioning is used to generate the RDDs with 
Spark's default
+   * number of partitions.
+   */
+  def fullOuterJoin[W](other: JavaPairDStream[K, W])
+  : JavaPairDStream[K, (Optional[V], Optional[W])] = {
--- End diff --

It's slightly over the limit at 103 characters




[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-07-30 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1395#issuecomment-50675424
  
Thanks for taking a look.  The example provided in the descriptive comment 
for rdd.py's fullOuterJoin implementation will be executed and checked as part 
of the python test suite.  Any other concerns?
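
For reference, the Scala RDD counterpart added in this PR can be exercised 
along these lines (a small usage sketch, assuming an existing SparkContext 
named sc):

    import org.apache.spark.SparkContext._  // pair RDD functions

    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
    val right = sc.parallelize(Seq(2 -> "x", 3 -> "y"))
    // Keys present on only one side are padded with None on the other side.
    left.fullOuterJoin(right).collect()
    // e.g. Array((1,(Some(a),None)), (2,(Some(b),Some(x))), (3,(None,Some(y))))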




[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-07-30 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/1395#discussion_r15614024
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala
 ---
@@ -659,6 +659,52 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, 
V)])(
   }
 
   /**
+   * Return a new DStream by applying 'full outer join' between RDDs of 
`this` DStream and
+   * `other` DStream. Hash partitioning is used to generate the RDDs with 
Spark's default
+   * number of partitions.
+   */
+  def fullOuterJoin[W](other: JavaPairDStream[K, W])
+  : JavaPairDStream[K, (Optional[V], Optional[W])] = {
--- End diff --

No worries :)




[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-07-30 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1395#issuecomment-50684019
  
Cool, thanks!




[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-07-30 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50705813
  
Hi, I've updated the patch to work with the new code in master.




[GitHub] spark pull request: Minor indentation and comment typo fixes.

2014-07-28 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/1630

Minor indentation and comment typo fixes.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1630.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1630


commit 8566467a4230fed02705c2a6a5cc4a68863ebc47
Author: Aaron Staple asta...@gmail.com
Date:   2014-07-29T04:05:27Z

Fix off by one column indentation in SqlParser.

commit 6f295a2dc5f8e83991985cfbe960bfc119661099
Author: Aaron Staple asta...@gmail.com
Date:   2014-07-29T04:08:18Z

Fix typos in comment about ExprId.






[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-07-18 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/1362#discussion_r15122197
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1107,7 +1106,6 @@ class DAGScheduler(
case shufDep: ShuffleDependency[_, _, _] =>
   val mapStage = getShuffleMapStage(shufDep, stage.jobId)
   if (!mapStage.isAvailable) {
-visitedStages += mapStage
--- End diff --

Hi, I filed SPARK-2581 for follow-up work here.
https://issues.apache.org/jira/browse/SPARK-2581




[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-07-18 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1362#issuecomment-49457211
  
I added some patches to address the above comments and introduce a timed 
test.

My test uses an RDD with no preferred locations in the entire dependency 
graph.  The reason for this is that some RDD types employ mitigations for the 
exponential blowup problem by overriding getPreferredLocations to return the 
preferred locations of their parents.  The easiest way to circumvent these 
mitigations and implement a failing test was to create a dependency graph with 
no preferred locations.
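
For what it's worth, the shape of such a graph can be sketched like this (a 
hedged sketch, not the literal test code): a single-partition RDD with no 
preferred locations, repeatedly combined with itself through cartesian so each 
level adds two narrow dependencies while the partition count stays at one.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def deepGraphWithNoPreferredLocations(sc: SparkContext, depth: Int): RDD[Any] = {
      // parallelize() reports no preferred locations; one partition keeps the
      // partition count at 1 even after repeated cartesian products.
      var rdd: RDD[Any] = sc.parallelize(1 to 10, numSlices = 1).map(x => x: Any)
      for (_ <- 1 to depth) {
        // CartesianRDD contributes two narrow dependencies per level, so a
        // non-memoized getPreferredLocs walk costs roughly 2^depth.
        rdd = rdd.cartesian(rdd).map(x => x: Any)
      }
      rdd
    }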


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-07-16 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/1362#discussion_r15040670
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1107,7 +1106,6 @@ class DAGScheduler(
case shufDep: ShuffleDependency[_, _, _] =>
   val mapStage = getShuffleMapStage(shufDep, stage.jobId)
   if (!mapStage.isAvailable) {
-visitedStages += mapStage
--- End diff --

Maybe this is an opportunity for me to get some scala education.  I can see 
that, in this stageDependsOn function, the visitedRdds val is used to prevent 
re-visiting an RDD, but I only see two uses of _visitedStages_.  visitedStages 
is constructed, and then stages are added to it.  I’m not seeing how 
visitedStages is being used.  Is there some sort of callback when a stage is 
added to this HashSet?  I’m an admitted scala novice, so please let me know 
if there’s some magic going on or I’m being dense here :)

Because I didn’t see how visitedStages was being used, I removed this 
variable in my “Remove unused variable.” commit.  I’ll go ahead and 
withdraw that commit for now.  If the variable is in fact in use, I’d love to 
understand that better.




[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-07-16 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/1362#discussion_r15040672
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1128,6 +1126,23 @@ class DAGScheduler(
*/
   private[spark]
   def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = 
synchronized {
+getPreferredLocsInternal(rdd, partition, new HashMap)
+  }
+
+  /** Recursive implementation for getPreferredLocs. */
+  private
+  def getPreferredLocsInternal(
+  rdd: RDD[_],
+  partition: Int,
+  visited: HashMap[RDD[_], HashSet[Int]])
--- End diff --

Sure, will do.





[GitHub] spark pull request: [SPARK-2314] Override collect and take in Java...

2014-07-15 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/1421

[SPARK-2314] Override collect and take in JavaSchemaRDD, forwarding to 
SchemaRDD implementations.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-2314

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1421


commit 73e04dc69a5f1a7a12c861f8ead3d915b2eb4142
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-15T17:35:10Z

[SPARK-2314] Override collect and take in JavaSchemaRDD, forwarding to 
SchemaRDD implementations.






[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-07-13 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/1395

[SPARK-546] Add full outer join to RDD and DStream.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-546

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1395


commit 0575e2f45ec26eddf7470cc9fc67f52fc53120ff
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-13T18:19:03Z

Fix left outer join documentation comments.

commit 217614dbe4540bb8155e3fbb6eae56f70795b28d
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-13T18:19:03Z

In JavaPairDStream, make class tag specification in rightOuterJoin 
consistent with other functions.

commit 084b2d5edda993975ae2c1794ef8bf4dea2ae11b
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-13T18:19:03Z

[SPARK-546] Add full outer join to RDD and DStream.






[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-07-10 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/1362

[SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited 
partitions.

What do you think of a method like this for avoiding exponential path 
exploration in DAGScheduler's getPreferredLocs, per SPARK-695?
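
Roughly, the change is the memoized recursion sketched below (simplified: cache 
locations and TaskLocation handling are omitted, and the names are not the exact 
ones in the patch, which tracks visited partitions per RDD):

    import scala.collection.mutable
    import org.apache.spark.NarrowDependency
    import org.apache.spark.rdd.RDD

    def preferredLocsSketch(
        rdd: RDD[_],
        partition: Int,
        visited: mutable.HashSet[(RDD[_], Int)] = mutable.HashSet.empty): Seq[String] = {
      // Each (rdd, partition) pair is explored at most once, so shared ancestors
      // in the dependency graph no longer cause exponential re-traversal.
      if (!visited.add((rdd, partition))) return Nil
      val own = rdd.preferredLocations(rdd.partitions(partition))
      if (own.nonEmpty) return own
      for (dep <- rdd.dependencies) dep match {
        case narrow: NarrowDependency[_] =>
          for (parentPartition <- narrow.getParents(partition)) {
            val locs = preferredLocsSketch(narrow.rdd, parentPartition, visited)
            if (locs.nonEmpty) return locs
          }
        case _ => // stop at shuffle boundaries, as the scheduler does
      }
      Nil
    }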

Some minor cleanups are also included.  The comment I removed in 
Dependency.scala seemed incorrect to me, per my commit comment:

Remove apparently incorrect comment describing NarrowDependency.
  
Each of a CartesianRDD's dependencies, for example, is a 
NarrowDependency, but child partitions of
these dependencies may depend on shared parent partitions.  For 
example, the cartesian product of
RDDs a and b containing partitions [ a_0 ] and [ b_0, b_1 ] 
respectively will be partitioned as
[ ( a_0, b_0 ), ( a_0, b_1 ) ].  Each child partition depends on parent 
partition a_0.

I'm new to spark, scala (and even java) so a careful review may be in order.

Thanks,
Aaron

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-695

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1362.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1362


commit 16d08271b30be6d33866f4b0411623e53e1149aa
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-10T06:07:37Z

[SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited 
partitions.

commit 2f30ba15994b02713a1d52ada954dd596cf6732a
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-10T06:07:37Z

Remove apparently incorrect comment describing NarrowDependency.

Each of a CartesianRDD's dependencies, for example, is a NarrowDependency, 
but child partitions of
these dependencies may depend on shared parent partitions.  For example, 
the cartesian product of
RDDs a and b containing partitions [ a_0 ] and [ b_0, b_1 ] respectively 
will be partitioned as
[ ( a_0, b_0 ), ( a_0, b_1 ) ].  Each child partition depends on parent 
partition a_0.

commit 60f348fe51344dbff6f7d26d8b8bb6db8bf29596
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-10T06:07:37Z

Clarify comment.

commit 6da1786e29a250ee2e98e2d34e0306693a75b8bc
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-10T06:07:37Z

Remove unused variable.

commit ada3026b100824218f7c3ff332738411ef6387ff
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-10T06:07:37Z

Fix indentation.



