[GitHub] spark pull request: [SPARK-2635] Fix race condition at SchedulerBa...

2014-08-01 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/1525#issuecomment-50855070
  
I will take a look at this tomorrow.


On Thu, Jul 31, 2014 at 10:37 PM, Zhihui Li 
wrote:

> @tgravescs  @kayousterhout
>  can you close this PR before the code
> freeze of the 1.1 release? Otherwise, it would result in an incompatible
> configuration property name, because the PR renames
> spark.scheduler.maxRegisteredExecutorsWaitingTime to
> spark.scheduler.maxRegisteredResourcesWaitingTime
>
> —
> Reply to this email directly or view it on GitHub
> .
>
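
A minimal sketch of the kind of backward-compatible lookup being discussed, assuming a plain `SparkConf` (the fallback logic and default value here are illustrative, not the PR's actual code):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Prefer the new property name, but fall back to the old one so existing
// deployments keep working across the rename.
val waitTime = conf.getOption("spark.scheduler.maxRegisteredResourcesWaitingTime")
  .orElse(conf.getOption("spark.scheduler.maxRegisteredExecutorsWaitingTime"))
  .getOrElse("30000")
```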


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...

2014-08-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1578#issuecomment-50855127
  
Thanks for the changes! I've merged this in.




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-08-01 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50855187
  
@mengxr  done




[GitHub] spark pull request: [SPARK-1812] upgrade dependency to scala-loggi...

2014-08-01 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1701#issuecomment-50855246
  
@pwendell Jenkins does not listen to my commands. Can you command it?




[GitHub] spark pull request: [WIP] SPARK-2157 Ability to write tight firewa...

2014-08-01 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/1107#issuecomment-50855345
  
I haven't tested this on an actual locked-down cluster yet -- so far I've
just been looking at netstat output.
On Jul 30, 2014 10:22 PM, "Patrick Wendell" 
wrote:

> @ash211  made some small comments but overall
> looks good. One thing - have you tested this on a cluster? I want to make
> sure we cover every possible connection here (I wonder if there are some 
we
> might have missed). I think the only real way to do it is to totally lock
> down a cluster and see if Spark can run.
>
> —
> Reply to this email directly or view it on GitHub
> .
>




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-08-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/940#discussion_r15684374
  
--- Diff: mllib/pom.xml ---
@@ -60,6 +60,14 @@
   junit
   junit
 
+

[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1713#discussion_r15684424
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala ---
@@ -49,43 +49,48 @@ private[stat] trait Correlation {
 }
 
 /**
- * Delegates computation to the specific correlation object based on the input method name
- *
- * Currently supported correlations: pearson, spearman.
- * After new correlation algorithms are added, please update the documentation here and in
- * Statistics.scala for the correlation APIs.
- *
- * Maintains the default correlation type, pearson
+ * Delegates computation to the specific correlation object based on the input method name.
  */
 private[stat] object Correlations {
 
-  // Note: after new types of correlations are implemented, please update this map
-  val nameToObjectMap = Map(("pearson", PearsonCorrelation), ("spearman", SpearmanCorrelation))
-  val defaultCorrName: String = "pearson"
-  val defaultCorr: Correlation = nameToObjectMap(defaultCorrName)
-
-  def corr(x: RDD[Double], y: RDD[Double], method: String = defaultCorrName): Double = {
+  def corr(x: RDD[Double],
+      y: RDD[Double],
+      method: String = CorrelationNames.defaultCorrName): Double = {
     val correlation = getCorrelationFromName(method)
     correlation.computeCorrelation(x, y)
   }
 
-  def corrMatrix(X: RDD[Vector], method: String = defaultCorrName): Matrix = {
+  def corrMatrix(X: RDD[Vector],
+      method: String = CorrelationNames.defaultCorrName): Matrix = {
     val correlation = getCorrelationFromName(method)
     correlation.computeCorrelationMatrix(X)
   }
 
-  /**
-   * Match input correlation name with a known name via simple string matching
-   *
-   * private to stat for ease of unit testing
-   */
-  private[stat] def getCorrelationFromName(method: String): Correlation = {
+  // Match input correlation name with a known name via simple string matching.
+  def getCorrelationFromName(method: String): Correlation = {
     try {
-      nameToObjectMap(method)
+      CorrelationNames.nameToObjectMap(method)
     } catch {
      case nse: NoSuchElementException =>
        throw new IllegalArgumentException("Unrecognized method name. Supported correlations: "
-          + nameToObjectMap.keys.mkString(", "))
+          + CorrelationNames.nameToObjectMap.keys.mkString(", "))
    }
   }
 }
+
+/**
+ * Maintains supported and default correlation names.
+ *
+ * Currently supported correlations: `pearson`, `spearman`.
+ * Current default correlation: `pearson`.
+ *
+ * After new correlation algorithms are added, please update the documentation here and in
+ * Statistics.scala for the correlation APIs.
+ */
+object CorrelationNames {
--- End diff --

Is it supposed to be a public API?




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-08-01 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/940#discussion_r15684437
  
--- Diff: mllib/pom.xml ---
@@ -60,6 +60,14 @@
   junit
   junit
 
+

[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-08-01 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15684457
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala ---
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.spark.SparkConf
+import org.apache.spark.mllib.util.MLStreamingUtils
+import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
+import org.apache.spark.streaming.{Seconds, StreamingContext}
+
+/**
+ * Continually update a model on one stream of data using streaming linear regression,
+ * while making predictions on another stream of data
+ *
+ */
+object StreamingLinearRegression {
+
+  def main(args: Array[String]) {
+
+    if (args.length != 4) {
+      System.err.println(
+        "Usage: StreamingLinearRegression <trainingDir> <testDir> <batchDuration> <numFeatures>")
+      System.exit(1)
+    }
+
+    val conf = new SparkConf().setMaster("local").setAppName("StreamingLinearRegression")
+    val ssc = new StreamingContext(conf, Seconds(args(2).toLong))
+
+    val trainingData = MLStreamingUtils.loadLabeledPointsFromText(ssc, args(0))
--- End diff --

I like this idea, but I can actually trivially remove ``MLStreamingUtils`` 
by just moving its ``loadStreamingLabeledPoints`` into ``MLUtils``. Maybe we 
can rename LabeledPoint in another PR. 




[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-08-01 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15684474
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearRegression.scala ---
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.annotation.Experimental
+
+/**
+ * Train or predict a linear regression model on streaming data. Training uses
+ * Stochastic Gradient Descent to update the model based on each new batch of
+ * incoming data from a DStream (see LinearRegressionWithSGD for model equation)
+ *
+ * Each batch of data is assumed to be an RDD of LabeledPoints.
+ * The number of data points per batch can vary, but the number
+ * of features must be constant.
+ */
+@Experimental
+class StreamingLinearRegressionWithSGD private (
+    private var stepSize: Double,
+    private var numIterations: Int,
+    private var miniBatchFraction: Double,
+    private var numFeatures: Int)
+  extends StreamingRegression[LinearRegressionModel, LinearRegressionWithSGD] with Serializable {
+
+  val algorithm = new LinearRegressionWithSGD(stepSize, numIterations, miniBatchFraction)
+
+  var model = algorithm.createModel(Vectors.dense(new Array[Double](numFeatures)), 0.0)
+
+}
+
+/**
+ * Top-level methods for calling StreamingLinearRegressionWithSGD.
+ */
+@Experimental
+object StreamingLinearRegressionWithSGD {
+
+  /**
+   * Start a streaming Linear Regression model by setting optimization parameters.
+   *
+   * @param numIterations Number of iterations of gradient descent to run.
+   * @param stepSize Step size to be used for each iteration of gradient descent.
+   * @param miniBatchFraction Fraction of data to be used per iteration.
+   * @param numFeatures Number of features per record, must be constant for all batches of data.
+   */
+  def start(
--- End diff --

OK, I added setters and use them that way in the example, but kept the static
``start`` method for consistency with the others; we can always drop it later.
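
A rough sketch of the setter-based usage described above; the no-arg constructor and setter names are assumptions based on this discussion, not necessarily the merged API:

```
// Hypothetical builder-style configuration (names assumed from the comment above).
val model = new StreamingLinearRegressionWithSGD()
  .setStepSize(0.1)
  .setNumIterations(50)
```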




[GitHub] spark pull request: [SPARK-2678][Core] Prevents `spark-submit` fro...

2014-08-01 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1699#issuecomment-50855873
  
Thanks for the great feedback, I agree, will update soon.




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1713#discussion_r15684527
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala ---
@@ -49,43 +49,48 @@ private[stat] trait Correlation {
 }
 
 /**
- * Delegates computation to the specific correlation object based on the input method name
- *
- * Currently supported correlations: pearson, spearman.
- * After new correlation algorithms are added, please update the documentation here and in
- * Statistics.scala for the correlation APIs.
- *
- * Maintains the default correlation type, pearson
+ * Delegates computation to the specific correlation object based on the input method name.
  */
 private[stat] object Correlations {
 
-  // Note: after new types of correlations are implemented, please update this map
-  val nameToObjectMap = Map(("pearson", PearsonCorrelation), ("spearman", SpearmanCorrelation))
-  val defaultCorrName: String = "pearson"
-  val defaultCorr: Correlation = nameToObjectMap(defaultCorrName)
-
-  def corr(x: RDD[Double], y: RDD[Double], method: String = defaultCorrName): Double = {
+  def corr(x: RDD[Double],
+      y: RDD[Double],
+      method: String = CorrelationNames.defaultCorrName): Double = {
     val correlation = getCorrelationFromName(method)
     correlation.computeCorrelation(x, y)
   }
 
-  def corrMatrix(X: RDD[Vector], method: String = defaultCorrName): Matrix = {
+  def corrMatrix(X: RDD[Vector],
+      method: String = CorrelationNames.defaultCorrName): Matrix = {
     val correlation = getCorrelationFromName(method)
     correlation.computeCorrelationMatrix(X)
   }
 
-  /**
-   * Match input correlation name with a known name via simple string matching
-   *
-   * private to stat for ease of unit testing
-   */
-  private[stat] def getCorrelationFromName(method: String): Correlation = {
+  // Match input correlation name with a known name via simple string matching.
+  def getCorrelationFromName(method: String): Correlation = {
     try {
-      nameToObjectMap(method)
+      CorrelationNames.nameToObjectMap(method)
    } catch {
      case nse: NoSuchElementException =>
        throw new IllegalArgumentException("Unrecognized method name. Supported correlations: "
-          + nameToObjectMap.keys.mkString(", "))
+          + CorrelationNames.nameToObjectMap.keys.mkString(", "))
    }
   }
 }
+
+/**
+ * Maintains supported and default correlation names.
+ *
+ * Currently supported correlations: `pearson`, `spearman`.
+ * Current default correlation: `pearson`.
+ *
+ * After new correlation algorithms are added, please update the documentation here and in
+ * Statistics.scala for the correlation APIs.
+ */
+object CorrelationNames {
--- End diff --

I originally planned on using it inside of pyspark but `private[mllib]` is 
sufficient scope now.




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50855852
  
QA tests have started for PR 940. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17660/consoleFull




[GitHub] spark pull request: [WIP][SPARK-2316] Avoid O(blocks) operations i...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50855850
  
QA tests have started for PR 1679. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17659/consoleFull




[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50855896
  
Jenkins, test this please




[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...

2014-08-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1578




[GitHub] spark pull request: SPARK-2738. Remove redundant imports in BlockM...

2014-08-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1642




[GitHub] spark pull request: [SPARK-2678][Core] Prevents `spark-submit` fro...

2014-08-01 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1699#issuecomment-50855977
  
Also, thank you @vanzin! You're right, it can be done in a backward-compatible
yet simple way.




[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50856200
  
QA tests have started for PR 1418. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17661/consoleFull




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1713#discussion_r15684617
  
--- Diff: python/pyspark/mllib/stat.py ---
@@ -0,0 +1,103 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Python package for statistical functions in MLlib.
+"""
+
+from pyspark.mllib._common import \
+    _get_unmangled_double_vector_rdd, _get_unmangled_rdd, \
+    _serialize_double, _serialize_double_vector, \
+    _deserialize_double, _deserialize_double_matrix
+
+class Statistics(object):
+
+    @staticmethod
+    def corr(x, y=None, method=None):
+        """
+        Compute the correlation (matrix) for the input RDD(s) using the
+        specified method.
+        Methods currently supported: I{pearson (default), spearman}.
+
+        If a single RDD of Vectors is passed in, a correlation matrix
+        comparing the columns in the input RDD is returned. Note that the
+        method name can be passed in as the second argument without C{method=}.
--- End diff --

So it also "supports" `corr(x, y="spearman")`. This adds some convenience
along with some confusion. If `X` is a matrix, the user should use
`corr(X, method="spearman")`. Example:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html
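
For reference, a usage sketch of the Scala API these Python stubs delegate to, assuming a live `SparkContext` named `sc`. In Scala the method name is an explicit `String` parameter, so the positional ambiguity discussed above is specific to the Python wrapper:

```
import org.apache.spark.mllib.stat.Statistics

val x = sc.parallelize(Seq(1.0, 0.0, -2.0))
val y = sc.parallelize(Seq(4.0, 5.0, 3.0))
val pearson  = Statistics.corr(x, y)              // defaults to "pearson"
val spearman = Statistics.corr(x, y, "spearman")  // method named explicitly
```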




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1713#discussion_r15684640
  
--- Diff: python/pyspark/mllib/stat.py ---
@@ -0,0 +1,103 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Python package for statistical functions in MLlib.
+"""
+
+from pyspark.mllib._common import \
+    _get_unmangled_double_vector_rdd, _get_unmangled_rdd, \
+    _serialize_double, _serialize_double_vector, \
+    _deserialize_double, _deserialize_double_matrix
+
+class Statistics(object):
+
+    @staticmethod
+    def corr(x, y=None, method=None):
+        """
+        Compute the correlation (matrix) for the input RDD(s) using the
+        specified method.
+        Methods currently supported: I{pearson (default), spearman}.
+
+        If a single RDD of Vectors is passed in, a correlation matrix
+        comparing the columns in the input RDD is returned. Note that the
+        method name can be passed in as the second argument without C{method=}.
+        If two RDDs of floats are passed in, a single float is returned.
+
+        >>> x = sc.parallelize([1.0, 0.0, -2.0], 2)
+        >>> y = sc.parallelize([4.0, 5.0, 3.0], 2)
+        >>> zeros = sc.parallelize([0.0, 0.0, 0.0], 2)
+        >>> abs(Statistics.corr(x, y) - 0.6546537) < 1e-7
+        True
+        >>> Statistics.corr(x, y) == Statistics.corr(x, y, "pearson")
+        True
+        >>> Statistics.corr(x, y, "spearman")
+        0.5
+        >>> from math import isnan
+        >>> isnan(Statistics.corr(x, zeros))
+        True
+        >>> from linalg import Vectors
+        >>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
+        ...                       Vectors.dense([6, 7, 0,  8]), Vectors.dense([9, 0, 0, 1])])
+        >>> Statistics.corr(rdd)
+        array([[ 1.        ,  0.05564149,         nan,  0.40047142],
+               [ 0.05564149,  1.        ,         nan,  0.91359586],
+               [        nan,         nan,  1.        ,         nan],
+               [ 0.40047142,  0.91359586,         nan,  1.        ]])
+        >>> Statistics.corr(rdd, "spearman")
+        array([[ 1.        ,  0.10540926,         nan,  0.4       ],
+               [ 0.10540926,  1.        ,         nan,  0.9486833 ],
+               [        nan,         nan,  1.        ,         nan],
+               [ 0.4       ,  0.9486833 ,         nan,  1.        ]])
+        """
+        sc = x.ctx
+        # Check inputs to determine whether a single value or a matrix is needed for output.
+        # Since it's legal for users to use the method name as the second argument, we need to
+        # check if y is used to specify the method name instead.
+        if type(y) == str:
+            if not method:
+                method = y
+            else:
+                raise TypeError("Multiple string arguments detected when only at most one " \
+                                + "allowed for Statistics.corr")
+        if not y or type(y) == str:
+            try:
+                Xser = _get_unmangled_double_vector_rdd(x)
+            except TypeError:
+                raise TypeError("corr called on a single RDD not consisted of Vectors.")
+            resultMat = sc._jvm.PythonMLLibAPI().corr(Xser._jrdd, method)
+            return _deserialize_double_matrix(resultMat)
+        else:
+            xSer = _get_unmangled_rdd(x, _serialize_double)
+            ySer = _get_unmangled_rdd(y, _serialize_double)
+            result = sc._jvm.PythonMLLibAPI().corr(xSer._jrdd, ySer._jrdd, method)
+            return result
+
+
+def _test():
+    import doctest
+    from pyspark import SparkContext
+    globs = globals().copy()
+    globs['sc'] = SparkContext('local[4]', 'PythonTest', batchSize=2)
+    (failure_count, test_count) = doctest.testmod(globs=globs, opt

[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1369#issuecomment-50856657
  
QA results for PR 1369:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17651/consoleFull




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50856746
  
@ueshin   I mostly agree, except:

let us keep "length", which can be used for non-strings, e.g. length(12345678) = 8.

Then, since length handles strings as well by using codePoints as you suggest,
CharLength is a synonym for Length (not the other way around).

As for OctetLength: I am quite fine with that and will implement it according
to your suggestion.
 




[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-08-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1362#issuecomment-50856749
  
Jenkins, test this please




[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-08-01 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1362#discussion_r15684786
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1131,6 +1131,23 @@ class DAGScheduler(
    */
   private[spark]
   def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = synchronized {
+    getPreferredLocsInternal(rdd, partition, new HashSet)
+  }
+
+  /** Recursive implementation for getPreferredLocs. */
+  private
+  def getPreferredLocsInternal(
--- End diff --

Small code style issue: "private" and "def" should be on the same line




[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-08-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1362#issuecomment-50856865
  
Sorry for taking a bit of time to get to this, but it looks good. I'll 
merge it if the tests pass.




[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-08-01 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1362#discussion_r15684804
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -291,6 +293,18 @@ class DAGSchedulerSuite extends TestKit(ActorSystem("DAGSchedulerSuite")) with F
     assertDataStructuresEmpty
   }
 
+  test("avoid exponential blowup when getting preferred locs list") {
+    // Build up a complex dependency graph with repeated zip operations, without preferred locations.
+    var rdd: RDD[_] = new MyRDD(sc, 1, Nil)
+    (1 to 30).foreach(_ => rdd = rdd.zip(rdd))
+    // getPreferredLocs runs quickly, indicating that exponential graph traversal is avoided.
+    failAfter(5 seconds) {
--- End diff --

Actually another small thing you might fix is increasing this to 10 
seconds, since you sometimes have garbage collection in tests that takes 
longer. I'm assuming it would take minutes with the old code.
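
For context on why the old code blew up: `rdd.zip(rdd)` gives a node two dependencies on the same parent, so a naive recursive walk revisits that parent once per path, and 30 zips mean on the order of 2^30 visits. A minimal sketch of the memoized traversal idea (an illustrative helper, not the PR's code):

```
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Memoized DFS over the dependency graph: each distinct RDD is expanded once.
def visit(rdd: RDD[_], visited: mutable.HashSet[RDD[_]]): Unit = {
  if (visited.add(rdd)) {  // add returns false if the RDD was already seen
    rdd.dependencies.foreach(dep => visit(dep.rdd, visited))
  }
}
```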




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15684813
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -991,6 +994,9 @@ class SparkContext(config: SparkConf) extends Logging {
     dagScheduler = null
     if (dagSchedulerCopy != null) {
       metadataCleaner.cancel()
+      if (heartbeatReceiver != null) {
--- End diff --

I think I was just being overly defensive




[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1464#issuecomment-50857070
  
QA results for PR 1464:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17655/consoleFull




[GitHub] spark pull request: SPARK-2134: Report metrics before application ...

2014-08-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1076#issuecomment-50857088
  
Alright, I'm going to merge this as is then. Thanks!




[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...

2014-08-01 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1464#issuecomment-50857152
  
Merging into master. Thanks!




[GitHub] spark pull request: [WIP][SPARK-2316] Avoid O(blocks) operations i...

2014-08-01 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50857283
  
I did some benchmarking by running the following job 100 times, one
immediately after another. Each job launches many short-lived tasks, each of
which persists a single block. The minimality of each task allows the listener
bus to keep posting events very quickly while placing a lot of stress on the
listeners consuming the events.
```
sc.parallelize(1 to 2, 100).persist().count()
```
**Before:** The max queue length observed reaches 1 at around the 65th 
job, and finally reaches 16730 after the last job. Before this PR, this is 
enough to cause the queue to start dropping events. The average time spent in 
`StorageUtils.updateRddInfo` (this was renamed) is 176.25ms.

**After:** The max queue length never went above 130, and the average time 
spent in `StorageUtils.updateRddInfo` is 15.47ms, more than 10 times faster 
than before.

The dark side of the story (there is always a dark side), however, is that
this improvement is only observed for RDDs without too many partitions.
Although the new code iterates through only a few RDDs' blocks instead of all
RDD blocks known to mankind, it is still slow if, say, a single RDD contains all
the blocks, in which case we still have to iterate through all the RDD blocks.
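
A rough sketch of the indexing idea behind the speedup (illustrative data structures, not the PR's actual code): bucket blocks by RDD id so an update only touches the affected RDD's bucket instead of scanning every block known to the listener.

```
import scala.collection.mutable

// blockName -> size, bucketed per RDD id.
val blocksByRdd = mutable.HashMap.empty[Int, mutable.HashMap[String, Long]]

def updateBlock(rddId: Int, blockName: String, size: Long): Unit =
  blocksByRdd.getOrElseUpdate(rddId, mutable.HashMap.empty)(blockName) = size

// Only the RDDs named in an update need to be re-aggregated.
def rddSize(rddId: Int): Long =
  blocksByRdd.get(rddId).map(_.values.sum).getOrElse(0L)
```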




[GitHub] spark pull request: [SQL][SPARK-2212]Hash Outer Join

2014-08-01 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1147#issuecomment-50857406
  
Thank you @marmbrus, I've updated the code as suggested.




[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-08-01 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1313#discussion_r15684978
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -341,20 +354,27 @@ private[spark] class TaskSetManager(
    *
    * @return An option containing (task index within the task set, locality, is speculative?)
    */
-  private def findTask(execId: String, host: String, locality: TaskLocality.Value)
-    : Option[(Int, TaskLocality.Value, Boolean)] =
+  private def findTask(execId: String, host: String, maxLocality: TaskLocality.Value)
+  : Option[(Int, TaskLocality.Value, Boolean)] =
--- End diff --

Small style issue: you need to indent the `:` by four more spaces




[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-08-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-50857528
  
Awesome, looks like this passes now! What was it -- was there a bug, or were
the tests being flaky?




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1713#discussion_r15685005
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
     ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
   }
 
+  /**
+   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
+   * Returns the correlation matrix serialized into a byte array understood by deserializers in
+   * pyspark.
+   */
+  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
+    val inputMatrix = X.rdd.map(deserializeDoubleVector(_))
+    val result = Statistics.corr(inputMatrix, getCorrNameOrDefault(method))
+    serializeDoubleMatrix(to2dArray(result))
+  }
+
+  /**
+   * Java stub for mllib Statistics.corr(x: RDD[Double], y: RDD[Double], method: String).
+   */
+  def corr(x: JavaRDD[Array[Byte]], y: JavaRDD[Array[Byte]], method: String): Double = {
+    val xDeser = x.rdd.map(deserializeDouble(_))
+    val yDeser = y.rdd.map(deserializeDouble(_))
+    Statistics.corr(xDeser, yDeser, getCorrNameOrDefault(method))
+  }
+
+  // used by the corr methods to retrieve the name of the correlation method passed in via pyspark
+  private def getCorrNameOrDefault(method: String) = {
+    if (method == null) CorrelationNames.defaultCorrName else method
+  }
+
+  // Reformat a Matrix into Array[Array[Double]] for serialization
+  private[python] def to2dArray(matrix: Matrix): Array[Array[Double]] = {
+    val values = matrix.toArray.toIterator
+    Array.fill(matrix.numRows)(Array.fill(matrix.numCols)(values.next()))
--- End diff --

`mllib.Matrix` is column-major, but `Array[Array[Double]]` is an array of
rows. I think this is not correct. The error doesn't show up in the Python
tests because corr is symmetric.
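
A sketch of the row-major conversion being asked for here, indexing directly into the column-major `toArray` layout (mirroring the `Array.tabulate` suggestion made further down):

```
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

def toRowArrays(matrix: Matrix): Array[Array[Double]] = {
  val values = matrix.toArray  // column-major storage
  Array.tabulate(matrix.numRows, matrix.numCols) { (i, j) =>
    values(i + j * matrix.numRows)  // element at row i, column j
  }
}

val m = Matrices.dense(2, 3, Array[Double](0, 1.2, 3, 4.56, 7, 8))
// Columns are (0, 1.2), (3, 4.56), (7, 8), so the rows come out as
// toRowArrays(m) == Array(Array(0.0, 3.0, 7.0), Array(1.2, 4.56, 8.0))
```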




[GitHub] spark pull request: [SQL][SPARK-2212]Hash Outer Join

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1147#issuecomment-50857544
  
QA tests have started for PR 1147. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17663/consoleFull




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1713#discussion_r15684998
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
     ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
   }
 
+  /**
+   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
+   * Returns the correlation matrix serialized into a byte array understood by deserializers in
+   * pyspark.
+   */
+  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
+    val inputMatrix = X.rdd.map(deserializeDoubleVector(_))
+    val result = Statistics.corr(inputMatrix, getCorrNameOrDefault(method))
+    serializeDoubleMatrix(to2dArray(result))
+  }
+
+  /**
+   * Java stub for mllib Statistics.corr(x: RDD[Double], y: RDD[Double], method: String).
+   */
+  def corr(x: JavaRDD[Array[Byte]], y: JavaRDD[Array[Byte]], method: String): Double = {
+    val xDeser = x.rdd.map(deserializeDouble(_))
+    val yDeser = y.rdd.map(deserializeDouble(_))
+    Statistics.corr(xDeser, yDeser, getCorrNameOrDefault(method))
+  }
+
+  // used by the corr methods to retrieve the name of the correlation method passed in via pyspark
+  private def getCorrNameOrDefault(method: String) = {
+    if (method == null) CorrelationNames.defaultCorrName else method
+  }
+
+  // Reformat a Matrix into Array[Array[Double]] for serialization
+  private[python] def to2dArray(matrix: Matrix): Array[Array[Double]] = {
+    val values = matrix.toArray.toIterator
--- End diff --

The iterator is slow. `Array.tabulate(matrix.numRows, matrix.numCols)((i, j) =>
values(i + j * matrix.numRows))` may be faster.




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1713#discussion_r15685050
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/api/python/PythonMLLibAPISuite.scala ---
@@ -59,10 +59,25 @@ class PythonMLLibAPISuite extends FunSuite {
   }
 
   test("double serialization") {
-    for (x <- List(123.0, -10.0, 0.0, Double.MaxValue, Double.MinValue)) {
+    for (x <- List(123.0, -10.0, 0.0, Double.MaxValue, Double.MinValue, Double.NaN)) {
       val bytes = py.serializeDouble(x)
       val deser = py.deserializeDouble(bytes)
-      assert(x === deser)
+      // We use `equals` here for comparison because we cannot use `==` for NaN
+      assert(x.equals(deser))
 }
   }
+
+  test("matrix to 2D array") {
+val values = Array[Double](0, 1.2, 3, 4.56, 7, 8)
+val matrix = Matrices.dense(2, 3, values)
+val twoDarray = py.to2dArray(matrix)
--- End diff --

`twoDArray` or `arr`? `twoDarray` doesn't follow the naming convention.
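
A quick illustration of the `equals`-vs-`==` point in the quoted test:

```
// IEEE 754: NaN is never equal to itself under ==, but the boxed
// java.lang.Double.equals treats two NaNs as equal.
val nan = Double.NaN
println(nan == nan)       // false
println(nan.equals(nan))  // true
```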




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1713#discussion_r15685067
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/api/python/PythonMLLibAPISuite.scala ---
@@ -59,10 +59,25 @@ class PythonMLLibAPISuite extends FunSuite {
   }
 
   test("double serialization") {
-    for (x <- List(123.0, -10.0, 0.0, Double.MaxValue, Double.MinValue)) {
+    for (x <- List(123.0, -10.0, 0.0, Double.MaxValue, Double.MinValue, Double.NaN)) {
       val bytes = py.serializeDouble(x)
       val deser = py.deserializeDouble(bytes)
-      assert(x === deser)
+      // We use `equals` here for comparison because we cannot use `==` for NaN
+      assert(x.equals(deser))
     }
   }
+
+  test("matrix to 2D array") {
+    val values = Array[Double](0, 1.2, 3, 4.56, 7, 8)
+    val matrix = Matrices.dense(2, 3, values)
+    val twoDarray = py.to2dArray(matrix)
+    val expected = Array(Array[Double](0, 1.2, 3), Array[Double](4.56, 7, 8))
--- End diff --

The expected value should be `(0, 3, 7), (1.2, 4.56, 8)`




[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/855#issuecomment-5085
  
QA results for PR 855:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17656/consoleFull




[GitHub] spark pull request: [SPARK-1812] upgrade dependency to scala-loggi...

2014-08-01 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1701#issuecomment-50857755
  
LGTM.





[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1713#issuecomment-50858129
  
QA results for PR 1713:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class Statistics(object):
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17657/consoleFull




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50858199
  
> let us keep the "length" which can be used for non-strings e.g. 
length(12345678) = 8

Non-string length as described above should probably be handled by 
type-coercion with `CAST`s and not by making the expression more general.  
Otherwise we will have to duplicate all of the logic that is used to turn 
arbitrary types of numbers into strings.





[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50858363
  
Also, `CharLength` for the `Expression` seems less ambiguous.  Though we 
can certainly alias to `length` in the parser for compatibility.




[GitHub] spark pull request: SPARK-2740: allow user to specify ascending an...

2014-08-01 Thread lirui-intel
Github user lirui-intel commented on the pull request:

https://github.com/apache/spark/pull/1645#issuecomment-50858418
  
Thanks @JoshRosen :)




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50858453
  
@marmbrus   OK fine with that.  

Then, given the inputs from ueshin, we are presently at:

length/char_length: takes a single string argument and uses codePointCount

octet_length: takes two arguments (string, charset) and returns the number of bytes
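
A small illustration of the difference between these semantics (the string literal is just an example):

```
val s = "a\uD83D\uDE00"                  // "a" plus one emoji (a surrogate pair)
println(s.length)                        // 3: UTF-16 code units
println(s.codePointCount(0, s.length))   // 2: char_length-style code points
println(s.getBytes("UTF-8").length)      // 5: octet_length-style bytes in UTF-8
```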





[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15685394
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -320,6 +323,26 @@ private[spark] class TaskSchedulerImpl(
     }
   }
 
+  /**
+   * Update metrics for in-progress tasks and let the master know that the BlockManager is still
+   * alive. Return true if the driver knows about the given block manager. Otherwise, return false,
+   * indicating that the block manager should re-register.
+   */
+  override def executorHeartbeatReceived(
+      execId: String,
+      taskMetrics: Array[(Long, TaskMetrics)], // taskId -> TaskMetrics
+      blockManagerId: BlockManagerId): Boolean = {
+    val metricsWithStageIds = taskMetrics.flatMap {
+      case (id, metrics) => {
+        taskIdToTaskSetId.get(id)
--- End diff --

As per our discussion, this is using options everywhere, so it's good. 
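
A minimal illustration of that Option-based pattern with toy data (not the PR's code): `Map#get` returns an `Option`, and `flatMap` drops missing entries instead of throwing:

```
val taskIdToTaskSetId = Map(1L -> "taskset-a", 2L -> "taskset-b")
val heartbeatTaskIds = Seq(1L, 99L)  // 99 is unknown to the driver
val known = heartbeatTaskIds.flatMap(taskIdToTaskSetId.get)
// known == Seq("taskset-a"); the unknown task id is silently skipped
```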




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50858624
  
@marmbrus RE: CharLength for the expression - also fine, will do. (BTW, how
did you highlight in the comment?)




[GitHub] spark pull request: [SPARK-1812] core - upgrade to json4s-jackson ...

2014-08-01 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1702#issuecomment-50858704
  
test this please.




[GitHub] spark pull request: [SPARK-1812] core - upgrade to json4s-jackson ...

2014-08-01 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1702#issuecomment-50858698
  
LGTM, if jenkins build succeeds.




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50858955
  
Oops, I had forgotten that Hive's `Length` can handle the binary type.
It would be better to use `Length` instead of `CharLength` and make it 
handle the binary type.
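A hedged sketch of the dispatch being described (plain Scala, not the actual 
Catalyst expression): character count for strings, byte count for binary values.

```scala
// Assumed behavior for a unified Length: strings count characters,
// binary values count bytes, and SQL NULL propagates as None.
def length(value: Any): Option[Int] = value match {
  case null           => None
  case s: String      => Some(s.length)   // character count
  case b: Array[Byte] => Some(b.length)   // byte count for BinaryType
  case other          => sys.error(s"unsupported type: $other")
}
```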




[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...

2014-08-01 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50859069
  
@marmbrus @mateiz @JoshRosen Could you take another look at this? I have 
removed schema-as-string, and added Row as the suggested type for inferring 
the schema.




[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50859282
  
QA results for PR 1679:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class StorageStatus(val blockManagerId: BlockManagerId, val maxMem: Long) {
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17659/consoleFull




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50859290
  
QA results for PR 940:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17660/consoleFull




[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50859287
  
QA results for PR 1418:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17661/consoleFull




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50859536
  
Yeah, I think it's fine to put that in another patch and just make it 
something a bit more on the conservative side (10 seconds) for now.
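A sketch of the conservative setting being discussed; the property name below 
is an assumption about what this PR introduces, with the interval in milliseconds.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // assumed key; the PR defines the actual property and its default
  .set("spark.executor.heartbeatInterval", "10000") // ~10 seconds
```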




[GitHub] spark pull request: [SPARK-2700] [SQL] Hidden files (such as .impa...

2014-08-01 Thread chutium
Github user chutium commented on the pull request:

https://github.com/apache/spark/pull/1691#issuecomment-50859555
  
done, thanks




[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50859737
  
QA tests have started for PR 1290. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17665/consoleFull




[GitHub] spark pull request: [SPARK-983] Support external sorting

2014-08-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1090




[GitHub] spark pull request: Fix JIRA-983 and support exteranl sort for sor...

2014-08-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/931




[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...

2014-08-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1677




[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...

2014-08-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1464




[GitHub] spark pull request: SPARK-2134: Report metrics before application ...

2014-08-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1076




[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-08-01 Thread cfregly
Github user cfregly commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15685837
  
--- Diff: 
extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessorUtils.scala
 ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kinesis
+
+import scala.util.Random
+
+import org.apache.spark.Logging
+
+import 
com.amazonaws.services.kinesis.clientlibrary.exceptions.InvalidStateException
+import 
com.amazonaws.services.kinesis.clientlibrary.exceptions.KinesisClientLibDependencyException
+import 
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException
+import 
com.amazonaws.services.kinesis.clientlibrary.exceptions.ThrottlingException
+
+
+/**
+ * Helper for the KinesisRecordProcessor.
+ */
+private[kinesis] object KinesisRecordProcessorUtils extends Logging {
--- End diff --

added the companion object




[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-08-01 Thread cfregly
Github user cfregly commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15685828
  
--- Diff: 
extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisStringRecordSerializer.scala
 ---
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kinesis
+
+import org.apache.spark.Logging
+
+/**
+ * Implementation of KinesisRecordSerializer to convert Array[Byte] 
to/from String.
+ */
+class KinesisStringRecordSerializer extends 
KinesisRecordSerializer[String] with Logging {
--- End diff --

I removed the Serializer abstraction and am just using basic byte[] <-> 
String conversions.
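The "basic byte[] <-> String conversions" being described look roughly like 
this, pinned to UTF-8 here as an assumption (the PR may use a different charset):

```scala
import java.nio.charset.StandardCharsets

val bytes: Array[Byte] = "hello kinesis".getBytes(StandardCharsets.UTF_8)
val text: String = new String(bytes, StandardCharsets.UTF_8)
```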




[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-08-01 Thread cfregly
Github user cfregly commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15685855
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCount.scala
 ---
@@ -0,0 +1,369 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.streaming
+
+import java.nio.ByteBuffer
+
+import scala.util.Random
+
+import org.apache.log4j.Level
+import org.apache.log4j.Logger
+import org.apache.spark.Logging
+import org.apache.spark.SparkConf
+import org.apache.spark.SparkContext.rddToOrderedRDDFunctions
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.streaming.Milliseconds
+import org.apache.spark.streaming.StreamingContext
+import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
+import org.apache.spark.streaming.dstream.DStream
+import org.apache.spark.streaming.kinesis.KinesisStringRecordSerializer
+import org.apache.spark.streaming.kinesis.KinesisUtils
+
+import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
+import com.amazonaws.services.kinesis.AmazonKinesisClient
+import 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
+import com.amazonaws.services.kinesis.model.PutRecordRequest
+
+/**
+ * Kinesis Spark Streaming WordCount example.
+ *
+ * See 
http://spark.apache.org/docs/latest/streaming-programming-guide.html for more 
details on
+ *   the Kinesis Spark Streaming integration.
+ *
+ * This example spins up 1 Kinesis Worker (Spark Streaming Receivers) per 
shard of the
+ *   given stream.
+ * It then starts pulling from the last checkpointed sequence number of 
the given <stream-name>
+ *and <endpoint-url>.
+ *
+ * Valid endpoint urls:  
http://docs.aws.amazon.com/general/latest/gr/rande.html#ak_region
+ * 
+ * This code uses the DefaultAWSCredentialsProviderChain and searches for 
credentials
+ *   in the following order of precedence:
+ * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
+ * Java System Properties - aws.accessKeyId and aws.secretKey
+ * Credential profiles file - default location (~/.aws/credentials) shared 
by all AWS SDKs
+ * Instance profile credentials - delivered through the Amazon EC2 
metadata service
+ *
+ * Usage: KinesisWordCount <stream-name> <endpoint-url>
+ *<stream-name> is the name of the Kinesis stream (ie. mySparkStream)
+ *<endpoint-url> is the endpoint of the Kinesis service
+ * (ie. https://kinesis.us-east-1.amazonaws.com)
+ *
+ * Example:
+ *$ export AWS_ACCESS_KEY_ID=<your-access-key-id>
+ *$ export AWS_SECRET_KEY=<your-secret-key>
+ *$ $SPARK_HOME/bin/run-example \
+ *org.apache.spark.examples.streaming.KinesisWordCount 
mySparkStream \
+ *https://kinesis.us-east-1.amazonaws.com
+ *
+ * There is a companion helper class below called KinesisWordCountProducer 
which puts
+ *   dummy data onto the Kinesis stream.
+ * Usage instructions for KinesisWordCountProducer are provided in that 
class definition.
+ */
+object KinesisWordCount extends Logging {
+  val WordSeparator = " "
+
+  def main(args: Array[String]) {
+/**
+ * Check that all required args were passed in.
+ */
+if (args.length < 2) {
+  System.err.println("Usage: KinesisWordCount  
")
+  System.exit(1)
+}
+
+/**
+ * (This was lifted from the StreamingExamples.scala in order to avoid 
the dependency
+ *   on the spark-examples artifact.)
+ * Set reasonable logging levels for streaming if the user has not 
configured log4j.
+ */
+val log4jInitialized = 
Logger.getRootLogger.getAllAppenders.hasMoreElements()
+if (!log4jInitialized) {
+  /** 
+   *  We first log something to initialize Spark's default logging, 
+   *  then we override the logging level. 
+   *  */
+  logInfo("Setting log level to [INFO] for streaming example."

[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-08-01 Thread cfregly
Github user cfregly commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15685887
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCount.scala
 ---
(quotes the same KinesisWordCount.scala diff shown in the previous comment; 
the review comment itself is truncated in the archive)

[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-08-01 Thread cfregly
Github user cfregly commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15685865
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCount.scala
 ---
(quotes the same KinesisWordCount.scala diff shown above; the review comment 
itself is truncated in the archive)

[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-08-01 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50860276
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50860395
  
QA tests have started for PR 940. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17666/consoleFull




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50860878
  
It's too late for me to think straight enough to incorporate Takuya's latest 
comment tonight. I'll revisit tomorrow.




[GitHub] spark pull request: [SPARK-2729] [SQL] Forgot to match Timestamp t...

2014-08-01 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1636#issuecomment-50861117
  
@marmbrus I've discussed this with @chutium offline. Since the 1.1 code freeze 
deadline is near, we can merge this first. I tested this PR locally and it 
looks good. I will submit another PR to add the test case.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50861124
  
QA tests have started for PR 1056. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17667/consoleFull




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50861360
  
QA results for PR 1056:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  case class SparkListenerExecutorMetricsUpdate(
  case class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMaster
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17667/consoleFull




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50861872
  
QA tests have started for PR 1056. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17668/consoleFull




[GitHub] spark pull request: [SQL][SPARK-2212]Hash Outer Join

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1147#issuecomment-50863187
  
QA results for PR 1147:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  case class HashOuterJoin(
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17663/consoleFull




[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50863593
  
QA results for PR 1290:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  abstract class GeneralizedSteepestDescendModel(val weights: Vector )
  trait ANN {
  class LeastSquaresGradientANN(
  class ANNUpdater extends Updater {
  class ParallelANN (
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17665/consoleFull




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50864392
  
QA results for PR 940:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17666/consoleFull




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50865776
  
QA results for PR 1056:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  case class SparkListenerExecutorMetricsUpdate(
  case class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMaster
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17668/consoleFull




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50866851
  
Jenkins, retest this please.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50867006
  
QA tests have started for PR 1056. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17669/consoleFull




[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...

2014-08-01 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1369#issuecomment-50867057
  
@witgo Some of the failures were just Jenkins acting up again (what could 
suddenly be behind the "Address already in use" errors for all of these 
tests?), but there is a MIMA failure at the end as well. If the API changes are 
OK, they will have to be excluded from the MIMA check to pass.
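A sketch of the kind of exclusion entry that silences an intentional API 
change, following the conventions of Spark's project/MimaExcludes.scala (the 
problem class and signature below are placeholders, not the entries this PR 
actually needs):

```scala
import com.typesafe.tools.mima.core._

// Each intentional API break gets an explicit filter so MIMA stops
// flagging it; the signature string names the changed member.
val excludes = Seq(
  ProblemFilters.exclude[MissingMethodProblem](
    "org.apache.spark.Logging.log") // placeholder signature
)
```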




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-08-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1713#discussion_r15689358
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
 ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
   }
 
+  /**
+   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
+   * Returns the correlation matrix serialized into a byte array 
understood by deserializers in
+   * pyspark.
+   */
+  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
--- End diff --

You can ignore this if you like, but if I'm being picky, why not spell out 
"correlations"?




[GitHub] spark pull request: [SPARK-1812][wip]

2014-08-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/996#discussion_r15689839
  
--- Diff: assembly/pom.xml ---
@@ -26,7 +26,7 @@
   
 
   org.apache.spark
-  spark-assembly_2.10
+  spark-assembly_${scala.binary.version}
--- End diff --

Yeah, you can't write that in the artifact name. It should be a constant 
string and should stay "2.10". In a PR to upgrade to 2.11 later, all of these 
should be "2.11" ... which means 2.10 support may have to be a branch at that 
point? Or it may be safe to ignore the warning, if it works.




[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...

2014-08-01 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/1508#issuecomment-50870159
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-1812] streaming - fix parameters to Str...

2014-08-01 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1708#issuecomment-50870258
  
It's not an existing issue in that the test compiles and passes in Scala 
2.10. 

Is this the difference in Scala 2.11? 
https://issues.scala-lang.org/browse/SI-8157
Do I read it right that it was never supposed to be allowed to have a 
default argument in more than one of several overloaded methods?

If so, then yeah, it sounds like this has to change, although it ends up 
changing the API. (And honestly, it's probably better to have fewer rather 
than more default arguments.)

In this particular case, there are two constructors with a default 
argument. I suppose the question is which one to keep and what to think about 
the API change?
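A minimal illustration of the pattern at issue (hypothetical class, not the 
real StreamingContext): once two overloads both declare defaults, the 
post-SI-8157 checks reject the second one, so the default can survive on only 
one overload.

```scala
class Context(val master: String, val batchSeconds: Int = 1) {
  // def this(path: String, conf: Map[String, String] = Map.empty) = ...
  //   ^ a second default here draws "multiple overloaded alternatives
  //     of constructor Context define default arguments" in 2.11
  def this(path: String, conf: Map[String, String]) = this(path, 1)
}
```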


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-08-01 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/855#issuecomment-50870292
  
This is definitely better. Can you add documentation for this 
property on the configuration page?




[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1508#issuecomment-50870349
  
QA tests have started for PR 1508. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17670/consoleFull




[GitHub] spark pull request: [SPARK-2750]Add Https support for Web UI

2014-08-01 Thread WangTaoTheTonic
GitHub user WangTaoTheTonic opened a pull request:

https://github.com/apache/spark/pull/1714

[SPARK-2750]Add Https support for Web UI

https://issues.apache.org/jira/browse/SPARK-2750

Already tested on a 1-master, 3-worker cluster.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WangTaoTheTonic/spark addHttpsSupport

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1714.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1714


commit f67565b11ee664ea56e5f04f75c7209bc4913450
Author: WangTaoTheTonic 
Date:   2014-08-01T10:44:42Z

Add Https support for Web UI






[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-08-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/855#discussion_r15690169
  
--- Diff: core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala ---
@@ -206,6 +206,33 @@ class ContextCleanerSuite extends 
ContextCleanerSuiteBase {
 postGCTester.assertCleanup()
   }
 
+  test("automatically cleanup checkpoint data") {
+val conf=new 
SparkConf().setMaster("local[2]").setAppName("cleanupCheckpointData").
+  set("spark.cleaner.checkpointData.enabled","true")
+sc =new SparkContext(conf)
--- End diff --

space missing




[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...

2014-08-01 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1508#issuecomment-50870356
  
A bunch of tests have been failing spuriously with "java.net.BindException: 
Address already in use". It's not the PR. I wonder what recent change could 
have made this happen? Is something being stricter about assigning a fixed 
port? Did a config change let multiple tests run on the same virtual machine?
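For reference, one common way test suites dodge this error is to bind port 0 
so the OS picks a free ephemeral port (illustrative only; not necessarily the 
fix Spark adopted):

```scala
import java.net.ServerSocket

val socket = new ServerSocket(0)     // 0 = let the OS choose a free port
val freePort = socket.getLocalPort   // record it before releasing the socket
socket.close()
```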




[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...

2014-08-01 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/1508#issuecomment-50870434
  
Also, a number of them have been failing for spurious Python MLlib issues. 
At least, that was the case yesterday.




[GitHub] spark pull request: [SPARK-1812] upgrade to scala-maven-plugin 3.2...

2014-08-01 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1711#issuecomment-50870472
  
PS: I am all for making any changes like this that enable 2.11 but are 
still compatible with 2.10. Looks safe vs. 3.1.6; I assume it builds? 
http://davidb.github.io/scala-maven-plugin/changes-report.html#a3.2.0




[GitHub] spark pull request: [SPARK-2750]Add Https support for Web UI

2014-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1714#issuecomment-50870581
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-08-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/855#discussion_r15690254
  
--- Diff: core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala ---
@@ -206,6 +206,33 @@ class ContextCleanerSuite extends 
ContextCleanerSuiteBase {
 postGCTester.assertCleanup()
   }
 
+  test("automatically cleanup checkpoint data") {
+val conf=new 
SparkConf().setMaster("local[2]").setAppName("cleanupCheckpointData").
+  set("spark.cleaner.checkpointData.enabled","true")
+sc =new SparkContext(conf)
+val checkpointDir = java.io.File.createTempFile("temp", "")
+checkpointDir.deleteOnExit()
+checkpointDir.delete()
+var rdd = newPairRDD
+sc.setCheckpointDir(checkpointDir.toString)
+rdd.checkpoint()
+rdd.cache()
+rdd.collect()
+val rddId = rdd.id
+RDDCheckpointData.rddCheckpointDataPath(sc, rddId).foreach { path =>
--- End diff --

nit: Would be cool if you separated the pre- and post-GC sections with a line 
of comments, like the other unit tests in this suite, to keep things consistent.




[GitHub] spark pull request: [SPARK-2678][Core] Added "--" to prevent spark...

2014-08-01 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/1715

[SPARK-2678][Core] Added "--" to prevent spark-submit from shadowing 
application options

JIRA issue: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

This PR aims to fix SPARK-2678 in a backward-compatible way, and replaces 
PR #1699.

All arguments that follow a `--` are passed to the user application. Before 
this change, `spark-submit` shadows application options:

```bash
# The "--help" option is captured by spark-submit
bin/spark-submit --class Foo userapp.jar --help
```

With the help of `--`, we can pass arbitrary options to the user application:

```bash
# The "--help" option is now passed to userapp.jar
bin/spark-submit --class Foo userapp.jar -- --help
```
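A minimal sketch of the splitting rule described above (illustrative only, not 
the actual SparkSubmit parser): arguments before the first `--` belong to 
spark-submit, and everything after it is handed to the application untouched.

```scala
val args = Array("--class", "Foo", "userapp.jar", "--", "--help")
val (submitArgs, rest) = args.span(_ != "--")
val appArgs = rest.drop(1) // drop the "--" marker itself
// submitArgs == Array("--class", "Foo", "userapp.jar"); appArgs == Array("--help")
```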

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark spark-2678

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1715.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1715


commit 27741380631e4e293d040d8fc206343814467b37
Author: Cheng Lian 
Date:   2014-08-01T10:27:41Z

Added "--" to prevent spark-submit from shadowing application options






[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-08-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/855#discussion_r15690344
  
--- Diff: core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala ---
@@ -206,6 +206,33 @@ class ContextCleanerSuite extends 
ContextCleanerSuiteBase {
 postGCTester.assertCleanup()
   }
 
+  test("automatically cleanup checkpoint data") {
+val conf=new 
SparkConf().setMaster("local[2]").setAppName("cleanupCheckpointData").
+  set("spark.cleaner.checkpointData.enabled","true")
--- End diff --

I think the name should have referenceTracking in it as this is a 
sub-configuration applicable only when referenceTracking is enabled. So it 
probably should be something like 
"spark.cleaner.referenceTracking.cleanCheckpoints"
@mridulm @pwendell What do you think?
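If the proposed name is adopted, enabling the behavior would look like this 
(property name taken from the suggestion above, not from released documentation):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
```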




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50870753
  
QA results for PR 1056:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  case class SparkListenerExecutorMetricsUpdate(
  case class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMaster
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17669/consoleFull




[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-08-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15690371
  
--- Diff: 
extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisStringRecordSerializer.scala
 ---
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kinesis
+
+import org.apache.spark.Logging
+
+/**
+ * Implementation of KinesisRecordSerializer to convert Array[Byte] 
to/from String.
+ */
+class KinesisStringRecordSerializer extends 
KinesisRecordSerializer[String] with Logging {
--- End diff --

Cool!



