[GitHub] spark pull request: [SPARK-9437][core] avoid overflow in SizeEstim...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7750#issuecomment-126308741
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9437][core] avoid overflow in SizeEstim...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7750#issuecomment-126308704
  
  [Test build #162 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/162/console) for PR 7750 at commit [`29493f1`](https://github.com/apache/spark/commit/29493f12720dd9f02e8f199046f98f7a548756ea).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9149][ML][Examples] Add an example of s...

2015-07-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7697#discussion_r35861720
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import java.util.regex.Pattern;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.ml.clustering.KMeansModel;
+import org.apache.spark.ml.clustering.KMeans;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.VectorUDT;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.catalyst.expressions.GenericRow;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+
+/**
+ * An example demonstrating a k-means clustering.
+ * Run with
+ * <pre>
+ * bin/run-example ml.JavaSimpleParamsExample <file> <k>
+ * </pre>
+ */
+public class JavaKMeansExample {
+
+  private static class ParsePoint implements Function<String, Row> {
+    final private static Pattern separator = Pattern.compile(" ");
--- End diff --

This is picking nits, and something we can fix on merge, but the normal 
order of modifiers is `private static final ...`





[GitHub] spark pull request: [SPARK-9104][CORE][WIP] expose Netty network l...

2015-07-30 Thread squito
Github user squito commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-126283119
  
@jerryshao I'm not entirely sure I know what you mean by:

| A simple question, is it enough to only expose the maximum memory usage 
of Netty layer?

can you elaborate?  Obviously we'd always like more metrics, but are you 
saying this isn't useful?





[GitHub] spark pull request: [SPARK-9149][ML][Examples] Add an example of s...

2015-07-30 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/7697#issuecomment-126282672
  
I think this is pretty fine, minus one thing I can fix on merge. Any more 
comments?





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread dragos
Github user dragos commented on a diff in the pull request:

https://github.com/apache/spark/pull/7648#discussion_r35862416
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/rate/PIDRateEstimator.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.scheduler.rate
+
+/**
+ * Implements a proportional-integral-derivative (PID) controller which acts on
+ * the speed of ingestion of elements into Spark Streaming. A PID controller works
+ * by calculating an '''error''' between a measured output and a desired value. In the
+ * case of Spark Streaming the error is the difference between the measured processing
+ * rate (number of elements/processing delay) and the previous rate.
+ *
+ * @see https://en.wikipedia.org/wiki/PID_controller
+ *
+ * @param batchDurationMillis the batch duration, in milliseconds
+ * @param proportional how much the correction should depend on the current
+ *        error. This term usually provides the bulk of correction. A value too large would
+ *        make the controller overshoot the setpoint, while a small value would make the
+ *        controller too insensitive. The default value is -1.
+ * @param integral how much the correction should depend on the accumulation
+ *        of past errors. This term accelerates the movement towards the setpoint, but a large
+ *        value may lead to overshooting. The default value is -0.2.
+ * @param derivative how much the correction should depend on a prediction
+ *        of future errors, based on current rate of change. This term is not used very often,
+ *        as it impacts stability of the system. The default value is 0.
+ */
+private[streaming] class PIDRateEstimator(
+    batchIntervalMillis: Long,
+    proportional: Double = -1D,
+    integral: Double = -.2D,
+    derivative: Double = 0D)
+  extends RateEstimator {
+
+  private var firstRun: Boolean = true
+  private var latestTime: Long = -1L
+  private var latestRate: Double = -1D
+  private var latestError: Double = -1L
+
+  require(
+    batchIntervalMillis > 0,
+    s"Specified batch interval $batchIntervalMillis in PIDRateEstimator is invalid.")
+
+  def compute(time: Long, // in milliseconds
+      elements: Long,
+      processingDelay: Long, // in milliseconds
+      schedulingDelay: Long // in milliseconds
+    ): Option[Double] = {
+
+    this.synchronized {
+      if (time > latestTime && processingDelay > 0 && batchIntervalMillis > 0) {
+
+        // in seconds, should be close to batchDuration
+        val delaySinceUpdate = (time - latestTime).toDouble / 1000
+
+        // in elements/second
+        val processingRate = elements.toDouble / processingDelay * 1000
+
+        // in elements/second
+        val error = latestRate - processingRate
--- End diff --

Here I'd prefer to keep this as `error`, as I think most people reading 
this code would have more trouble mapping things to PID terminology than to 
Spark Streaming terminology, and all PID docs will mention error and correction.





[GitHub] spark pull request: [SPARK-9308] [ML] ml.NaiveBayesModel support p...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7672#issuecomment-126302698
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8564][Streaming]Add the Python API for ...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6955#issuecomment-126310637
  
  [Test build #39038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39038/console) for PR 6955 at commit [`455f7ea`](https://github.com/apache/spark/commit/455f7ea47cd6bca3047b8023bab8ff0ed944c13e).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public static final class FloatPrefixComparator extends PrefixComparator`
  * `class KinesisUtils(object):`
  * `class InitialPositionInStream(object):`
  * `case class UnsafeExternalSort(`






[GitHub] spark pull request: [SPARK-8978][Streaming] Implements the DirectK...

2015-07-30 Thread dragos
GitHub user dragos opened a pull request:

https://github.com/apache/spark/pull/7796

[SPARK-8978][Streaming] Implements the DirectKafkaController



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/typesafehub/spark topic/streaming-bp/kafka-direct

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7796.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7796


commit f788b9b0dde3981982a118ec3d3bed42b89843f0
Author: François Garillot franc...@garillot.net
Date:   2015-07-14T14:53:03Z

[SPARK-8978][Streaming] Implements the DirectKafkaController







[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread dragos
Github user dragos commented on a diff in the pull request:

https://github.com/apache/spark/pull/7648#discussion_r35862291
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/rate/PIDRateEstimator.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.scheduler.rate
+
+/**
+ * Implements a proportional-integral-derivative (PID) controller which acts on
+ * the speed of ingestion of elements into Spark Streaming. A PID controller works
+ * by calculating an '''error''' between a measured output and a desired value. In the
+ * case of Spark Streaming the error is the difference between the measured processing
+ * rate (number of elements/processing delay) and the previous rate.
+ *
+ * @see https://en.wikipedia.org/wiki/PID_controller
+ *
+ * @param batchDurationMillis the batch duration, in milliseconds
+ * @param proportional how much the correction should depend on the current
+ *        error. This term usually provides the bulk of correction. A value too large would
+ *        make the controller overshoot the setpoint, while a small value would make the
+ *        controller too insensitive. The default value is -1.
+ * @param integral how much the correction should depend on the accumulation
+ *        of past errors. This term accelerates the movement towards the setpoint, but a large
+ *        value may lead to overshooting. The default value is -0.2.
+ * @param derivative how much the correction should depend on a prediction
+ *        of future errors, based on current rate of change. This term is not used very often,
+ *        as it impacts stability of the system. The default value is 0.
+ */
+private[streaming] class PIDRateEstimator(
+    batchIntervalMillis: Long,
+    proportional: Double = -1D,
+    integral: Double = -.2D,
+    derivative: Double = 0D)
+  extends RateEstimator {
+
+  private var firstRun: Boolean = true
+  private var latestTime: Long = -1L
+  private var latestRate: Double = -1D
+  private var latestError: Double = -1L
+
+  require(
+    batchIntervalMillis > 0,
+    s"Specified batch interval $batchIntervalMillis in PIDRateEstimator is invalid.")
+
+  def compute(time: Long, // in milliseconds
+      elements: Long,
+      processingDelay: Long, // in milliseconds
+      schedulingDelay: Long // in milliseconds
+    ): Option[Double] = {
+
+    this.synchronized {
+      if (time > latestTime && processingDelay > 0 && batchIntervalMillis > 0) {
+
+        // in seconds, should be close to batchDuration
+        val delaySinceUpdate = (time - latestTime).toDouble / 1000
+
+        // in elements/second
+        val processingRate = elements.toDouble / processingDelay * 1000
+
+        // in elements/second
+        val error = latestRate - processingRate
+
+        // in elements/second
+        val sumError = schedulingDelay.toDouble * processingRate / batchIntervalMillis
--- End diff --

Carrying over the conversation from the previous thread, which got lost due to the rebase.

It's hard to understand what sumError means in terms of the rates. Can you 
write down the physical interpretation of this sumError, and rename it 
accordingly? cc @huitseeker

tdas added a note 14 hours ago:
So I am trying to understand this. 
(scheduling delay / batch interval) is approximately the number of batches the 
system is delayed. Let's call it numDelayedBatches.
Now you are multiplying numDelayedBatches by processingSpeed, so you are 
scaling the current processing rate by the number of batches that are delayed. 
Right?
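
The interpretation in the note above can be checked with a small worked example. This is only a Python sketch of the Scala expression for `sumError` in the diff; the function name `sum_error` and the sample numbers are assumptions made for illustration and are not part of the PR.

```python
def sum_error(scheduling_delay_ms, processing_rate, batch_interval_ms):
    """Mirror of the diff's sumError: delayed batches scaled by the processing rate.

    scheduling_delay_ms / batch_interval_ms is roughly the number of batches
    the system is behind (numDelayedBatches in the discussion).
    """
    num_delayed_batches = float(scheduling_delay_ms) / batch_interval_ms
    return num_delayed_batches * processing_rate


# Hypothetical numbers: 2 s batches, 3 s of scheduling delay, 1000 elements/s.
# The system is 1.5 batches behind, so the term is 1.5 * 1000.
print(sum_error(3000, 1000.0, 2000))  # prints 1500.0
```

With no scheduling delay the term is zero, which matches reading it as a backlog expressed in elements/second.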



[GitHub] spark pull request: [SPARK-9202] capping maximum number of executo...

2015-07-30 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/7714#issuecomment-126315274
  
finally @srowen, @JoshRosen, @sarutak more comments?





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread dragos
Github user dragos commented on a diff in the pull request:

https://github.com/apache/spark/pull/7648#discussion_r35862261
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/rate/PIDRateEstimator.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.scheduler.rate
+
+/**
+ * Implements a proportional-integral-derivative (PID) controller which acts on
+ * the speed of ingestion of elements into Spark Streaming. A PID controller works
+ * by calculating an '''error''' between a measured output and a desired value. In the
+ * case of Spark Streaming the error is the difference between the measured processing
+ * rate (number of elements/processing delay) and the previous rate.
+ *
+ * @see https://en.wikipedia.org/wiki/PID_controller
+ *
+ * @param batchDurationMillis the batch duration, in milliseconds
+ * @param proportional how much the correction should depend on the current
+ *        error. This term usually provides the bulk of correction. A value too large would
+ *        make the controller overshoot the setpoint, while a small value would make the
+ *        controller too insensitive. The default value is -1.
+ * @param integral how much the correction should depend on the accumulation
+ *        of past errors. This term accelerates the movement towards the setpoint, but a large
+ *        value may lead to overshooting. The default value is -0.2.
+ * @param derivative how much the correction should depend on a prediction
+ *        of future errors, based on current rate of change. This term is not used very often,
+ *        as it impacts stability of the system. The default value is 0.
+ */
+private[streaming] class PIDRateEstimator(
+    batchIntervalMillis: Long,
+    proportional: Double = -1D,
+    integral: Double = -.2D,
+    derivative: Double = 0D)
+  extends RateEstimator {
+
+  private var firstRun: Boolean = true
+  private var latestTime: Long = -1L
+  private var latestRate: Double = -1D
+  private var latestError: Double = -1L
+
+  require(
+    batchIntervalMillis > 0,
+    s"Specified batch interval $batchIntervalMillis in PIDRateEstimator is invalid.")
+
+  def compute(time: Long, // in milliseconds
+      elements: Long,
+      processingDelay: Long, // in milliseconds
+      schedulingDelay: Long // in milliseconds
+    ): Option[Double] = {
+
+    this.synchronized {
+      if (time > latestTime && processingDelay > 0 && batchIntervalMillis > 0) {
+
+        // in seconds, should be close to batchDuration
+        val delaySinceUpdate = (time - latestTime).toDouble / 1000
+
+        // in elements/second
+        val processingRate = elements.toDouble / processingDelay * 1000
+
+        // in elements/second
+        val error = latestRate - processingRate
--- End diff --

Carrying over the conversation from the previous thread, which got lost due to the rebase.

Could you make the names more semantically meaningful? How about: error 
--> changeInRate?

tdas added a note 14 hours ago:
Why is latestRate considered the set point (that's my assumption, 
since according to PID theory the error is calculated between the observed 
value and the set point)? @huitseeker

dragos added a note 2 hours ago:
Since @huitseeker seems to be away, I'll answer this.

The latestRate is what we considered the desired value at the previous 
batch update. With the new information we got for the last batch interval, we 
compute a current rate and compare it to what we asked for; that difference 
constitutes our error that needs correction.
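
As a rough illustration of the explanation above, the Python sketch below treats the rate requested at the previous update as the set point and the measured processing rate as the observed value. It mirrors the two Scala expressions visible in the diff (`processingRate` and `error`); the function names and sample numbers are hypothetical.

```python
def processing_rate(elements, processing_delay_ms):
    """Measured rate in elements/second, as in the diff:
    elements.toDouble / processingDelay * 1000."""
    return float(elements) / processing_delay_ms * 1000


def rate_error(latest_rate, elements, processing_delay_ms):
    """Error = rate asked for at the previous update (set point)
    minus the rate actually observed for the last batch."""
    return latest_rate - processing_rate(elements, processing_delay_ms)


# Hypothetical batch: we asked for 1000 elements/s, but 1500 elements
# took 2000 ms to process, i.e. 750 elements/s were actually achieved.
print(rate_error(1000.0, 1500, 2000))  # prints 250.0
```

A positive value here means the system processed fewer elements per second than was requested at the previous update.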



[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7648#issuecomment-126284928
  
 Merged build triggered.





[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126292059
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-9248][SparkR] Closing curly-braces shou...

2015-07-30 Thread yu-iskw
GitHub user yu-iskw opened a pull request:

https://github.com/apache/spark/pull/7795

[SPARK-9248][SparkR] Closing curly-braces should always be on their own line

### JIRA
[[SPARK-9248] Closing curly-braces should always be on their own line - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9248)

## The result of `dev/lint-r`
[The result of `dev/lint-r` for SPARK-9248 at the revision:6175d6cfe795fbd88e3ee713fac375038a3993a8](https://gist.github.com/yu-iskw/96cadcea4ce664c41f81)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yu-iskw/spark SPARK-9248

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7795.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7795


commit c8eccd3ce0c11ee1b8df36b666017c7bbfbf811f
Author: Yuu ISHIKAWA yuu.ishik...@gmail.com
Date:   2015-07-30T11:32:01Z

[SPARK-9248][SparkR] Closing curly-braces should always be on their own line







[GitHub] spark pull request: [SPARK-6485] [MLlib] [Python] Add CoordinateMa...

2015-07-30 Thread dusenberrymw
Github user dusenberrymw commented on a diff in the pull request:

https://github.com/apache/spark/pull/7554#discussion_r35873295
  
--- Diff: python/pyspark/mllib/linalg.py ---
@@ -1152,9 +1156,416 @@ def sparse(numRows, numCols, colPtrs, rowIndices, values):
         return SparseMatrix(numRows, numCols, colPtrs, rowIndices, values)
 
 
+class DistributedMatrix(object):
+    """
+    Represents a distributively stored matrix backed by one or
+    more RDDs.
+
+    """
+    def numRows(self):
+        """Get or compute the number of rows."""
+        raise NotImplementedError
+
+    def numCols(self):
+        """Get or compute the number of cols."""
+        raise NotImplementedError
+
+
+class RowMatrix(DistributedMatrix):
+    """
+    .. note:: Experimental
+
+    Represents a row-oriented distributed Matrix with no meaningful
+    row indices.
+
+    :param rows: An RDD of vectors.
+    :param numRows: Number of rows in the matrix. A non-positive
+                    value means unknown, at which point the number
+                    of rows will be determined by the number of
+                    records in the `rows` RDD.
+    :param numCols: Number of columns in the matrix. A non-positive
+                    value means unknown, at which point the number
+                    of columns will be determined by the size of
+                    the first row.
+    """
+    def __init__(self, rows, numRows=0, numCols=0):
+        """Create a wrapper over a Java RowMatrix."""
+        if not isinstance(rows, RDD):
+            raise TypeError("rows should be an RDD of vectors, got %s" % type(rows))

Yeah the argument doesn't have to be an RDD of actual `Vector` objects, but 
it should still be an RDD of _vectors_, which could be NumPy arrays, Python 
lists, `Vector`s, etc. for PySpark. The Spark MLlib Data Types guide makes this 
distinction for the end-user, so I think it is helpful to use it in the error 
message as well. 
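
To make the point about the error message concrete, here is a small sketch of the check that runs without Spark. `FakeRDD` is a hypothetical stand-in for `pyspark.RDD` and `check_rows` is an illustrative helper, not a function from the PR; only the message text is taken from the diff.

```python
class FakeRDD(object):
    """Hypothetical stand-in for pyspark.RDD so the sketch runs without Spark."""

    def __init__(self, items):
        self.items = items


def check_rows(rows, rdd_type=FakeRDD):
    """Reject anything that is not an RDD; the elements themselves may be
    any vector-like value (Python lists, tuples, NumPy arrays, Vectors)."""
    if not isinstance(rows, rdd_type):
        raise TypeError("rows should be an RDD of vectors, got %s" % type(rows))
    return rows


# Accepted: an RDD whose elements are vector-like, not necessarily Vector objects.
check_rows(FakeRDD([[1.0, 2.0], (3.0, 4.0)]))

# Rejected: a plain list is not an RDD, so the message names the offending type.
try:
    check_rows([1.0, 2.0])
except TypeError as e:
    print(e)
```

The check only inspects the container type, which is why the message says "an RDD of vectors" rather than naming the `Vector` class.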





[GitHub] spark pull request: [SPARK-8625] [Core] Propagate user exceptions ...

2015-07-30 Thread squito
Github user squito commented on the pull request:

https://github.com/apache/spark/pull/7014#issuecomment-126283659
  
@aarondav are you OK with this now?  I think tom addressed all your concerns





[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126293359
  
 Build triggered.





[GitHub] spark pull request: [SPARK-8564][Streaming]Add the Python API for ...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6955#issuecomment-126310815
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9214] [ML] [PySpark] support ml.NaiveBa...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7568#issuecomment-126311047
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9202] capping maximum number of executo...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7714#issuecomment-126314864
  
  [Test build #39041 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39041/console)
 for   PR 7714 at commit 
[`23977fb`](https://github.com/apache/spark/commit/23977fb3bc590f58e9d4d44cfcce78ce0a49baca).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9202] capping maximum number of executo...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7714#issuecomment-126314974
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread dragos
Github user dragos commented on a diff in the pull request:

https://github.com/apache/spark/pull/7648#discussion_r35869487
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/rate/PIDRateEstimator.scala
 ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.scheduler.rate
+
+/**
+ * Implements a proportional-integral-derivative (PID) controller which acts on
+ * the speed of ingestion of elements into Spark Streaming. A PID controller works
+ * by calculating an '''error''' between a measured output and a desired value. In the
+ * case of Spark Streaming the error is the difference between the measured processing
+ * rate (number of elements/processing delay) and the previous rate.
+ *
+ * @see https://en.wikipedia.org/wiki/PID_controller
+ *
+ * @param batchDurationMillis the batch duration, in milliseconds
+ * @param proportional how much the correction should depend on the current
+ *        error. This term usually provides the bulk of correction. A value too large would
+ *        make the controller overshoot the setpoint, while a small value would make the
+ *        controller too insensitive. The default value is -1.
+ * @param integral how much the correction should depend on the accumulation
+ *        of past errors. This term accelerates the movement towards the setpoint, but a large
+ *        value may lead to overshooting. The default value is -0.2.
+ * @param derivative how much the correction should depend on a prediction
+ *        of future errors, based on current rate of change. This term is not used very often,
+ *        as it impacts stability of the system. The default value is 0.
+ */
+private[streaming] class PIDRateEstimator(
+    batchIntervalMillis: Long,
+    proportional: Double = -1D,
+    integral: Double = -.2D,
+    derivative: Double = 0D)
+  extends RateEstimator {
+
+  private var firstRun: Boolean = true
+  private var latestTime: Long = -1L
+  private var latestRate: Double = -1D
+  private var latestError: Double = -1L
+
+  require(
+    batchIntervalMillis > 0,
+    s"Specified batch interval $batchIntervalMillis in PIDRateEstimator is invalid.")
+
+  def compute(time: Long, // in milliseconds
+      elements: Long,
+      processingDelay: Long, // in milliseconds
+      schedulingDelay: Long // in milliseconds
+    ): Option[Double] = {
+
+    this.synchronized {
+      if (time > latestTime && processingDelay > 0 && batchIntervalMillis > 0) {
+
+        // in seconds, should be close to batchDuration
+        val delaySinceUpdate = (time - latestTime).toDouble / 1000
+
+        // in elements/second
+        val processingRate = elements.toDouble / processingDelay * 1000
+
+        // in elements/second
+        val error = latestRate - processingRate
+
+        // in elements/second
+        val sumError = schedulingDelay.toDouble * processingRate / batchIntervalMillis
--- End diff --

Here's the gist of it:

- we consider `schedulingDelay` as an indication of accumulated error, 
which corresponds to the integral part in a PID controller. Intuitively it 
makes sense: the fact that there is a delay means we had too many elements in 
previous batches, and the system can't process them in the given batch interval

The challenge is to transform this indication from *time* to a rate, which 
is the quantity that our PID is measuring (and controlling). Here's the 
reasoning:

- a scheduling delay `s` corresponds to `s * processingRate` *overflowing* 
elements. Those are elements that couldn't be processed in previous batches, 
leading to this delay. We assume the `processingRate` didn't change too much 
(it's mostly a measure of cluster performance, with small variations such as 
checkpointing), which should be a good approximation
-  from the number of overflowing elements we can calculate the 
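
The time-to-rate conversion described above can be sketched as follows. This is only an illustration built on the `sumError` line quoted in the diff: the object name `SchedulingDelaySketch`, the helper `sumError`, and its parameter names are hypothetical, and reading the division as "spread the overflowing elements over one batch interval" is an assumption, not something this message confirms.

```scala
object SchedulingDelaySketch {
  /** Accumulated error in elements/second, following the `sumError` line in the diff.
    *
    * Assumed reading: a delay of `schedulingDelayMs` corresponds to
    * schedulingDelayMs / 1000 * processingRate overflowing elements, and
    * dividing by the batch interval in seconds turns that count into a rate.
    * The two factors of 1000 cancel, which leaves the expression below.
    */
  def sumError(schedulingDelayMs: Long, processingRate: Double, batchIntervalMs: Long): Double = {
    require(batchIntervalMs > 0, "batch interval must be positive")
    schedulingDelayMs.toDouble * processingRate / batchIntervalMs
  }
}
```

With a made-up processing rate of 1000 elements/second and a 1000 ms batch interval, a 500 ms scheduling delay would count as 500 elements/second of accumulated error under this reading.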

[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7648#issuecomment-126336586
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

2015-07-30 Thread koeninger
Github user koeninger commented on the pull request:

https://github.com/apache/spark/pull/3543#issuecomment-126342142
  
Added subtasks, changed the title of 
https://github.com/apache/spark/pull/7772 to refer to the streaming subtask 
jira ID.  Let me know if you see anything on that that needs tweaking before 
the 1.5 freeze date





[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126293796
  
 Build triggered.





[GitHub] spark pull request: [SPARK-9248][SparkR] Closing curly-braces shou...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7795#issuecomment-126293792
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7648#issuecomment-126320237
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9104][CORE][WIP] expose Netty network l...

2015-07-30 Thread squito
Github user squito commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-126282507
  
Since we'll eventually want to add more metrics, can you put all the netty 
metrics into another case class inside `ExecutorMetrics`?

Also, I'm wondering if we want to use "netty" in the name -- I think most 
users won't know or care about netty in particular.  Should it just be named 
"network" or "transport"? The nio implementation should then indicate that 
metrics are missing.

I guess altogether this means doing something like:

```scala
class ExecutorMetrics {
  var transportMetrics: Option[TransportMetrics] = ...
}
```
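
A slightly fuller sketch of that shape, for illustration only. The fields of `TransportMetrics` and the `ExecutorMetricsSketch.describe` helper are hypothetical (nothing in this thread fixes them); the point is just that `None` can stand for "this transport reports no metrics", as suggested for the nio implementation.

```scala
// Hypothetical fields: the thread does not say what TransportMetrics holds.
case class TransportMetrics(onHeapBytes: Long, directBytes: Long)

class ExecutorMetrics {
  // None means the transport implementation does not report metrics.
  var transportMetrics: Option[TransportMetrics] = None
}

object ExecutorMetricsSketch {
  // Renders the metrics, making the "missing" case explicit for the reader.
  def describe(m: ExecutorMetrics): String = m.transportMetrics match {
    case Some(t) => s"onHeap=${t.onHeapBytes} direct=${t.directBytes}"
    case None => "transport metrics missing"
  }
}
```

Grouping the values in one case class would let later metrics be added without changing the `ExecutorMetrics` signature again.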





[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4229#issuecomment-126307047
  
  [Test build #39036 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39036/console)
 for   PR 4229 at commit 
[`126608a`](https://github.com/apache/spark/commit/126608a02b55287684762811b0ade99dbce7d109).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class MQTTUtils(object):`






[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4229#issuecomment-126307123
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9437][core] avoid overflow in SizeEstim...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7750#issuecomment-126315107
  
  [Test build #39039 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39039/console)
 for   PR 7750 at commit 
[`29493f1`](https://github.com/apache/spark/commit/29493f12720dd9f02e8f199046f98f7a548756ea).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9437][core] avoid overflow in SizeEstim...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7750#issuecomment-126315194
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8978][Streaming] Implements the DirectK...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7796#issuecomment-126334650
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8978][Streaming] Implements the DirectK...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7796#issuecomment-126363832
  
Merged build started.





[GitHub] spark pull request: [SPARK-9248][SparkR] Closing curly-braces shou...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7795#issuecomment-126363833
  
Merged build started.





[GitHub] spark pull request: [SPARK-8862][SPARK-8862][SQL][WIP] Add basic i...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7774#issuecomment-126363834
  
Merged build started.





[GitHub] spark pull request: [SPARK-9104][CORE][WIP] expose Netty network l...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-126363839
  
Build started.





[GitHub] spark pull request: [SPARK-8998][MLlib] Distribute PrefixSpan comp...

2015-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7783#discussion_r35881077
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -78,81 +97,153 @@ class PrefixSpan private (
* the value of pair is the pattern's count.
*/
   def run(sequences: RDD[Array[Int]]): RDD[(Array[Int], Long)] = {
+    val sc = sequences.sparkContext
+
     if (sequences.getStorageLevel == StorageLevel.NONE) {
       logWarning("Input data is not cached.")
     }
-    val minCount = getMinCount(sequences)
-    val lengthOnePatternsAndCounts =
-      getFreqItemAndCounts(minCount, sequences).collect()
-    val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-      lengthOnePatternsAndCounts.map(_._1), sequences)
-    val groupedProjectedDatabase = prefixAndProjectedDatabase
-      .map(x => (x._1.toSeq, x._2))
-      .groupByKey()
-      .map(x => (x._1.toArray, x._2.toArray))
-    val nextPatterns = getPatternsInLocal(minCount, groupedProjectedDatabase)
-    val lengthOnePatternsAndCountsRdd =
-      sequences.sparkContext.parallelize(
-        lengthOnePatternsAndCounts.map(x => (Array(x._1), x._2)))
-    val allPatterns = lengthOnePatternsAndCountsRdd ++ nextPatterns
-    allPatterns
+
+    // Convert min support to a min number of transactions for this dataset
+    val minCount = if (minSupport == 0) 0L else math.ceil(sequences.count() * minSupport).toLong
+
+    // (Frequent items -> number of occurrences, all items here satisfy the `minSupport` threshold
+    val freqItemCounts = sequences
+      .flatMap(seq => seq.distinct.map(item => (item, 1L)))
+      .reduceByKey(_ + _)
+      .filter(_._2 >= minCount)
+      .collect()
+
+    // Pairs of (length 1 prefix, suffix consisting of frequent items)
+    val itemSuffixPairs = {
+      val freqItems = freqItemCounts.map(_._1).toSet
+      sequences.flatMap { seq =>
+        val filteredSeq = seq.filter(freqItems.contains(_))
+        freqItems.flatMap { item =>
+          val candidateSuffix = LocalPrefixSpan.getSuffix(item, filteredSeq)
+          candidateSuffix match {
+            case suffix if !suffix.isEmpty => Some((List(item), suffix))
+            case _ => None
+          }
+        }
+      }
+    }
+
+    // Accumulator for the computed results to be returned, initialized to the frequent items (i.e.
+    // frequent length-one prefixes)
+    var resultsAccumulator = freqItemCounts.map(x => (List(x._1), x._2))
+
+    // Remaining work to be locally and distributively processed respectfully
+    var (pairsForLocal, pairsForDistributed) = partitionByProjDBSize(itemSuffixPairs)
+
+    // Continue processing until no pairs for distributed processing remain (i.e. all prefixes have
+    // projected database sizes <= `maxLocalProjDBSize`)
+    while (pairsForDistributed.count() != 0) {
+      val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+        extendPrefixes(minCount, pairsForDistributed)
+      pairsForDistributed.unpersist()
+      val (smallerPairsPart, largerPairsPart) = partitionByProjDBSize(nextPrefixSuffixPairs)
+      pairsForDistributed = largerPairsPart
+      pairsForDistributed.persist(StorageLevel.MEMORY_AND_DISK)
+      pairsForLocal ++= smallerPairsPart
+      resultsAccumulator ++= nextPatternAndCounts.collect()
--- End diff --

That is the worst case. We should assume that the number of frequent 
patterns is small. Having 1 billion frequent patterns doesn't provide any 
useful insights. So users should start with a high `minSupport` and collect 
just enough frequent patterns.
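
To make the effect of `minSupport` concrete, here is a small local sketch on plain Scala collections rather than RDDs. The `minCount` conversion copies the line from the diff; the object name `MinSupportSketch`, the `frequentItems` helper, and the sample sequences are made up for illustration.

```scala
object MinSupportSketch {
  // Same conversion as the `minCount` line in the diff.
  def minCount(numSequences: Long, minSupport: Double): Long =
    if (minSupport == 0) 0L else math.ceil(numSequences * minSupport).toLong

  // Local stand-in for the flatMap/reduceByKey/filter chain in the diff:
  // counts each item once per sequence and keeps items that reach the threshold.
  def frequentItems(sequences: Seq[Array[Int]], minSupport: Double): Map[Int, Long] = {
    val threshold = minCount(sequences.size.toLong, minSupport)
    sequences
      .flatMap(_.distinct)
      .groupBy(identity)
      .map { case (item, occurrences) => item -> occurrences.size.toLong }
      .filter { case (_, count) => count >= threshold }
  }
}
```

On the four sequences `(1,2)`, `(1,3)`, `(1,2,2)`, `(4)`, a `minSupport` of 0.5 keeps items 1 and 2, while 0.75 keeps only item 1, so a higher threshold directly shrinks what has to be collected.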





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7648#issuecomment-126366590
  
  [Test build #39047 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39047/consoleFull)
 for   PR 7648 at commit 
[`26cfd78`](https://github.com/apache/spark/commit/26cfd78c339e58e71c138e424952002f13595389).





[GitHub] spark pull request: [SPARK-9214] [ML] [PySpark] support ml.NaiveBa...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7568#issuecomment-126367139
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9277] [MLLIB] SparseVector constructor ...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7794#issuecomment-126367052
  
  [Test build #39046 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39046/console)
 for   PR 7794 at commit 
[`6ffe34a`](https://github.com/apache/spark/commit/6ffe34a560829ac0e1f85b92f958ab394b1dda7a).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9277] [MLLIB] SparseVector constructor ...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7794#issuecomment-126367061
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8862][SPARK-8862][SQL][WIP] Add basic i...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7774#issuecomment-126365441
  
  [Test build #39043 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39043/consoleFull)
 for   PR 7774 at commit 
[`23abf73`](https://github.com/apache/spark/commit/23abf73cafac3af0363486bdae91d737e235a197).





[GitHub] spark pull request: [SPARK-9104][CORE][WIP] expose Netty network l...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-126366504
  
  [Test build #39044 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39044/consoleFull)
 for   PR 7753 at commit 
[`17e5b97`](https://github.com/apache/spark/commit/17e5b978618a5a6adfa3ff621e37eeecaa0b2b0c).





[GitHub] spark pull request: [SPARK-9308] [ML] ml.NaiveBayesModel support p...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7672#issuecomment-126366480
  
  [Test build #39050 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39050/consoleFull)
 for   PR 7672 at commit 
[`3ee56d6`](https://github.com/apache/spark/commit/3ee56d68cc0404f8700641da2cf34c9a79fe2ba4).





[GitHub] spark pull request: [SPARK-9308] [ML] ml.NaiveBayesModel support p...

2015-07-30 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/7672#discussion_r35880303
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
@@ -129,29 +129,49 @@ class NaiveBayesModel private[ml] (
       throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
   }
 
-  override protected def predict(features: Vector): Double = {
+  override val numClasses: Int = pi.size
+
+  private def posteriorProbabilities(logProb: DenseVector) = {
--- End diff --

Yes, `posteriorProbabilities` is easy to reuse, but it is not easy to directly 
reuse `multinomialCalculation` and `bernoulliCalculation`, because 
`mllib.NaiveBayesModel` and `ml.NaiveBayesModel` have different model parameters.
```scala
class NaiveBayesModel private[ml] (
override val uid: String,
val pi: Vector,
val theta: Matrix)
```
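
For context, a helper like `posteriorProbabilities` is easy to share because it only needs the per-class log-probabilities, not the model parameters. The sketch below is an assumption about what such a helper might do (the body is not quoted in this message): it applies the usual subtract-the-maximum normalization on a plain `Array[Double]` instead of a `DenseVector`, and the object name `PosteriorSketch` is hypothetical.

```scala
object PosteriorSketch {
  /** Turns unnormalized log-probabilities into probabilities that sum to 1.
    * The maximum is subtracted before exponentiating so that very negative
    * log values do not all underflow to zero. */
  def posteriorProbabilities(logProb: Array[Double]): Array[Double] = {
    require(logProb.nonEmpty, "need at least one class")
    val maxLog = logProb.max
    val scaled = logProb.map(lp => math.exp(lp - maxLog))
    val total = scaled.sum
    scaled.map(_ / total)
  }
}
```

Two classes with equal log-probabilities of -1000 would come out as 0.5 each here, whereas exponentiating them directly would give 0/0.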





[GitHub] spark pull request: [SPARK-9408] [PySpark] [MLlib] Refactor linalg...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7746#issuecomment-126371744
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9408] [PySpark] [MLlib] Refactor linalg...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7746#issuecomment-126371787
  
Merged build started.





[GitHub] spark pull request: [SPARK-9277] [MLLIB] SparseVector constructor ...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7794#issuecomment-126372904
  
Merged build started.





[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126357332
  
@jkbradley I created a new version of InformationGainStats called 
[ImpurityStats](https://github.com/apache/spark/pull/7694/files#diff-5770a6f8f5b1a8386ec0592a59bd74d2R81).
 It stores the information gain, impurity, and prediction-related data in a single data 
structure, which simplifies LearningNode. It also simplifies and 
optimizes the binsToBestSplit function.
I will fix the remaining trivial issues after your review.





[GitHub] spark pull request: [SPARK-9277] [MLLIB] SparseVector constructor ...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7794#issuecomment-126363830
  
Merged build started.





[GitHub] spark pull request: [SPARK-9104][CORE][WIP] expose Netty network l...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-126363827
  
Build started.





[GitHub] spark pull request: [SPARK-9471] [ML] Multilayer Perceptron

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7621#issuecomment-126363850
  
Merged build started.





[GitHub] spark pull request: [SPARK-8862][SPARK-8862][SQL][WIP] Add basic i...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7774#issuecomment-126363846
  
Merged build started.





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7648#issuecomment-126363835
  
Merged build started.





[GitHub] spark pull request: SPARK-8064, build against Hive 1.2.1

2015-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/7191#issuecomment-126362290
  
Thanks, @steveloughran. I can take a crack at publishing to Maven. Since that 
might take a day or so, one thing you can do is put the forked Hive jars 
in your people.apache.org web space and then add that as a repository to the 
build (a Maven repository is just anything that supports HTTP downloading of 
the jars). In the meantime I can try to get publishing up and running.





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7648#issuecomment-126364248
  
  [Test build #39054 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39054/consoleFull)
 for   PR 7648 at commit 
[`93b74f8`](https://github.com/apache/spark/commit/93b74f884ea17da65297e47ff9a20b53d93225d1).





[GitHub] spark pull request: [SPARK-9308] [ML] ml.NaiveBayesModel support p...

2015-07-30 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/7672#issuecomment-126366729
  
@jkbradley I have replied to your comments inline and updated this patch.





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7648#issuecomment-126366528
  
  [Test build #39052 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39052/consoleFull)
 for   PR 7648 at commit 
[`7975b0c`](https://github.com/apache/spark/commit/7975b0c9703696653563d2b457b4ef071f30bfe9).





[GitHub] spark pull request: [SPARK-9214] [ML] [PySpark] support ml.NaiveBa...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7568#issuecomment-126366493
  
  [Test build #39051 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39051/consoleFull)
 for   PR 7568 at commit 
[`f9c94d1`](https://github.com/apache/spark/commit/f9c94d1015e0e328aa265b86c9b95ec8185f9ba6).





[GitHub] spark pull request: [SPARK-9248][SparkR] Closing curly-braces shou...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7795#issuecomment-126368860
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9248][SparkR] Closing curly-braces shou...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7795#issuecomment-126368754
  
  [Test build #39048 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39048/console)
 for   PR 7795 at commit 
[`c8eccd3`](https://github.com/apache/spark/commit/c8eccd3ce0c11ee1b8df36b666017c7bbfbf811f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Spark-] [MLlib] minor fix on tokenizer doc

2015-07-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7791





[GitHub] spark pull request: [SPARK-9277] [MLLIB] SparseVector constructor ...

2015-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7794#discussion_r35883632
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala ---
@@ -57,6 +57,21 @@ class VectorsSuite extends SparkFunSuite with Logging {
 assert(vec.values === values)
   }
 
+  test("sparse vector construction with mismatched indices/values array") {
+    intercept[IllegalArgumentException] {
+      Vectors.sparse(4, Array(1,2,3), Array(3.0, 5.0, 7.0, 9.0))
--- End diff --

space after `,`
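
For context, here is a minimal sketch of the kind of check the test in this diff exercises. It assumes a stand-in `SparseVec` class (a hypothetical name, not Spark's `SparseVector`) and relies only on the fact that Scala's `require` throws `IllegalArgumentException` when its condition is false.

```scala
object SparseCheckSketch {
  // Hypothetical stand-in for a sparse vector; NOT Spark's SparseVector.
  final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
    // Reject construction when the two parallel arrays disagree in length.
    require(indices.length == values.length,
      s"indices (${indices.length}) and values (${values.length}) must have the same length")
  }

  // Returns true when construction is rejected with IllegalArgumentException.
  def rejects(size: Int, indices: Array[Int], values: Array[Double]): Boolean =
    try {
      SparseVec(size, indices, values)
      false
    } catch {
      case _: IllegalArgumentException => true
    }

  def main(args: Array[String]): Unit = {
    // Three indices but four values, as in the test above.
    println(rejects(4, Array(1, 2, 3), Array(3.0, 5.0, 7.0, 9.0)))
    // Matching lengths are accepted.
    println(rejects(4, Array(1, 2, 3), Array(3.0, 5.0, 7.0)))
  }
}
```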





[GitHub] spark pull request: [SPARK-9408] [PySpark] [MLlib] Refactor linalg...

2015-07-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7746#issuecomment-126371427
  
test this please





[GitHub] spark pull request: [SPARK-9277] [MLLIB] SparseVector constructor ...

2015-07-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7794#discussion_r35884108
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala ---
@@ -57,6 +57,21 @@ class VectorsSuite extends SparkFunSuite with Logging {
 assert(vec.values === values)
   }
 
+  test("sparse vector construction with mismatched indices/values array") {
+    intercept[IllegalArgumentException] {
+      Vectors.sparse(4, Array(1,2,3), Array(3.0, 5.0, 7.0, 9.0))
--- End diff --

Oh duh, fix coming. Every time I think I can't possibly need to run 
scalastyle as well as the test ...





[GitHub] spark pull request: [SPARK-9277] [MLLIB] SparseVector constructor ...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7794#issuecomment-126373365
  
  [Test build #39061 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39061/consoleFull)
 for   PR 7794 at commit 
[`e8dc31e`](https://github.com/apache/spark/commit/e8dc31e899148989027b3a72e47a803a368a9881).





[GitHub] spark pull request: [SPARK-8735] [WIP] [SQL] Expose memory usage f...

2015-07-30 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/7770#issuecomment-126373212
  
The information exposed in this patch will be tied to accumulators on the 
SQL tab introduced in #7774.





[GitHub] spark pull request: [SPARK-6485] [MLlib] [Python] Add CoordinateMa...

2015-07-30 Thread dusenberrymw
Github user dusenberrymw commented on the pull request:

https://github.com/apache/spark/pull/7554#issuecomment-126347627
  
Thanks, @MechCoder!  I say we go ahead and optimize the conversions now 
though while this is still open. I'm thinking that adding an optional 
`java_matrix` parameter to the constructors will be the way to go.  Then, if an 
argument for that is present, we can just store that internally, rather than 
create a new Java matrix. 

cc @mengxr





[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126350949
  
Jenkins, test this please.





[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126351527
  
 Build triggered.





[GitHub] spark pull request: [SPARK-7368][MLlib] Add QR decomposition for R...

2015-07-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/5909#issuecomment-126355388
  
LGTM. Merged into master. Thanks! Sorry for the long delay on code review! 





[GitHub] spark pull request: [SPARK-9308] [ML] ml.NaiveBayesModel support p...

2015-07-30 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/7672#discussion_r35879500
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
@@ -129,29 +129,49 @@ class NaiveBayesModel private[ml] (
       throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
   }
 
-  override protected def predict(features: Vector): Double = {
+  override val numClasses: Int = pi.size
+
+  private def posteriorProbabilities(logProb: DenseVector) = {
+    val logProbArray = logProb.toArray
+    val maxLog = logProbArray.max
+    val scaledProbs = logProbArray.map(lp => math.exp(lp - maxLog))
+    val probSum = scaledProbs.sum
+    new DenseVector(scaledProbs.map(_ / probSum))
+  }
+
+  private def multinomialCalculation(testData: Vector) = {
+    val prob = theta.multiply(testData)
+    BLAS.axpy(1.0, pi, prob)
+    prob
+  }
+
+  private def bernoulliCalculation(testData: Vector) = {
+    testData.foreachActive((_, value) =>
+      if (value != 0.0 && value != 1.0) {
+        throw new SparkException(
+          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $testData.")
+      }
+    )
+    val prob = thetaMinusNegTheta.get.multiply(testData)
+    BLAS.axpy(1.0, pi, prob)
+    BLAS.axpy(1.0, negThetaSum.get, prob)
+    prob
+  }
+
+  override protected def predictRaw(features: Vector): Vector = {
--- End diff --

Agree, done.
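
As a self-contained illustration of what `posteriorProbabilities` in the diff above does, the following sketch applies the same normalization to plain arrays. `PosteriorSketch` and the sample values are made up for illustration and are not part of the patch, which works on `DenseVector`.

```scala
object PosteriorSketch {
  // Turn unnormalized log-probabilities into probabilities that sum to 1.
  // Subtracting the maximum first keeps math.exp from underflowing to 0.0
  // for every class when the log values are very negative.
  def posteriorProbabilities(logProb: Array[Double]): Array[Double] = {
    val maxLog = logProb.max
    val scaledProbs = logProb.map(lp => math.exp(lp - maxLog))
    val probSum = scaledProbs.sum
    scaledProbs.map(_ / probSum)
  }

  def main(args: Array[String]): Unit = {
    // Illustrative values only: exponentiating these directly would give
    // 0.0 for every entry, so a naive normalization would divide 0 by 0.
    val logProb = Array(-1000.0, -1001.0, -1002.0)
    println(posteriorProbabilities(logProb).mkString(", "))
  }
}
```

The shift by `maxLog` cancels out in the final division, so the result is the same distribution that exponentiating and normalizing directly would give whenever that direct computation does not underflow.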





[GitHub] spark pull request: [SPARK-5561] [mllib] Generalized PeriodicCheck...

2015-07-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7728#issuecomment-126359650
  
LGTM. Merged into master. Thanks! Btw, it is not necessary to specify the 
item type of RDD or Graph. Checkpointing doesn't care about the item type. Maybe we 
can try `RDD[_]` and `Graph[_, _]`, which might simplify the code a little bit 
(if it compiles).
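
A minimal sketch of the wildcard idea, assuming a stand-in `Dataset` class rather than Spark's `RDD` or `Graph` (both `Dataset` and `checkpointAll` are hypothetical names used only for illustration):

```scala
object WildcardSketch {
  // Hypothetical stand-in for something that can be checkpointed; NOT Spark's RDD.
  final class Dataset[T](val items: Seq[T]) {
    var checkpointed: Boolean = false
    def checkpoint(): Unit = { checkpointed = true }
  }

  // The element type plays no role in checkpointing, so the signature can
  // accept Dataset[_] instead of carrying a type parameter around.
  def checkpointAll(datasets: Seq[Dataset[_]]): Int = {
    datasets.foreach(_.checkpoint())
    datasets.count(_.checkpointed)
  }

  def main(args: Array[String]): Unit = {
    // Datasets with different element types can share one collection.
    val mixed: Seq[Dataset[_]] = Seq(
      new Dataset[Int](Seq(1, 2, 3)),
      new Dataset[String](Seq("a", "b")))
    println(checkpointAll(mixed))
  }
}
```

Whether the same simplification compiles in the actual checkpointer code depends on how the element type is used there, which is the caveat already raised in the comment.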





[GitHub] spark pull request: [SPARK-9214] [ML] [PySpark] support ml.NaiveBa...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7568#issuecomment-126367129
  
  [Test build #39051 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39051/console)
 for   PR 7568 at commit 
[`f9c94d1`](https://github.com/apache/spark/commit/f9c94d1015e0e328aa265b86c9b95ec8185f9ba6).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class NaiveBayes(JavaEstimator, HasFeaturesCol, HasLabelCol, 
HasPredictionCol):`
  * `class NaiveBayesModel(JavaModel):`






[GitHub] spark pull request: [SPARK-8998][MLlib] Distribute PrefixSpan comp...

2015-07-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7783#issuecomment-126367159
  
LGTM. Merged into master. Thanks!





[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126367420
  
It looks like an unrelated failure.





[GitHub] spark pull request: [SPARK-9277] [MLLIB] SparseVector constructor ...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7794#issuecomment-126366497
  
  [Test build #39046 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39046/consoleFull)
 for   PR 7794 at commit 
[`6ffe34a`](https://github.com/apache/spark/commit/6ffe34a560829ac0e1f85b92f958ab394b1dda7a).





[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126366153
  
  [Test build #39049 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39049/console)
 for   PR 7694 at commit 
[`fbbe2ec`](https://github.com/apache/spark/commit/fbbe2ecd463dc8d219080fdd8649f92b9fdf38c5).
 * This patch **fails Python style tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `class ImpurityStats(`






[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126366037
  
  [Test build #165 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/165/console)
 for   PR 7694 at commit 
[`fbbe2ec`](https://github.com/apache/spark/commit/fbbe2ecd463dc8d219080fdd8649f92b9fdf38c5).
 * This patch **fails Python style tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `class ImpurityStats(`






[GitHub] spark pull request: [SPARK-8998][MLlib] Distribute PrefixSpan comp...

2015-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7783#discussion_r35881343
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala 
---
@@ -78,81 +97,153 @@ class PrefixSpan private (
* the value of pair is the pattern's count.
*/
   def run(sequences: RDD[Array[Int]]): RDD[(Array[Int], Long)] = {
+    val sc = sequences.sparkContext
+
     if (sequences.getStorageLevel == StorageLevel.NONE) {
       logWarning("Input data is not cached.")
     }
-    val minCount = getMinCount(sequences)
-    val lengthOnePatternsAndCounts =
-      getFreqItemAndCounts(minCount, sequences).collect()
-    val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-      lengthOnePatternsAndCounts.map(_._1), sequences)
-    val groupedProjectedDatabase = prefixAndProjectedDatabase
-      .map(x => (x._1.toSeq, x._2))
-      .groupByKey()
-      .map(x => (x._1.toArray, x._2.toArray))
-    val nextPatterns = getPatternsInLocal(minCount, groupedProjectedDatabase)
-    val lengthOnePatternsAndCountsRdd =
-      sequences.sparkContext.parallelize(
-        lengthOnePatternsAndCounts.map(x => (Array(x._1), x._2)))
-    val allPatterns = lengthOnePatternsAndCountsRdd ++ nextPatterns
-    allPatterns
+
+    // Convert min support to a min number of transactions for this dataset
+    val minCount = if (minSupport == 0) 0L else math.ceil(sequences.count() * minSupport).toLong
+
+    // (Frequent items -> number of occurrences, all items here satisfy the `minSupport` threshold
+    val freqItemCounts = sequences
+      .flatMap(seq => seq.distinct.map(item => (item, 1L)))
+      .reduceByKey(_ + _)
+      .filter(_._2 >= minCount)
+      .collect()
+
+    // Pairs of (length 1 prefix, suffix consisting of frequent items)
+    val itemSuffixPairs = {
+      val freqItems = freqItemCounts.map(_._1).toSet
+      sequences.flatMap { seq =>
+        val filteredSeq = seq.filter(freqItems.contains(_))
+        freqItems.flatMap { item =>
+          val candidateSuffix = LocalPrefixSpan.getSuffix(item, filteredSeq)
+          candidateSuffix match {
+            case suffix if !suffix.isEmpty => Some((List(item), suffix))
+            case _ => None
+          }
+        }
+      }
+    }
+
+    // Accumulator for the computed results to be returned, initialized to the frequent items (i.e.
+    // frequent length-one prefixes)
+    var resultsAccumulator = freqItemCounts.map(x => (List(x._1), x._2))
+
+    // Remaining work to be locally and distributively processed respectfully
+    var (pairsForLocal, pairsForDistributed) = partitionByProjDBSize(itemSuffixPairs)
+
+    // Continue processing until no pairs for distributed processing remain (i.e. all prefixes have
+    // projected database sizes <= `maxLocalProjDBSize`)
+    while (pairsForDistributed.count() != 0) {
+      val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+        extendPrefixes(minCount, pairsForDistributed)
+      pairsForDistributed.unpersist()
+      val (smallerPairsPart, largerPairsPart) = partitionByProjDBSize(nextPrefixSuffixPairs)
+      pairsForDistributed = largerPairsPart
+      pairsForDistributed.persist(StorageLevel.MEMORY_AND_DISK)
+      pairsForLocal ++= smallerPairsPart
+      resultsAccumulator ++= nextPatternAndCounts.collect()
+    }
+
+    // Process the small projected databases locally
+    val remainingResults = getPatternsInLocal(
+      minCount, sc.parallelize(pairsForLocal, 1).groupByKey())
+
+    (sc.parallelize(resultsAccumulator, 1) ++ remainingResults)
+      .map { case (pattern, count) => (pattern.toArray, count) }
   }
 
+
   /**
-   * Get the minimum count (sequences count * minSupport).
-   * @param sequences input data set, contains a set of sequences,
-   * @return minimum count,
+   * Partitions the prefix-suffix pairs by projected database size.
+   * @param prefixSuffixPairs prefix (length n) and suffix pairs,
+   * @return prefix-suffix pairs partitioned by whether their projected database size is <= or
+   * greater than [[maxLocalProjDBSize]]
    */
-  private def getMinCount(sequences: RDD[Array[Int]]): Long = {
-    if (minSupport == 0) 0L else math.ceil(sequences.count() * minSupport).toLong
+  private def partitionByProjDBSize(prefixSuffixPairs: RDD[(List[Int], Array[Int])])
+    : (Array[(List[Int], Array[Int])], RDD[(List[Int], Array[Int])]) = {
+    val prefixToSuffixSize = prefixSuffixPairs
+      .aggregateByKey(0)(
+        seqOp = { case (count, suffix) => count + suffix.length },
+        combOp = { _ + _ })
+val 

[GitHub] spark pull request: [SPARK-9308] [ML] ml.NaiveBayesModel support p...

2015-07-30 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/7672#discussion_r35881182
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
@@ -46,6 +51,44 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext {
 assert(model.theta.map(math.exp) ~== thetaData.map(math.exp) absTol 0.05, "theta mismatch")
   }
 
+  def expectedMultinomialProbabilities(model: NaiveBayesModel, feature: Vector): Vector = {
--- End diff --

As above, the ml.NaiveBayesModel parameters are all based on Vector and
Matrix, which differs from the old model, so I just want to write some
helper test functions for this kind of model rather than converting it to the
old-style model.
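
For illustration only, here is a minimal sketch of what such a helper might compute. It is not the code from the pull request: the object name and the signature are hypothetical, and plain arrays stand in for the ml `Vector` and `Matrix` types. It assumes `pi` holds log class priors and `theta` holds log conditional probabilities.

```scala
// Hypothetical sketch: plain arrays replace the ml Vector/Matrix types.
object MultinomialSketch {
  // pi(c): log prior of class c
  // theta(c)(j): log conditional probability of feature j given class c
  def expectedMultinomialProbabilities(
      pi: Array[Double],
      theta: Array[Array[Double]],
      features: Array[Double]): Array[Double] = {
    // Unnormalized log posterior per class: pi(c) + sum_j theta(c)(j) * x(j)
    val logPosteriors = pi.indices.map { c =>
      pi(c) + theta(c).zip(features).map { case (t, x) => t * x }.sum
    }
    // Subtract the maximum before exponentiating, for numerical stability
    val maxLog = logPosteriors.max
    val unnormalized = logPosteriors.map(lp => math.exp(lp - maxLog))
    val total = unnormalized.sum
    // Normalize so the class probabilities sum to one
    unnormalized.map(_ / total).toArray
  }
}
```

A test could then compare the model's predicted probabilities against this expected output with an absolute tolerance, without building an old-style model first.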


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9359][SQL] Support IntervalType for Par...

2015-07-30 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/7793#issuecomment-126369508
  
retest this please.





[GitHub] spark pull request: [SPARK-9359][SQL] Support IntervalType for Par...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7793#issuecomment-126370367
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9359][SQL] Support IntervalType for Par...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7793#issuecomment-126370384
  
Merged build started.





[GitHub] spark pull request: [SPARK-9359][SQL] Support IntervalType for Par...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7793#issuecomment-126370562
  
  [Test build #39058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39058/consoleFull) for PR 7793 at commit [`ad46986`](https://github.com/apache/spark/commit/ad4698629f005b113a0f02c2f8a1faa32a8f8aaa).





[GitHub] spark pull request: [SPARK-5561] [mllib] Generalized PeriodicCheck...

2015-07-30 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/7728#issuecomment-126372592
  
@jkbradley thanks, this is actually not affected by the recent 
checkpointing changes since we keep the old code path. In the future you can 
switch to calling `rdd.localCheckpoint()` and suddenly everything will be a 
little faster.





[GitHub] spark pull request: [SPARK-9479][Streaming][Tests]Fix ReceiverTrac...

2015-07-30 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/7797

[SPARK-9479][Streaming][Tests]Fix ReceiverTrackerSuite failure

See https://issues.apache.org/jira/browse/SPARK-9479 for the failure cause.

The PR includes the following changes:
1. Make ReceiverTrackerSuite create StreamingContext in the test body.
2. Fix places that don't stop StreamingContext. I verified locally that no
SparkContext was stopped in the shutdown hook after this fix.
3. Fix an issue where `ReceiverTracker.endpoint` may be null.
4. Make sure stopping SparkContext in a non-main thread won't fail other
tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark fix-ReceiverTrackerSuite

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7797.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7797


commit d7497df154ac8f44662e5511c70d43fd79f9eabb
Author: zsxwing zsxw...@gmail.com
Date:   2015-07-30T15:16:53Z

Fix ReceiverTrackerSuite; make sure StreamingContext in tests is closed







[GitHub] spark pull request: [SPARK-9408] [PySpark] [MLlib] Refactor linalg...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7746#issuecomment-126354452
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7937][SQL] Support comparison on Struct...

2015-07-30 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/6519#issuecomment-126357418
  
ping @rxin any further comments?





[GitHub] spark pull request: [SPARK-5561] [mllib] Generalized PeriodicCheck...

2015-07-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7728





[GitHub] spark pull request: [SPARK-9471] [ML] Multilayer Perceptron

2015-07-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7621#issuecomment-126360987
  
Thanks! The branch name doesn't matter:) I will make another pass today.





[GitHub] spark pull request: [SPARK-9277] [MLLIB] SparseVector constructor ...

2015-07-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7794#issuecomment-126360771
  
LGTM. I think it is useful to add the same check to Python. @MechCoder 
could you add it after #7746 ? Thanks!





[GitHub] spark pull request: [SPARK-9308] [ML] ml.NaiveBayesModel support p...

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7672#issuecomment-126363829
  
Merged build started.





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7648#issuecomment-126363831
  
Merged build started.





[GitHub] spark pull request: [SPARK-9104][CORE][WIP] expose Netty network l...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-126364386
  
  [Test build #164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/164/consoleFull) for PR 7753 at commit [`17e5b97`](https://github.com/apache/spark/commit/17e5b978618a5a6adfa3ff621e37eeecaa0b2b0c).





[GitHub] spark pull request: [WIP] [SPARK-6885] [ML] decision tree support ...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7694#issuecomment-126364431
  
  [Test build #165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/165/consoleFull) for PR 7694 at commit [`fbbe2ec`](https://github.com/apache/spark/commit/fbbe2ecd463dc8d219080fdd8649f92b9fdf38c5).





[GitHub] spark pull request: [SPARK-8979] Add a PID based rate estimator

2015-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7648#issuecomment-126363847
  
Merged build started.





[GitHub] spark pull request: [SPARK-8862][SPARK-8862][SQL][WIP] Add basic i...

2015-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7774#issuecomment-126364218
  
  [Test build #163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/163/consoleFull) for PR 7774 at commit [`23abf73`](https://github.com/apache/spark/commit/23abf73cafac3af0363486bdae91d737e235a197).




