[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-30 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15628373
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingRegression.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, DeveloperApi}
+import org.apache.spark.streaming.dstream.DStream
+
+/**
+ * :: DeveloperApi ::
+ * StreamingRegression implements methods for training
+ * a linear regression model on streaming data, and using it
+ * for prediction on streaming data.
+ *
+ * This class takes as type parameters a GeneralizedLinearModel,
+ * and a GeneralizedLinearAlgorithm, making it easy to extend to construct
+ * streaming versions of arbitrary regression analyses. For example usage,
+ * see StreamingLinearRegressionWithSGD.
+ *
+ */
+@DeveloperApi
+@Experimental
+abstract class StreamingRegression[
+M <: GeneralizedLinearModel,
+A <: GeneralizedLinearAlgorithm[M]] extends Logging {
+
+  /** The model to be updated and used for prediction. */
+  var model: M
+
+  /** The algorithm to use for updating. */
+  val algorithm: A
+
+  /** Return the latest model. */
+  def latest(): M = {
+model
+  }
+
+  /**
+   * Update the model by training on batches of data from a DStream.
+   * This operation registers a DStream for training the model,
+   * and updates the model based on every subsequent non-empty
+   * batch of data from the stream.
+   *
+   * @param data DStream containing labeled data
+   */
+  def trainOn(data: DStream[LabeledPoint]) {
+data.foreachRDD{
+  rdd =>
+if (rdd.count() > 0) {
+  model = algorithm.run(rdd, model.weights)
+  logInfo("Model updated")
+}
+logInfo("Current model: weights, 
%s".format(model.weights.toString))
+logInfo("Current model: intercept, 
%s".format(model.intercept.toString))
--- End diff --

Yup, I noticed this. It could also work to call 
``setIntercept(addIntercept=true)`` where the algorithm is defined (e.g. within 
``StreamingLinearRegressionWithSGD``), and have a setter to control this. 
Estimating an intercept from scratch on each update should be well constrained 
because we'll be starting from the current weights.
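
To make that concrete, here is a rough sketch of the idea (not the PR's actual code; the real class also has the `start` factory used by the tests, and the class/constructor shape here is only illustrative):

~~~
import org.apache.spark.mllib.linalg.Vectors

// Sketch only: enable the intercept where the algorithm is defined, and expose
// a setter so callers can toggle it on the streaming wrapper.
class StreamingLinearRegressionWithSGD(numFeatures: Int)
  extends StreamingRegression[LinearRegressionModel, LinearRegressionWithSGD] {

  val algorithm = new LinearRegressionWithSGD()
  algorithm.setIntercept(true) // estimate an intercept on every batch update

  var model = new LinearRegressionModel(
    Vectors.dense(new Array[Double](numFeatures)), 0.0)

  /** Setter so callers can control whether an intercept is fitted. */
  def setIntercept(addIntercept: Boolean): this.type = {
    algorithm.setIntercept(addIntercept)
    this
  }
}
~~~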




[GitHub] spark pull request: [SPARK-2523] [SQL] Hadoop table scan bug fixin...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1669#issuecomment-50721673
  
QA results for PR 1669:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17556/consoleFull




[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...

2014-07-30 Thread jey
Github user jey commented on the pull request:

https://github.com/apache/spark/pull/1680#issuecomment-50721572
  
I agree with this design; the preforking was basically vestigial and should 
have been removed. I'll review this PR later this week.




[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-30 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15628354
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionSuite.scala
 ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import java.io.File
+
+import com.google.common.io.Files
+import org.apache.commons.io.FileUtils
+import org.scalatest.FunSuite
+import org.apache.spark.SparkConf
+import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}
+import org.apache.spark.mllib.util.{MLStreamingUtils, LinearDataGenerator, 
LocalSparkContext}
+
+import scala.collection.mutable.ArrayBuffer
+
+class StreamingLinearRegressionSuite extends FunSuite {
+
+  // Assert that two values are equal within tolerance epsilon
+  def assertEqual(v1: Double, v2: Double, epsilon: Double) {
+def errorMessage = v1.toString + " did not equal " + v2.toString
+assert(math.abs(v1-v2) <= epsilon, errorMessage)
+  }
+
+  // Assert that model predictions are correct
+  def validatePrediction(predictions: Seq[Double], input: 
Seq[LabeledPoint]) {
+val numOffPredictions = predictions.zip(input).count { case 
(prediction, expected) =>
+  // A prediction is off if the prediction is more than 0.5 away from 
expected value.
+  math.abs(prediction - expected.label) > 0.5
+}
+// At least 80% of the predictions should be on.
+assert(numOffPredictions < input.length / 5)
+  }
+
+  // Test if we can accurately learn Y = 10*X1 + 10*X2 on streaming data
+  test("streaming linear regression parameter accuracy") {
+
+val conf = new SparkConf().setMaster("local").setAppName("streaming test")
+val testDir = Files.createTempDir()
+val numBatches = 10
+val ssc = new StreamingContext(conf, Seconds(1))
+val data = MLStreamingUtils.loadLabeledPointsFromText(ssc, testDir.toString)
+val model = StreamingLinearRegressionWithSGD.start(numFeatures=2, numIterations=50)
+
+model.trainOn(data)
+
+ssc.start()
+
+// write data to a file stream
+Thread.sleep(5000)
--- End diff --

Might not be =) I added it because I saw it in the streaming test suite for file writing, but without it both tests still pass fine.




[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-30 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15628342
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionSuite.scala
 ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import java.io.File
+
+import com.google.common.io.Files
+import org.apache.commons.io.FileUtils
+import org.scalatest.FunSuite
+import org.apache.spark.SparkConf
+import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}
+import org.apache.spark.mllib.util.{MLStreamingUtils, LinearDataGenerator, 
LocalSparkContext}
+
+import scala.collection.mutable.ArrayBuffer
+
+class StreamingLinearRegressionSuite extends FunSuite {
--- End diff --

I wanted to do this but it doesn't seem to work. The first test always 
passes, but the second test immediately hits this error. Maybe because we're 
effectively starting multiple StreamingContexts from the same SparkContext? I'm 
trying to debug. @tdas ?

java.lang.NullPointerException:
  at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:159)
  at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:66)
  at org.apache.spark.mllib.regression.StreamingLinearRegressionSuite$$anonfun$2.apply$mcV$sp(StreamingLinearRegressionSuite.scala:92)
  at org.apache.spark.mllib.regression.StreamingLinearRegressionSuite$$anonfun$2.apply(StreamingLinearRegressionSuite.scala:89)
  at org.apache.spark.mllib.regression.StreamingLinearRegressionSuite$$anonfun$2.apply(StreamingLinearRegressionSuite.scala:89)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
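
For what it's worth, one pattern that keeps only a single StreamingContext active at a time is giving each test its own context and stopping it in a finally block. This is just a sketch under that assumption (the suite name here is made up), not the suite's actual code:

~~~
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.scalatest.FunSuite

class StreamingContextLifecycleSketch extends FunSuite {
  test("each test owns and stops its StreamingContext") {
    val conf = new SparkConf().setMaster("local").setAppName("streaming test")
    val ssc = new StreamingContext(conf, Seconds(1))
    try {
      // ... register streams, start ssc, and run the assertions here ...
    } finally {
      // Stop the streaming context (and, by default, its SparkContext) so the
      // next test can create a fresh pair without colliding with this one.
      ssc.stop()
    }
  }
}
~~~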




[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...

2014-07-30 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1680#issuecomment-50721351
  
Why did we have multiple listening processes before? How is the performance of fork() on EC2? I have seen fork() take 200ms in a Xen VM.




[GitHub] spark pull request: add cacheTable guide

2014-07-30 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1681#issuecomment-50721262
  
LGTM, thanks!




[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-30 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15628281
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingRegression.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, DeveloperApi}
+import org.apache.spark.streaming.dstream.DStream
+
+/**
+ * :: DeveloperApi ::
+ * StreamingRegression implements methods for training
+ * a linear regression model on streaming data, and using it
+ * for prediction on streaming data.
+ *
+ * This class takes as type parameters a GeneralizedLinearModel,
+ * and a GeneralizedLinearAlgorithm, making it easy to extend to construct
+ * streaming versions of arbitrary regression analyses. For example usage,
+ * see StreamingLinearRegressionWithSGD.
+ *
+ */
+@DeveloperApi
+@Experimental
+abstract class StreamingRegression[
+M <: GeneralizedLinearModel,
+A <: GeneralizedLinearAlgorithm[M]] extends Logging {
+
+  /** The model to be updated and used for prediction. */
+  var model: M
+
+  /** The algorithm to use for updating. */
+  val algorithm: A
+
+  /** Return the latest model. */
+  def latest(): M = {
+model
+  }
+
+  /**
+   * Update the model by training on batches of data from a DStream.
+   * This operation registers a DStream for training the model,
+   * and updates the model based on every subsequent non-empty
+   * batch of data from the stream.
+   *
+   * @param data DStream containing labeled data
+   */
+  def trainOn(data: DStream[LabeledPoint]) {
+data.foreachRDD{
+  rdd =>
+if (rdd.count() > 0) {
+  model = algorithm.run(rdd, model.weights)
+  logInfo("Model updated")
--- End diff --

great idea!




[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...

2014-07-30 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1680#discussion_r15628262
  
--- Diff: python/pyspark/daemon.py ---
@@ -176,13 +120,30 @@ def handle_sigchld(*args):
 try:
 while not should_exit():
 try:
-# Spark tells us to exit by closing stdin
-if os.read(0, 512) == '':
-shutdown()
-except EnvironmentError as err:
-if err.errno != EINTR:
-shutdown()
+ready_fds = select.select([0, listen_sock], [], [])[0]
+except select.error as ex:
+if ex[0] == 4:
+continue
+else:
 raise
+if 0 in ready_fds:
+# Spark told us to exit by closing stdin
+shutdown()
+if listen_sock in ready_fds:
+sock, addr = listen_sock.accept()
+# Launch a worker process
+if os.fork() == 0:
+listen_sock.close()
+try:
+worker(sock)
+except:
+traceback.print_exc()
+os._exit(1)
+else:
+assert should_exit()
--- End diff --

Why is this needed? It should exit after the task finishes: should_exit() will return False, then it will set exit_flag to True, and all processes will die.




[GitHub] spark pull request: [SPARK-2752]spark sql cli should not exit when...

2014-07-30 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1661#issuecomment-50721059
  
LGTM except some styling issues. Thanks for working on this!




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15628238
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -845,33 +842,15 @@ object DecisionTree extends Serializable with Logging 
{
 }
   }
 
-  if (leftTotalCount == 0) {
-return new InformationGainStats(0, topImpurity, topImpurity, 
Double.MinValue, 1)
-  }
-  if (rightTotalCount == 0) {
-return new InformationGainStats(0, topImpurity, 
Double.MinValue, topImpurity, 1)
-  }
-
-  val leftImpurity = strategy.impurity.calculate(leftCounts, 
leftTotalCount)
-  val rightImpurity = strategy.impurity.calculate(rightCounts, 
rightTotalCount)
-
-  val leftWeight = leftTotalCount / (leftTotalCount + 
rightTotalCount)
-  val rightWeight = rightTotalCount / (leftTotalCount + 
rightTotalCount)
-
-  val gain = {
-if (level > 0) {
-  impurity - leftWeight * leftImpurity - rightWeight * 
rightImpurity
-} else {
-  impurity - leftWeight * leftImpurity - rightWeight * 
rightImpurity
-}
-  }
-
   val totalCount = leftTotalCount + rightTotalCount
+  if (totalCount == 0) {
+// Return arbitrary prediction.
+return new InformationGainStats(0, topImpurity, topImpurity, 
topImpurity, 0)
+  }
 
   // Sum of count for each label
-  val leftRightCounts: Array[Double]
-= leftCounts.zip(rightCounts)
-  .map{case (leftCount, rightCount) => leftCount + rightCount}
+  val leftRightCounts: Array[Double] = 
leftCounts.zip(rightCounts).map {
+  case (leftCount, rightCount) => leftCount + rightCount }
--- End diff --

It may be more consistent to write this way:

~~~
val leftRightCounts: Array[Double] =
  leftCounts.zip(rightCounts).map { case (leftCount, rightCount) =>
leftCount + rightCount
  }
~~~




[GitHub] spark pull request: [SPARK-2752]spark sql cli should not exit when...

2014-07-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/1661#discussion_r15628242
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
 ---
@@ -288,33 +288,36 @@ private[hive] class SparkSQLCLIDriver extends 
CliDriver with Logging {
 out.println(cmd)
   }
 
-  ret = driver.run(cmd).getResponseCode
-  if (ret != 0) {
-driver.close()
-return ret
-  }
-
-  val res = new JArrayList[String]()
-
-  if (HiveConf.getBoolVar(conf, 
HiveConf.ConfVars.HIVE_CLI_PRINT_HEADER)) {
-// Print the column names.
-Option(driver.getSchema.getFieldSchemas).map { fields =>
-  out.println(fields.map(_.getName).mkString("\t"))
+  try {
+ret = driver.run(cmd).getResponseCode
+if (ret != 0) {
+  driver.close()
+   return ret
 }
-  }
+val res = new JArrayList[String]()
+
+if (HiveConf.getBoolVar(conf, 
HiveConf.ConfVars.HIVE_CLI_PRINT_HEADER)) {
+ // Print the column names.
+ Option(driver.getSchema.getFieldSchemas).map { fields =>
+   out.println(fields.map(_.getName).mkString("\t"))
+ }
+   }
 
-  try {
 while (!out.checkError() && driver.getResults(res)) {
   res.foreach(out.println)
   res.clear()
 }
   } catch {
-case e:IOException =>
+case e: IOException =>
--- End diff --

I think we can simply change `IOException` to `Exception` and remove the 
following `case e: Exception` clause to make this more concise. Also, the 
`stringifyException(e)` call can be useful for diagnosing failing SQL 
statements/scripts.
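
Roughly like this, as a standalone sketch of the pattern (the helper and parameter names here are made up, not the driver's actual ones):

~~~
import org.apache.hadoop.util.StringUtils

object CliErrorHandlingSketch {
  // Hypothetical helper: run a command and fold all failures into one clause.
  def runAndReport(run: () => Int, out: java.io.PrintStream): Int = {
    try {
      run()
    } catch {
      case e: Exception =>
        // One clause covers IOException and everything else; stringifyException
        // keeps the full stack trace, which helps diagnose SQL statements/scripts.
        out.println(s"Failed with exception ${e.getClass.getName}: ${e.getMessage}\n" +
          StringUtils.stringifyException(e))
        1
    }
  }
}
~~~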




[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...

2014-07-30 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1680#discussion_r15628219
  
--- Diff: python/pyspark/daemon.py ---
@@ -176,13 +120,30 @@ def handle_sigchld(*args):
 try:
 while not should_exit():
 try:
-# Spark tells us to exit by closing stdin
-if os.read(0, 512) == '':
-shutdown()
-except EnvironmentError as err:
-if err.errno != EINTR:
-shutdown()
+ready_fds = select.select([0, listen_sock], [], [])[0]
--- End diff --

select() should have a timeout here; otherwise, after SIGTERM, it will not exit until events happen.




[GitHub] spark pull request: SPARK-2712 - Add a small note to maven doc tha...

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1615#issuecomment-50720880
  
Maybe it would make sense to just enhance the existing section and clearly show that you need to run "mvn package" and then "mvn test", with a code block similar to the one in this PR.




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15628197
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -815,20 +822,10 @@ object DecisionTree extends Serializable with Logging 
{
 topImpurity: Double): InformationGainStats = {
   strategy.algo match {
 case Classification =>
-  var classIndex = 0
-  val leftCounts: Array[Double] = new Array[Double](numClasses)
-  val rightCounts: Array[Double] = new Array[Double](numClasses)
-  var leftTotalCount = 0.0
-  var rightTotalCount = 0.0
-  while (classIndex < numClasses) {
-val leftClassCount = 
leftNodeAgg(featureIndex)(splitIndex)(classIndex)
-val rightClassCount = 
rightNodeAgg(featureIndex)(splitIndex)(classIndex)
-leftCounts(classIndex) = leftClassCount
-leftTotalCount += leftClassCount
-rightCounts(classIndex) = rightClassCount
-rightTotalCount += rightClassCount
-classIndex += 1
-  }
+  val leftCounts: Array[Double] = 
leftNodeAgg(featureIndex)(splitIndex)
+  val rightCounts: Array[Double] = 
rightNodeAgg(featureIndex)(splitIndex)
+  var leftTotalCount = leftCounts.sum
+  var rightTotalCount = rightCounts.sum
--- End diff --

Those two values could be `val`.
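
That is, the two lines above would simply become:

~~~
val leftTotalCount = leftCounts.sum
val rightTotalCount = rightCounts.sum
~~~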




[GitHub] spark pull request: SPARK-2712 - Add a small note to maven doc tha...

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1615#issuecomment-50720702
  
@rxin @javadba - I believe this is covered already in the maven docs:


http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15628159
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -612,27 +615,31 @@ object DecisionTree extends Serializable with Logging 
{
   agg(aggIndex + labelInt) = agg(aggIndex + labelInt) + 1
 }
 
-def updateBinForUnorderedFeature(nodeIndex: Int, featureIndex: Int, 
arr: Array[Double],
-label: Double, agg: Array[Double], rightChildShift: Int) = {
+def updateBinForUnorderedFeature(
+nodeIndex: Int,
+featureIndex: Int,
+arr: Array[Double],
+label: Double,
+agg: Array[Double],
+rightChildShift: Int) = {
--- End diff --

Your comments in the PR description are very helpful for understanding the indexing of `agg` in the code. Do you mind copying them here?




[GitHub] spark pull request: [SPARK-1812] [WIP] Scala 2.11 support

2014-07-30 Thread avati
Github user avati commented on the pull request:

https://github.com/apache/spark/pull/1649#issuecomment-50720340
  
I will split this up into multiple PRs.

I will start with a small pull request for the akka upgrade to 2.3. However, as 
Sean mentioned, I guess we are stuck on the fundamental incompatibility with 
Hadoop 1. Either we will need to forgo Hadoop 1 compatibility, or wait until a 
-shaded-protobuf version of the akka artifact is released (like 
akka-2.2.3-shaded-protobuf from org.spark-project.akka), or temporarily forgo 
compatibility and fix up pom.xml later if/when a shaded-protobuf artifact is 
released.

I would like to hear opinions about this.




[GitHub] spark pull request: Remove unnecessary Code from spark-shell.cmd

2014-07-30 Thread techaddict
Github user techaddict closed the pull request at:

https://github.com/apache/spark/pull/666




[GitHub] spark pull request: Remove unnecessary Code from spark-shell.cmd

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/666#issuecomment-50719523
  
Close this issue.




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627927
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -598,9 +598,12 @@ object DecisionTree extends Serializable with Logging {
  // Find feature bins for all nodes at a level.
 val binMappedRDD = input.map(x => findBinsForLevel(x))
 
-def updateBinForOrderedFeature(arr: Array[Double], agg: Array[Double], 
nodeIndex: Int,
-label: Double, featureIndex: Int) = {
-
+def updateBinForOrderedFeature(
+arr: Array[Double],
+agg: Array[Double],
+nodeIndex: Int,
+label: Double,
+featureIndex: Int) = {
--- End diff --

Could you help remove the `=` since this function returns nothing? (Same for 
`updateBinForUnorderedFeature`.)
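
In other words, using procedure syntax for the side-effecting function (body elided here):

~~~
def updateBinForOrderedFeature(
    arr: Array[Double],
    agg: Array[Double],
    nodeIndex: Int,
    label: Double,
    featureIndex: Int) {
  // mutates agg in place; nothing is returned, so no `=` is needed
}
~~~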




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627893
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -48,11 +50,13 @@ object DecisionTreeRunner {
 
   case class Params(
   input: String = null,
+  dataFormat: String = null,
   algo: Algo = Classification,
   numClassesForClassification: Int = 2,
--- End diff --

If `numClassesForClassification` is removed, we also need to remove it here.




[GitHub] spark pull request: [SPARK-1740] [PySpark] kill the python worker

2014-07-30 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1643#issuecomment-50718956
  
I will wait for your patch, and think about using PIDs.




[GitHub] spark pull request: SPARK-2098: All Spark processes should support...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1256#issuecomment-50718809
  
QA tests have started for PR 1256. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17564/consoleFull




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1678#issuecomment-50718818
  
QA tests have started for PR 1678. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17565/consoleFull




[GitHub] spark pull request: [SPARK-1740] [PySpark] kill the python worker

2014-07-30 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1643#issuecomment-50718757
  
Good question. It's dangerous to mix threads and fork(); it may cause a deadlock in the child process. But in this case, because of the GIL, when fork() happens the monitor thread is blocked, sleeping, or polling, which are thread safe, so it should not be a problem.




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627842
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -100,16 +111,57 @@ object DecisionTreeRunner {
   }
 
   def run(params: Params) {
+
 val conf = new SparkConf().setAppName("DecisionTreeRunner")
 val sc = new SparkContext(conf)
 
 // Load training data and cache it.
-val examples = MLUtils.loadLabeledPoints(sc, params.input).cache()
+val origExamples = params.dataFormat match {
+  case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache()
+  case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass 
= true).cache()
+}
+// For classification, re-index classes if needed.
+val (examples, numClasses) = params.algo match {
+  case Classification => {
+// classCounts: class --> # examples in class
+val classCounts = origExamples.map(_.label).countByValue
--- End diff --

add `()` to `countByValue` because it is an action (that triggers I/O).
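
That is, the line would read:

~~~
val classCounts = origExamples.map(_.label).countByValue()
~~~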




[GitHub] spark pull request: SPARK-2098: All Spark processes should support...

2014-07-30 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1256#issuecomment-50718557
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627787
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -100,16 +111,57 @@ object DecisionTreeRunner {
   }
 
   def run(params: Params) {
+
 val conf = new SparkConf().setAppName("DecisionTreeRunner")
 val sc = new SparkContext(conf)
 
 // Load training data and cache it.
-val examples = MLUtils.loadLabeledPoints(sc, params.input).cache()
+val origExamples = params.dataFormat match {
+  case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache()
+  case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass 
= true).cache()
+}
+// For classification, re-index classes if needed.
+val (examples, numClasses) = params.algo match {
+  case Classification => {
+// classCounts: class --> # examples in class
+val classCounts = origExamples.map(_.label).countByValue
+val numClasses = classCounts.size
+// classIndex: class --> index in 0,...,numClasses-1
+val classIndex = {
+  if (classCounts.keySet != Set[Double](0.0, 1.0)) {
+classCounts.keys.toList.sorted.zipWithIndex.toMap
+  } else {
+Map[Double, Int]()
+  }
+}
+val examples = {
+  if (classIndex.isEmpty) {
+origExamples
+  } else {
+origExamples.map(lp => LabeledPoint(classIndex(lp.label), 
lp.features))
+  }
+}
+println(s"numClasses = $numClasses.")
+println(s"Per-class example fractions, counts:")
+println(s"Class\tFrac\tCount")
+classCounts.keys.toList.sorted.foreach(c => {
+  val frac = classCounts(c) / (0.0 + examples.count())
--- End diff --

`examples.count()` is the same as `classCounts.values.sum`. (This is inside 
the foreach loop.)
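
For example, a sketch that computes the total once outside the loop (the println body is assumed from the "Class\tFrac\tCount" header above):

~~~
// Compute the total once instead of calling the examples.count() action on
// every iteration; countByValue() returns Longs, hence the toDouble.
val totalCount = classCounts.values.sum.toDouble
classCounts.keys.toList.sorted.foreach { c =>
  val frac = classCounts(c) / totalCount
  println(s"$c\t$frac\t${classCounts(c)}")
}
~~~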




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-30 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1678#issuecomment-50718425
  
Jenkins, retest this please.




[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1635#issuecomment-50718130
  
QA results for PR 1635:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17558/consoleFull




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1678#issuecomment-50718100
  
QA results for PR 1678:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17553/consoleFull




[GitHub] spark pull request: [SPARK-1812] [WIP] Scala 2.11 support

2014-07-30 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1649#issuecomment-50718152
  
Hey Anand,
This looks like a very nice effort! It would be very convenient if you could 
break this pull request into smaller subtasks. For example, upgrading akka can 
be one subtask, and so on. That way it is easier for us to review/test and then 
merge. If, however, you think you will not have the time for that, I am happy 
to do it for you. We would like to have this pretty soon; breaking up the PR 
will make reviewing and testing really easy for us.




[GitHub] spark pull request: SPARK-2098: All Spark processes should support...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1256#issuecomment-50718143
  
QA results for PR 1256:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class SparkConf(loadDefaults: Boolean, fileName: Option[String])
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17555/consoleFull




[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1671#issuecomment-50718111
  
QA results for PR 1671:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class HashingTF(val numFeatures: Int) extends Serializable {
  class IDF {
  class DocumentFrequencyAggregator extends Serializable {
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17557/consoleFull




[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...

2014-07-30 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50717628
  
This feature would slow down normal access (by attribute or position), so I did not put it in. Users can still use position to access fields with special names (such as keywords).




[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...

2014-07-30 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1578#issuecomment-50717700
  
localBlocksToFetch is an instance of ArrayBuffer, not a concurrent queue, and it is used from only one thread, right?




[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50717656
  
QA tests have started for PR 1598. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17563/consoleFull




[GitHub] spark pull request: add cacheTable guide

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1681#issuecomment-50717598
  
QA tests have started for PR 1681. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17562/consoleFull




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627594
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -100,16 +111,57 @@ object DecisionTreeRunner {
   }
 
   def run(params: Params) {
+
 val conf = new SparkConf().setAppName("DecisionTreeRunner")
 val sc = new SparkContext(conf)
 
 // Load training data and cache it.
-val examples = MLUtils.loadLabeledPoints(sc, params.input).cache()
+val origExamples = params.dataFormat match {
+  case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache()
+  case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass 
= true).cache()
+}
+// For classification, re-index classes if needed.
+val (examples, numClasses) = params.algo match {
+  case Classification => {
+// classCounts: class --> # examples in class
+val classCounts = origExamples.map(_.label).countByValue
+val numClasses = classCounts.size
+// classIndex: class --> index in 0,...,numClasses-1
+val classIndex = {
+  if (classCounts.keySet != Set[Double](0.0, 1.0)) {
+classCounts.keys.toList.sorted.zipWithIndex.toMap
+  } else {
+Map[Double, Int]()
+  }
+}
+val examples = {
+  if (classIndex.isEmpty) {
+origExamples
+  } else {
+origExamples.map(lp => LabeledPoint(classIndex(lp.label), 
lp.features))
+  }
+}
+println(s"numClasses = $numClasses.")
+println(s"Per-class example fractions, counts:")
+println(s"Class\tFrac\tCount")
+classCounts.keys.toList.sorted.foreach(c => {
--- End diff --

`classCounts.keys.toList.sorted` is used twice.

In Spark, `foreach(c => {` is often written as `foreach { c =>`
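
Putting both points together, something like:

~~~
// Compute the sorted class list once and reuse it, using the brace style.
val sortedClasses = classCounts.keys.toList.sorted
sortedClasses.foreach { c =>
  // ... per-class fraction and count reporting ...
}
~~~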




[GitHub] spark pull request: [SPARK-2752]spark sql cli should not exit when...

2014-07-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/1661#discussion_r15627597
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
 ---
@@ -288,33 +288,36 @@ private[hive] class SparkSQLCLIDriver extends 
CliDriver with Logging {
 out.println(cmd)
   }
 
-  ret = driver.run(cmd).getResponseCode
-  if (ret != 0) {
-driver.close()
-return ret
-  }
-
-  val res = new JArrayList[String]()
-
-  if (HiveConf.getBoolVar(conf, 
HiveConf.ConfVars.HIVE_CLI_PRINT_HEADER)) {
-// Print the column names.
-Option(driver.getSchema.getFieldSchemas).map { fields =>
-  out.println(fields.map(_.getName).mkString("\t"))
+  try {
+ret = driver.run(cmd).getResponseCode
+if (ret != 0) {
+  driver.close()
+   return ret
--- End diff --

And they cause compilation failure.




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627537
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -100,16 +111,57 @@ object DecisionTreeRunner {
   }
 
   def run(params: Params) {
+
 val conf = new SparkConf().setAppName("DecisionTreeRunner")
 val sc = new SparkContext(conf)
 
 // Load training data and cache it.
-val examples = MLUtils.loadLabeledPoints(sc, params.input).cache()
+val origExamples = params.dataFormat match {
+  case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache()
+  case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass 
= true).cache()
+}
+// For classification, re-index classes if needed.
+val (examples, numClasses) = params.algo match {
+  case Classification => {
+// classCounts: class --> # examples in class
+val classCounts = origExamples.map(_.label).countByValue
+val numClasses = classCounts.size
+// classIndex: class --> index in 0,...,numClasses-1
+val classIndex = {
--- End diff --

Would `classIndexMap` sound better?




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627527
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -100,16 +111,57 @@ object DecisionTreeRunner {
   }
 
   def run(params: Params) {
+
 val conf = new SparkConf().setAppName("DecisionTreeRunner")
 val sc = new SparkContext(conf)
 
 // Load training data and cache it.
-val examples = MLUtils.loadLabeledPoints(sc, params.input).cache()
+val origExamples = params.dataFormat match {
+  case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache()
+  case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass 
= true).cache()
+}
+// For classification, re-index classes if needed.
+val (examples, numClasses) = params.algo match {
+  case Classification => {
+// classCounts: class --> # examples in class
+val classCounts = origExamples.map(_.label).countByValue
+val numClasses = classCounts.size
+// classIndex: class --> index in 0,...,numClasses-1
+val classIndex = {
+  if (classCounts.keySet != Set[Double](0.0, 1.0)) {
--- End diff --

`[Double]` is not necessary




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627524
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -100,16 +111,57 @@ object DecisionTreeRunner {
   }
 
   def run(params: Params) {
+
 val conf = new SparkConf().setAppName("DecisionTreeRunner")
 val sc = new SparkContext(conf)
 
 // Load training data and cache it.
-val examples = MLUtils.loadLabeledPoints(sc, params.input).cache()
+val origExamples = params.dataFormat match {
+  case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache()
+  case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass 
= true).cache()
--- End diff --

`multiclass` was deprecated (today). Should be safe to remove it.
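
With the deprecated flag dropped, the case would simply be:

~~~
case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input).cache()
~~~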




[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1680#issuecomment-50717026
  
@aarondav it would be great if you could take a look and review this.




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627514
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -69,25 +73,32 @@ object DecisionTreeRunner {
   opt[Int]("maxDepth")
 .text(s"max depth of the tree, default: ${defaultParams.maxDepth}")
 .action((x, c) => c.copy(maxDepth = x))
-  opt[Int]("numClassesForClassification")
-.text(s"number of classes for classification, "
-  + s"default: ${defaultParams.numClassesForClassification}")
-.action((x, c) => c.copy(numClassesForClassification = x))
   opt[Int]("maxBins")
 .text(s"max number of bins, default: ${defaultParams.maxBins}")
 .action((x, c) => c.copy(maxBins = x))
+  opt[Double]("fracTest")
+.text(s"fraction of data to hold out for testing, default: 
${defaultParams.fracTest}")
+.action((x, c) => c.copy(fracTest = x))
   arg[String]("")
 .text("input paths to labeled examples in dense format (label,f0 
f1 f2 ...)")
 .required()
 .action((x, c) => c.copy(input = x))
+  arg[String]("")
--- End diff --

Maybe we can make data format an optional argument (default to LibSVM). One 
reason is that the dense format is deprecated in v1.1. We support either LIBSVM 
format or MLlib's own format `(label,[v0,v1,...])` or 
`(label,[i0,i1,...],[vi0,vi1,...])`. Also, it is a little weird to have this as 
the second required argument after input data.
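
As a sketch, assuming `dataFormat` defaults to "libsvm" in `Params`, the scopt option could look like:

~~~
opt[String]("dataFormat")
  .text("data format: libsvm (default) or dense")
  .action((x, c) => c.copy(dataFormat = x))
~~~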




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627518
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -69,25 +73,32 @@ object DecisionTreeRunner {
   opt[Int]("maxDepth")
 .text(s"max depth of the tree, default: ${defaultParams.maxDepth}")
 .action((x, c) => c.copy(maxDepth = x))
-  opt[Int]("numClassesForClassification")
-.text(s"number of classes for classification, "
-  + s"default: ${defaultParams.numClassesForClassification}")
-.action((x, c) => c.copy(numClassesForClassification = x))
   opt[Int]("maxBins")
 .text(s"max number of bins, default: ${defaultParams.maxBins}")
 .action((x, c) => c.copy(maxBins = x))
+  opt[Double]("fracTest")
+.text(s"fraction of data to hold out for testing, default: 
${defaultParams.fracTest}")
+.action((x, c) => c.copy(fracTest = x))
   arg[String]("")
 .text("input paths to labeled examples in dense format (label,f0 
f1 f2 ...)")
 .required()
 .action((x, c) => c.copy(input = x))
+  arg[String]("")
+.text("data format: dense/libsvm")
+.required()
+.action((x, c) => c.copy(dataFormat = x))
   checkConfig { params =>
-if (params.algo == Classification &&
-(params.impurity == Gini || params.impurity == Entropy)) {
-  success
-} else if (params.algo == Regression && params.impurity == 
Variance) {
-  success
+if (params.fracTest < 0 || params.fracTest > 1) {
+  failure(s"fracTest ${params.fracTest} value incorrect; should be 
in [0,1].")
--- End diff --

`(0, 1)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627517
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -69,25 +73,32 @@ object DecisionTreeRunner {
   opt[Int]("maxDepth")
 .text(s"max depth of the tree, default: ${defaultParams.maxDepth}")
 .action((x, c) => c.copy(maxDepth = x))
-  opt[Int]("numClassesForClassification")
-.text(s"number of classes for classification, "
-  + s"default: ${defaultParams.numClassesForClassification}")
-.action((x, c) => c.copy(numClassesForClassification = x))
   opt[Int]("maxBins")
 .text(s"max number of bins, default: ${defaultParams.maxBins}")
 .action((x, c) => c.copy(maxBins = x))
+  opt[Double]("fracTest")
+.text(s"fraction of data to hold out for testing, default: 
${defaultParams.fracTest}")
+.action((x, c) => c.copy(fracTest = x))
   arg[String]("")
 .text("input paths to labeled examples in dense format (label,f0 
f1 f2 ...)")
 .required()
 .action((x, c) => c.copy(input = x))
+  arg[String]("")
+.text("data format: dense/libsvm")
+.required()
+.action((x, c) => c.copy(dataFormat = x))
   checkConfig { params =>
-if (params.algo == Classification &&
-(params.impurity == Gini || params.impurity == Entropy)) {
-  success
-} else if (params.algo == Regression && params.impurity == 
Variance) {
-  success
+if (params.fracTest < 0 || params.fracTest > 1) {
--- End diff --

`... <= 0 || ... >= 1`?
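
Putting this together with the `(0, 1)` wording suggestion above, the check 
could read roughly as follows (a sketch only; the merged patch may choose 
different bounds):
~~~
checkConfig { params =>
  if (params.fracTest <= 0 || params.fracTest >= 1) {
    failure(s"fracTest ${params.fracTest} value incorrect; should be in (0,1).")
  } else {
    success
  }
}
~~~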


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1673#discussion_r15627507
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -69,25 +73,32 @@ object DecisionTreeRunner {
   opt[Int]("maxDepth")
 .text(s"max depth of the tree, default: ${defaultParams.maxDepth}")
 .action((x, c) => c.copy(maxDepth = x))
-  opt[Int]("numClassesForClassification")
-.text(s"number of classes for classification, "
-  + s"default: ${defaultParams.numClassesForClassification}")
-.action((x, c) => c.copy(numClassesForClassification = x))
   opt[Int]("maxBins")
 .text(s"max number of bins, default: ${defaultParams.maxBins}")
 .action((x, c) => c.copy(maxBins = x))
+  opt[Double]("fracTest")
+.text(s"fraction of data to hold out for testing, default: 
${defaultParams.fracTest}")
+.action((x, c) => c.copy(fracTest = x))
   arg[String]("")
 .text("input paths to labeled examples in dense format (label,f0 
f1 f2 ...)")
--- End diff --

If we take LIBSVM format, we should update the doc here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1639


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2497] Included checks for module symbol...

2014-07-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1463


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1673#issuecomment-50716977
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL][SPARK-2212]Hash Outer Join

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1147#issuecomment-50716934
  
QA tests have started for PR 1147. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17560/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15627468
  
--- Diff: python/pyspark/mllib/regression.py ---
@@ -109,18 +109,45 @@ class 
LinearRegressionModel(LinearRegressionModelBase):
 True
 """
 
-
 class LinearRegressionWithSGD(object):
 @classmethod
-def train(cls, data, iterations=100, step=1.0,
-  miniBatchFraction=1.0, initialWeights=None):
-"""Train a linear regression model on the given data."""
+def train(cls, data, iterations=100, step=1.0, regParam=1.0, 
regType=None,
+  intercept=False, miniBatchFraction=1.0, initialWeights=None):
+"""
+Train a linear regression model on the given data.
+
+@param data:  The training data.
+@param iterations:The number of iterations (default: 100).
+@param step:  The step parameter used in SGD
+  (default: 1.0).
+@param regParam:  The regularizer parameter (default: 1.0).
+@param regType:   The type of regularizer used for training
+  our model.
+  Allowed values: "l1" for using L1Updater,
+  "l2" for using
+   SquaredL2Updater,
+  "none" for no 
regularizer.
+  (default: None)
--- End diff --

It may be better to set the default to `"none"` and map `None` to `"none"` 
in the implementation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15627480
  
--- Diff: python/pyspark/mllib/regression.py ---
@@ -109,18 +109,45 @@ class 
LinearRegressionModel(LinearRegressionModelBase):
 True
 """
 
-
 class LinearRegressionWithSGD(object):
 @classmethod
-def train(cls, data, iterations=100, step=1.0,
-  miniBatchFraction=1.0, initialWeights=None):
-"""Train a linear regression model on the given data."""
+def train(cls, data, iterations=100, step=1.0, regParam=1.0, 
regType=None,
+  intercept=False, miniBatchFraction=1.0, initialWeights=None):
+"""
+Train a linear regression model on the given data.
+
+@param data:  The training data.
+@param iterations:The number of iterations (default: 100).
+@param step:  The step parameter used in SGD
+  (default: 1.0).
+@param regParam:  The regularizer parameter (default: 1.0).
+@param regType:   The type of regularizer used for training
+  our model.
+  Allowed values: "l1" for using L1Updater,
+  "l2" for using
+   SquaredL2Updater,
+  "none" for no 
regularizer.
+  (default: None)
+@param intercept: Boolean parameter which indicates the use
+  or not of the augmented representation 
for
+  training data (i.e. whether bias features
+  are activated or not).
+@param miniBatchFraction: Fraction of data to be used for each SGD
+  iteration.
+@param initialWeights:The initial weights (default: None).
+"""
 sc = data.context
-train_f = lambda d, i: 
sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
-d._jrdd, iterations, step, miniBatchFraction, i)
+if regType is None:
+train_f = lambda d, i: 
sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
+d._jrdd, iterations, step, regParam, "none", intercept, 
miniBatchFraction, i)
+elif regType == "l2" or regType == "l1" or regType == "none":
+train_f = lambda d, i: 
sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
+d._jrdd, iterations, step, regParam, regType, intercept, 
miniBatchFraction, i)
+else:
+raise ValueError("Invalid value for 'regType' parameter. Can 
only be initialized " +
+ "using the following string values: [l1, l2, 
none].")
 return _regression_train_wrapper(sc, train_f, 
LinearRegressionModel, data, initialWeights)
 
-
--- End diff --

ditto: please keep this empty line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15627471
  
--- Diff: python/pyspark/mllib/regression.py ---
@@ -109,18 +109,45 @@ class 
LinearRegressionModel(LinearRegressionModelBase):
 True
 """
 
-
 class LinearRegressionWithSGD(object):
 @classmethod
-def train(cls, data, iterations=100, step=1.0,
-  miniBatchFraction=1.0, initialWeights=None):
-"""Train a linear regression model on the given data."""
+def train(cls, data, iterations=100, step=1.0, regParam=1.0, 
regType=None,
+  intercept=False, miniBatchFraction=1.0, initialWeights=None):
+"""
+Train a linear regression model on the given data.
+
+@param data:  The training data.
+@param iterations:The number of iterations (default: 100).
+@param step:  The step parameter used in SGD
+  (default: 1.0).
+@param regParam:  The regularizer parameter (default: 1.0).
+@param regType:   The type of regularizer used for training
+  our model.
+  Allowed values: "l1" for using L1Updater,
+  "l2" for using
+   SquaredL2Updater,
+  "none" for no 
regularizer.
+  (default: None)
+@param intercept: Boolean parameter which indicates the use
+  or not of the augmented representation 
for
+  training data (i.e. whether bias features
+  are activated or not).
+@param miniBatchFraction: Fraction of data to be used for each SGD
+  iteration.
+@param initialWeights:The initial weights (default: None).
+"""
 sc = data.context
-train_f = lambda d, i: 
sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
-d._jrdd, iterations, step, miniBatchFraction, i)
+if regType is None:
+train_f = lambda d, i: 
sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
--- End diff --

To avoid having the long command twice, you can use

~~~
if regType is None:
    regType = "none"
if regType in {"l2", "l1", "none"}:
    train_f = ...
else:
    raise ...
~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15627451
  
--- Diff: python/pyspark/mllib/regression.py ---
@@ -109,18 +109,45 @@ class 
LinearRegressionModel(LinearRegressionModelBase):
 True
 """
 
-
 class LinearRegressionWithSGD(object):
 @classmethod
-def train(cls, data, iterations=100, step=1.0,
-  miniBatchFraction=1.0, initialWeights=None):
-"""Train a linear regression model on the given data."""
+def train(cls, data, iterations=100, step=1.0, regParam=1.0, 
regType=None,
--- End diff --

To be safe, we shouldn't change the order of arguments; we can append the new 
arguments `regParam`, `regType`, and `intercept` at the end. So if a user 
called `train(data, 100, 1.0, 1.0, np.zeros(n))`, it still works in the new 
version.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15627458
  
--- Diff: python/pyspark/mllib/regression.py ---
@@ -109,18 +109,45 @@ class 
LinearRegressionModel(LinearRegressionModelBase):
 True
 """
 
-
 class LinearRegressionWithSGD(object):
 @classmethod
-def train(cls, data, iterations=100, step=1.0,
-  miniBatchFraction=1.0, initialWeights=None):
-"""Train a linear regression model on the given data."""
+def train(cls, data, iterations=100, step=1.0, regParam=1.0, 
regType=None,
+  intercept=False, miniBatchFraction=1.0, initialWeights=None):
+"""
+Train a linear regression model on the given data.
+
+@param data:  The training data.
--- End diff --

It is not necessary to align the doc, especially when limited by 78 
characters.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15627440
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -247,16 +248,24 @@ class PythonMLLibAPI extends Serializable {
   dataBytesJRDD: JavaRDD[Array[Byte]],
   numIterations: Int,
   stepSize: Double,
+  regParam: Double,
+  regType: String,
+  intercept: Boolean,
   miniBatchFraction: Double,
   initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
+val lrAlg = new LinearRegressionWithSGD()
+lrAlg.setIntercept(intercept)
+lrAlg.optimizer
+  .setNumIterations(numIterations)
+  .setRegParam(regParam)
+  .setStepSize(stepSize)
+if (regType == "l2")
+  lrAlg.optimizer.setUpdater(new SquaredL2Updater)
+else if (regType == "l1")
+  lrAlg.optimizer.setUpdater(new L1Updater)
--- End diff --

It is safer to add
~~~
else if (regType != "none")
  throw new IllegalArgumentException("...")
~~~
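
For illustration, the whole dispatch could also be written as a single match 
with an explicit failure case (a sketch; it assumes that leaving the updater 
untouched means no regularization):
~~~
regType match {
  case "l2"   => lrAlg.optimizer.setUpdater(new SquaredL2Updater)
  case "l1"   => lrAlg.optimizer.setUpdater(new L1Updater)
  case "none" => // keep the optimizer's default updater (no regularization)
  case other  =>
    throw new IllegalArgumentException(s"Invalid value for 'regType': $other")
}
~~~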


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15627450
  
--- Diff: python/pyspark/mllib/regression.py ---
@@ -109,18 +109,45 @@ class 
LinearRegressionModel(LinearRegressionModelBase):
 True
 """
 
-
--- End diff --

We use two empty lines to separate methods in pyspark. (I don't know the 
exact reason ...)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...

2014-07-30 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50716626
  
Sure, sounds good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2766: ScalaReflectionSuite throw an lleg...

2014-07-30 Thread witgo
GitHub user witgo opened a pull request:

https://github.com/apache/spark/pull/1683

SPARK-2766: ScalaReflectionSuite throws an IllegalArgumentException in JDK 6



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/witgo/spark SPARK-2766

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1683.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1683


commit ff0ec1fc188a81af5702deb82005578b8027c13b
Author: GuoQiang Li 
Date:   2014-07-31T05:59:38Z

ScalaReflectionSuite throws an IllegalArgumentException in JDK 6




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2752]spark sql cli should not exit when...

2014-07-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/1661#discussion_r15627250
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
 ---
@@ -288,33 +288,36 @@ private[hive] class SparkSQLCLIDriver extends 
CliDriver with Logging {
 out.println(cmd)
   }
 
-  ret = driver.run(cmd).getResponseCode
-  if (ret != 0) {
-driver.close()
-return ret
-  }
-
-  val res = new JArrayList[String]()
-
-  if (HiveConf.getBoolVar(conf, 
HiveConf.ConfVars.HIVE_CLI_PRINT_HEADER)) {
-// Print the column names.
-Option(driver.getSchema.getFieldSchemas).map { fields =>
-  out.println(fields.map(_.getName).mkString("\t"))
+  try {
+ret = driver.run(cmd).getResponseCode
+if (ret != 0) {
+  driver.close()
+   return ret
--- End diff --

Hey, lines 295 and 300 to 304 contain illegal *Chinese* full-width space 
characters.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1671#issuecomment-50715814
  
QA tests have started for PR 1671. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17557/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1635#issuecomment-50715812
  
QA tests have started for PR 1635. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17558/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

2014-07-30 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1671#discussion_r15627165
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
+import org.apache.spark.mllib.rdd.RDDFunctions._
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Inverse document frequency (IDF).
+ * The standard formulation is used: `idf = log((m + 1) / (d(t) + 1))`, 
where `m` is the total
+ * number of documents and `d(t)` is the number of documents that contain 
term `t`.
+ */
+@Experimental
+class IDF {
+
+  // TODO: Allow different IDF formulations.
+
+  private var brzIdf: BDV[Double] = _
+
+  /**
+   * Computes the inverse document frequency.
+   * @param dataset an RDD of term frequency vectors
+   */
+  def fit(dataset: RDD[Vector]): this.type = {
+brzIdf = dataset.treeAggregate(new IDF.DocumentFrequencyAggregator)(
+  seqOp = (df, v) => df.add(v),
+  combOp = (df1, df2) => df1.merge(df2)
+).idf()
+this
+  }
+
+  /**
+   * Computes the inverse document frequency.
+   * @param dataset a JavaRDD of term frequency vectors
+   */
+  def fit(dataset: JavaRDD[Vector]): this.type = {
--- End diff --

Yes. In the Java test suite, I put `idf.fit(...).transform(...)`.
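
For context, the chained call mentioned above could look roughly like this in 
Scala (a sketch assuming the `fit`/`transform` API from this PR and an 
existing SparkContext `sc`):
~~~
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vectors

// Two toy term-frequency vectors; real input would come from HashingTF.
val tf = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 1.0, 0.0)))
val tfidf = new IDF().fit(tf).transform(tf)
~~~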


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2523] [SQL] Hadoop table scan bug fixin...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1669#issuecomment-50715565
  
QA tests have started for PR 1669. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17556/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1635#issuecomment-50715523
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...

2014-07-30 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/1578#discussion_r15627125
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala ---
@@ -199,15 +199,22 @@ object BlockFetcherIterator {
   // Get the local blocks while remote blocks are being fetched. Note 
that it's okay to do
   // these all at once because they will just memory-map some files, 
so they won't consume
   // any memory that might exceed our maxBytesInFlight
-  for (id <- localBlocksToFetch) {
-getLocalFromDisk(id, serializer) match {
-  case Some(iter) => {
-// Pass 0 as size since it's not in flight
-results.put(new FetchResult(id, 0, () => iter))
-logDebug("Got local block " + id)
-  }
-  case None => {
-throw new BlockException(id, "Could not get block " + id + " 
from local machine")
+  var fetchIndex = 0
+  try {
+for (id <- localBlocksToFetch) {
+
+  // getLocalFromDisk never return None but throws BlockException
+  val iter = getLocalFromDisk(id, serializer).get
+  // Pass 0 as size since it's not in flight
+  results.put(new FetchResult(id, 0, () => iter))
+  fetchIndex += 1
+  logDebug("Got local block " + id)
+}
+  } catch {
+case e: Exception => {
+  logError(s"Error occurred while fetching local blocks", e)
+  for (id <- localBlocksToFetch.drop(fetchIndex)) {
+results.put(new FetchResult(id, -1, null))
--- End diff --

Thank you for your comment, @mateiz .

> I wouldn't do drop and such on a ConcurrentQueue, since it might drop 
stuff other threads  were adding. Just do a results.put on the failed block and 
don't worry about dropping other ones. You can actually move the try/catch into 
the for loop and add a "return" at the bottom of the catch after adding this 
failing FetchResult.

But, if it returns from getLocalBlocks immediately, the rest of the 
FetchResults are never put into results, and we would wait on results.take() 
in the next method forever, right? results is an instance of 
LinkedBlockingQueue, and take is a blocking method.
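
To illustrate the concern, a tiny standalone sketch (simplified types, not the 
real BlockFetcherIterator):
~~~
import java.util.concurrent.LinkedBlockingQueue

val results = new LinkedBlockingQueue[String]()
results.put("block-0")       // only one of, say, three expected results is ever put
println(results.take())      // prints "block-0"
// println(results.take())   // would block forever: no producer calls put() again
~~~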


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...

2014-07-30 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1677#issuecomment-50715402
  
Added "closes" comment


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL][SPARK-2212]Hash Outer Join

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1147#issuecomment-50715224
  
QA results for PR 1147:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  trait BinaryRepeatableIteratorNode extends BinaryNode {
  case class HashOuterJoin(
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17551/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2098: All Spark processes should support...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1256#issuecomment-50715208
  
QA tests have started for PR 1256. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17555/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Spark 2017

2014-07-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1682#issuecomment-50715124
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2523] [SQL] Hadoop table scan bug fixin...

2014-07-30 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1669#issuecomment-50715127
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1677#issuecomment-50715107
  
QA results for PR 1677:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17552/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

2014-07-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1648#discussion_r15627025
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -127,26 +110,15 @@ class HadoopRDD[K, V](
   private val createTime = new Date()
 
   // Returns a JobConf that will be used on slaves to obtain input splits 
for Hadoop reads.
-  protected def getJobConf(): JobConf = {
-val conf: Configuration = broadcastedConf.value.value
-if (conf.isInstanceOf[JobConf]) {
-  // A user-broadcasted JobConf was provided to the HadoopRDD, so 
always use it.
-  conf.asInstanceOf[JobConf]
-} else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
-  // getJobConf() has been called previously, so there is already a 
local cache of the JobConf
-  // needed by this RDD.
-  HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
-} else {
-  // Create a JobConf that will be cached and used across this RDD's 
getJobConf() calls in the
-  // local process. The local cache is accessed through 
HadoopRDD.putCachedMetadata().
-  // The caching helps minimize GC, since a JobConf can contain ~10KB 
of temporary objects.
-  // Synchronize to prevent ConcurrentModificationException 
(Spark-1097, Hadoop-10456).
-  HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
-val newJobConf = new JobConf(conf)
-initLocalJobConfFuncOpt.map(f => f(newJobConf))
-HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
-newJobConf
-  }
+  protected def createJobConf(): JobConf = {
+val conf: Configuration = serializableConf.value
--- End diff --

It is because RDD objects are not reused at all. Each task gets its own 
deserialized copy of the HadoopRDD and the conf.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2497] Included checks for module symbol...

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1463#issuecomment-50715047
  
Thanks @ScrapCodes I think I understand this all now. I'll merge this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

2014-07-30 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1648#discussion_r15626994
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -127,26 +110,15 @@ class HadoopRDD[K, V](
   private val createTime = new Date()
 
   // Returns a JobConf that will be used on slaves to obtain input splits 
for Hadoop reads.
-  protected def getJobConf(): JobConf = {
-val conf: Configuration = broadcastedConf.value.value
-if (conf.isInstanceOf[JobConf]) {
-  // A user-broadcasted JobConf was provided to the HadoopRDD, so 
always use it.
-  conf.asInstanceOf[JobConf]
-} else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
-  // getJobConf() has been called previously, so there is already a 
local cache of the JobConf
-  // needed by this RDD.
-  HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
-} else {
-  // Create a JobConf that will be cached and used across this RDD's 
getJobConf() calls in the
-  // local process. The local cache is accessed through 
HadoopRDD.putCachedMetadata().
-  // The caching helps minimize GC, since a JobConf can contain ~10KB 
of temporary objects.
-  // Synchronize to prevent ConcurrentModificationException 
(Spark-1097, Hadoop-10456).
-  HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
-val newJobConf = new JobConf(conf)
-initLocalJobConfFuncOpt.map(f => f(newJobConf))
-HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
-newJobConf
-  }
+  protected def createJobConf(): JobConf = {
+val conf: Configuration = serializableConf.value
--- End diff --

Is this guaranteed to return a new copy of the conf for every partition or 
something? Because otherwise I'm not sure I see why we can safely remove the 
lock.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1678#issuecomment-50714957
  
QA tests have started for PR 1678. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17553/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Spark 2017

2014-07-30 Thread carlosfuertes
GitHub user carlosfuertes opened a pull request:

https://github.com/apache/spark/pull/1682

Spark 2017

This addresses issues SPARK-2017 and SPARK-2016 by serving the data for the 
tables as JSON and rendering it in the Spark UI web interface.

The main addition is exposing paths that serve the JSON data:

/stages/stage/tasks/json/?id=nnn
/storage/json
/storage/rdd/workers/json?id=nnn
/storage/rdd/blocks/json?id=nnn

and using javascript to build the tables from an ajax request for that JSON.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/carlosfuertes/spark SPARK-2017

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1682.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1682


commit 458cae4ab065a0e173e0f9dd2c075addafabd9a7
Author: carlosfuertes 
Date:   2014-07-28T05:24:41Z

json output for storage/rdd

commit 40a49c36516a261936fd8d591ec864646ec3f1ae
Author: carlosfuertes 
Date:   2014-07-28T06:12:11Z

small refactoring to keep code dry

commit 7e6ce37a601f7b58ab25351c366f213a7e159cb7
Author: carlosfuertes 
Date:   2014-07-29T05:54:18Z

storage tab is using completely JSON to render all data

commit 3d5f1ddf210d07821e2ba6177a04f338972e7109
Author: carlosfuertes 
Date:   2014-07-29T05:55:08Z

Merge remote-tracking branch 'upstream/master' into SPARK-2017

commit 469320603e1aca5c10fc59ff3fe3443ad9f4c9ce
Author: carlosfuertes 
Date:   2014-07-29T06:08:56Z

made code pass style guidelines

commit f4ebcdbcca0ecefb4f4c990c35810a388c8e7bc0
Author: carlosfuertes 
Date:   2014-07-29T12:27:41Z

added expanding arrays in JSON for html tables into javascript function

commit f20560041468f1c0a59051cc960a94c58ebc138d
Author: carlosfuertes 
Date:   2014-07-30T02:29:56Z

Merge remote-tracking branch 'upstream/master' into SPARK-2017

commit 84d8628df35f01bc9ff818e0026a8cda6140bdf4
Author: carlosfuertes 
Date:   2014-07-30T05:11:21Z

adding tasks table under stage Spark UI from JSON. First pass

commit 6e787f60c670db77a2e1c7be3980ca7493b9ede9
Author: carlosfuertes 
Date:   2014-07-30T06:52:25Z

added sorttable custom key to json

commit 11e33836bc89932b1d5a4919394170e030329b57
Author: carlosfuertes 
Date:   2014-07-30T13:09:30Z

fixed sorttable to sort tables again

commit 7b0bde38371fe0e12d4cee52336ead6dff819d37
Author: carlosfuertes 
Date:   2014-07-31T02:31:47Z

Merge remote-tracking branch 'upstream/master' into SPARK-2017

commit 5aa4b283010a1eaaf48d09399ed16bf5aafb4726
Author: carlosfuertes 
Date:   2014-07-31T05:19:38Z

added a proper JSON handling of sorttable_custom key

commit 384ff1a567eadeefa72d28d5fedd9260a1874d3f
Author: carlosfuertes 
Date:   2014-07-31T05:22:27Z

deleted commented code




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-30 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1639#issuecomment-50714857
  
Alright, I've merged this.  Thanks for the review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2497] Included checks for module symbol...

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1463#issuecomment-50714816
  
Hey @ScrapCodes sorry, I was misunderstanding what you were saying. You are 
just trying to reflect on the symbol as both a class and a module.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...

2014-07-30 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50714796
  
Were you able to test the performance characteristics of this versus the 
old stuff? Was this indeed the main cause of the live listener bus overflowing, 
or is that still a problem?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2171: Demonstrate and explain how to use...

2014-07-30 Thread Artjom-Metro
Github user Artjom-Metro closed the pull request at:

https://github.com/apache/spark/pull/1304


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2171: Demonstrate and explain how to use...

2014-07-30 Thread Artjom-Metro
Github user Artjom-Metro commented on the pull request:

https://github.com/apache/spark/pull/1304#issuecomment-50714666
  
Okay, I'm sorry for the delay. It took some time to get access to a server; I 
hope to publish the results within the next 2 weeks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

2014-07-30 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1648#issuecomment-50714647
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1639#issuecomment-50714462
  
QA results for PR 1639:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17550/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...

2014-07-30 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50714300
  
@mateiz We could also provide a dict interface as a fallback, so users can 
access these fields just like a dict, e.g. row["pass"]?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2134: Report metrics before application ...

2014-07-30 Thread rahulsinghaliitd
Github user rahulsinghaliitd commented on a diff in the pull request:

https://github.com/apache/spark/pull/1076#discussion_r15626682
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala 
---
@@ -154,6 +154,8 @@ private[spark] class Master(
   }
 
   override def postStop() {
+masterMetricsSystem.report()
+applicationMetricsSystem.report()
--- End diff --

Staleness will depend on the metrics system configuration (polling period), 
although the metrics being tracked here are not very critical, especially when 
the Master/Worker is being killed. How about I leave the calls in just for 
completeness' sake?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: add cacheTable guide

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1681#issuecomment-50713584
  
QA results for PR 1681:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17549/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2664. Deal with `--conf` options in spar...

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1665#issuecomment-50713560
  
Looks good Sandy! One small question.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2664. Deal with `--conf` options in spar...

2014-07-30 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1665#discussion_r15626537
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -184,7 +184,7 @@ object SparkSubmit {
   OptionAssigner(args.archives, YARN, CLIENT, sysProp = 
"spark.yarn.dist.archives"),
 
   // Yarn cluster only
-  OptionAssigner(args.name, YARN, CLUSTER, clOption = "--name", 
sysProp = "spark.app.name"),
--- End diff --

Why does this one change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1680#issuecomment-50713364
  
QA results for PR 1680:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17548/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2340] Resolve event logging and History...

2014-07-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1280


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2758] UnionRDD's UnionPartition should ...

2014-07-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1675


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Required AM memory is "amMem", not "args.amMem...

2014-07-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1494


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2134: Report metrics before application ...

2014-07-30 Thread rahulsinghaliitd
Github user rahulsinghaliitd commented on a diff in the pull request:

https://github.com/apache/spark/pull/1076#discussion_r15626384
  
--- Diff: core/src/main/scala/org/apache/spark/metrics/sink/Sink.scala ---
@@ -20,4 +20,5 @@ package org.apache.spark.metrics.sink
 private[spark] trait Sink {
   def start: Unit
   def stop: Unit
+  def report: Unit
--- End diff --

Ouch, my apologies. I should have compiled this before pushing the new change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2134: Report metrics before application ...

2014-07-30 Thread rahulsinghaliitd
Github user rahulsinghaliitd commented on a diff in the pull request:

https://github.com/apache/spark/pull/1076#discussion_r15626358
  
--- Diff: core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala 
---
@@ -91,6 +91,10 @@ private[spark] class MetricsSystem private (val 
instance: String,
 sinks.foreach(_.stop)
   }
 
+  def report() {
--- End diff --

Ouch, my apologies. I should have compiled this before pushing the new change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...

2014-07-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1677#issuecomment-50712517
  
QA tests have started for PR 1677. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17552/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...

2014-07-30 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1677#issuecomment-50712252
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

