[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user freeman-lab commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15628373 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingRegression.scala --- @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import org.apache.spark.Logging +import org.apache.spark.annotation.{Experimental, DeveloperApi} +import org.apache.spark.streaming.dstream.DStream + +/** + * :: DeveloperApi :: + * StreamingRegression implements methods for training + * a linear regression model on streaming data, and using it + * for prediction on streaming data. + * + * This class takes as type parameters a GeneralizedLinearModel, + * and a GeneralizedLinearAlgorithm, making it easy to extend to construct + * streaming versions of arbitrary regression analyses. For example usage, + * see StreamingLinearRegressionWithSGD. + * + */ +@DeveloperApi +@Experimental +abstract class StreamingRegression[ +M <: GeneralizedLinearModel, +A <: GeneralizedLinearAlgorithm[M]] extends Logging { + + /** The model to be updated and used for prediction. */ + var model: M + + /** The algorithm to use for updating. */ + val algorithm: A + + /** Return the latest model. */ + def latest(): M = { +model + } + + /** + * Update the model by training on batches of data from a DStream. + * This operation registers a DStream for training the model, + * and updates the model based on every subsequent non-empty + * batch of data from the stream. + * + * @param data DStream containing labeled data + */ + def trainOn(data: DStream[LabeledPoint]) { +data.foreachRDD{ + rdd => +if (rdd.count() > 0) { + model = algorithm.run(rdd, model.weights) + logInfo("Model updated") +} +logInfo("Current model: weights, %s".format(model.weights.toString)) +logInfo("Current model: intercept, %s".format(model.intercept.toString)) --- End diff -- Yup, I noticed this. It could also work to call ``setIntercept(addIntercept=true)`` where the algorithm is defined (e.g. within ``StreamingLinearRegressionWithSGD``), and have a setter to control this. Estimating an intercept from scratch on each update should be well constrained because we'll be starting from the current weights.
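A minimal Scala sketch of the setter idea above, assuming the abstract `algorithm` member from the quoted diff; `StreamingRegressionSketch` and `setFitIntercept` are illustrative names, not the merged API:
~~~
// GeneralizedLinearAlgorithm.setIntercept(addIntercept: Boolean) returns
// this.type, so the streaming wrapper can simply delegate to it.
abstract class StreamingRegressionSketch[
    M <: GeneralizedLinearModel,
    A <: GeneralizedLinearAlgorithm[M]] {

  /** The algorithm to use for updating (as in the quoted diff). */
  val algorithm: A

  /** Toggle intercept estimation on the underlying algorithm. */
  def setFitIntercept(addIntercept: Boolean): this.type = {
    algorithm.setIntercept(addIntercept)
    this
  }
}
~~~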
[GitHub] spark pull request: [SPARK-2523] [SQL] Hadoop table scan bug fixin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1669#issuecomment-50721673 QA results for PR 1669:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17556/consoleFull
[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...
Github user jey commented on the pull request: https://github.com/apache/spark/pull/1680#issuecomment-50721572 I agree with this design; the preforking was basically vestigial and should have been removed. I'll review this PR later this week.
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user freeman-lab commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15628354 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionSuite.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import java.io.File + +import com.google.common.io.Files +import org.apache.commons.io.FileUtils +import org.scalatest.FunSuite +import org.apache.spark.SparkConf +import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext} +import org.apache.spark.mllib.util.{MLStreamingUtils, LinearDataGenerator, LocalSparkContext} +
+import scala.collection.mutable.ArrayBuffer + +class StreamingLinearRegressionSuite extends FunSuite { + + // Assert that two values are equal within tolerance epsilon + def assertEqual(v1: Double, v2: Double, epsilon: Double) { +def errorMessage = v1.toString + " did not equal " + v2.toString +assert(math.abs(v1-v2) <= epsilon, errorMessage) + } + + // Assert that model predictions are correct + def validatePrediction(predictions: Seq[Double], input: Seq[LabeledPoint]) { +val numOffPredictions = predictions.zip(input).count { case (prediction, expected) => + // A prediction is off if the prediction is more than 0.5 away from expected value. + math.abs(prediction - expected.label) > 0.5 +} +// At least 80% of the predictions should be on. +assert(numOffPredictions < input.length / 5) + } + + // Test if we can accurately learn Y = 10*X1 + 10*X2 on streaming data + test("streaming linear regression parameter accuracy") { + +val conf = new SparkConf().setMaster("local").setAppName("streaming test") +val testDir = Files.createTempDir() +val numBatches = 10 +val ssc = new StreamingContext(conf, Seconds(1)) +val data = MLStreamingUtils.loadLabeledPointsFromText(ssc, testDir.toString) +val model = StreamingLinearRegressionWithSGD.start(numFeatures=2, numIterations=50) + +model.trainOn(data) + +ssc.start() + +// write data to a file stream +Thread.sleep(5000) --- End diff -- Might not be =) I added it because I saw it in the streaming test suite for file writing, but without it both tests still pass fine.
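If the fixed sleep proves load-sensitive, a hedged alternative (not part of the PR) is to poll with a deadline; the `eventually` helper below is hand-rolled for illustration, not ScalaTest's trait:
~~~
// Re-check a condition every 100 ms until it holds or a deadline passes.
def eventually(timeoutMillis: Long)(condition: => Boolean): Unit = {
  val deadline = System.currentTimeMillis() + timeoutMillis
  while (!condition) {
    assert(System.currentTimeMillis() < deadline, "condition not met in time")
    Thread.sleep(100)
  }
}

// Hypothetical usage: wait until the first batch file appears instead of
// sleeping a hard-coded 5 seconds.
// eventually(10000) { new java.io.File(testDir, "batch-0").exists() }
~~~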
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user freeman-lab commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15628342 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionSuite.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import java.io.File + +import com.google.common.io.Files +import org.apache.commons.io.FileUtils +import org.scalatest.FunSuite +import org.apache.spark.SparkConf +import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext} +import org.apache.spark.mllib.util.{MLStreamingUtils, LinearDataGenerator, LocalSparkContext} + +import scala.collection.mutable.ArrayBuffer + +class StreamingLinearRegressionSuite extends FunSuite { --- End diff -- I wanted to do this but it doesn't seem to work. The first test always passes, but the second test immediately hits this error. Maybe because we're effectively starting multiple StreamingContexts from the same SparkContext? I'm trying to debug. @tdas ?

java.lang.NullPointerException:
  at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:159)
  at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:66)
  at org.apache.spark.mllib.regression.StreamingLinearRegressionSuite$$anonfun$2.apply$mcV$sp(StreamingLinearRegressionSuite.scala:92)
  at org.apache.spark.mllib.regression.StreamingLinearRegressionSuite$$anonfun$2.apply(StreamingLinearRegressionSuite.scala:89)
  at org.apache.spark.mllib.regression.StreamingLinearRegressionSuite$$anonfun$2.apply(StreamingLinearRegressionSuite.scala:89)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
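One hypothesis worth ruling out (an assumption on my part, not a confirmed diagnosis): the first test's StreamingContext is never stopped before the second test constructs a new one. A per-test teardown sketch:
~~~
// Sketch: give each test its own StreamingContext and stop it before
// the next test constructs a fresh one.
test("streaming linear regression parameter convergence") {
  val conf = new SparkConf().setMaster("local").setAppName("streaming test")
  val ssc = new StreamingContext(conf, Seconds(1))
  try {
    // ... register streams, ssc.start(), assertions ...
  } finally {
    ssc.stop() // by default this also stops the underlying SparkContext
  }
}
~~~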
[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1680#issuecomment-50721351 Why did we have multiple listening processes before? How is the performance of fork() on EC2? I have seen fork() take 200ms in a Xen VM.
[GitHub] spark pull request: add cacheTable guide
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1681#issuecomment-50721262 LGTM, thanks!
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user freeman-lab commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15628281 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingRegression.scala --- @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import org.apache.spark.Logging +import org.apache.spark.annotation.{Experimental, DeveloperApi} +import org.apache.spark.streaming.dstream.DStream + +/** + * :: DeveloperApi :: + * StreamingRegression implements methods for training + * a linear regression model on streaming data, and using it + * for prediction on streaming data. + * + * This class takes as type parameters a GeneralizedLinearModel, + * and a GeneralizedLinearAlgorithm, making it easy to extend to construct + * streaming versions of arbitrary regression analyses. For example usage, + * see StreamingLinearRegressionWithSGD. + * + */ +@DeveloperApi +@Experimental +abstract class StreamingRegression[ +M <: GeneralizedLinearModel, +A <: GeneralizedLinearAlgorithm[M]] extends Logging { + + /** The model to be updated and used for prediction. */ + var model: M + + /** The algorithm to use for updating. */ + val algorithm: A + + /** Return the latest model. */ + def latest(): M = { +model + } + + /** + * Update the model by training on batches of data from a DStream. + * This operation registers a DStream for training the model, + * and updates the model based on every subsequent non-empty + * batch of data from the stream. + * + * @param data DStream containing labeled data + */ + def trainOn(data: DStream[LabeledPoint]) { +data.foreachRDD{ + rdd => +if (rdd.count() > 0) { + model = algorithm.run(rdd, model.weights) + logInfo("Model updated") --- End diff -- great idea!
[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1680#discussion_r15628262 --- Diff: python/pyspark/daemon.py --- @@ -176,13 +120,30 @@ def handle_sigchld(*args): try: while not should_exit(): try: -# Spark tells us to exit by closing stdin -if os.read(0, 512) == '': -shutdown() -except EnvironmentError as err: -if err.errno != EINTR: -shutdown() +ready_fds = select.select([0, listen_sock], [], [])[0] +except select.error as ex: +if ex[0] == 4: +continue +else: raise +if 0 in ready_fds: +# Spark told us to exit by closing stdin +shutdown() +if listen_sock in ready_fds: +sock, addr = listen_sock.accept() +# Launch a worker process +if os.fork() == 0: +listen_sock.close() +try: +worker(sock) +except: +traceback.print_exc() +os._exit(1) +else: +assert should_exit() --- End diff -- Why is this needed? It should exit after the task finishes: should_exit() will be False, then it will set exit_flag to True, and all processes will die.
[GitHub] spark pull request: [SPARK-2752]spark sql cli should not exit when...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1661#issuecomment-50721059 LGTM except some styling issues. Thanks for working on this!
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15628238 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -845,33 +842,15 @@ object DecisionTree extends Serializable with Logging { } } - if (leftTotalCount == 0) { -return new InformationGainStats(0, topImpurity, topImpurity, Double.MinValue, 1) - } - if (rightTotalCount == 0) { -return new InformationGainStats(0, topImpurity, Double.MinValue, topImpurity, 1) - } - - val leftImpurity = strategy.impurity.calculate(leftCounts, leftTotalCount) - val rightImpurity = strategy.impurity.calculate(rightCounts, rightTotalCount) - - val leftWeight = leftTotalCount / (leftTotalCount + rightTotalCount) - val rightWeight = rightTotalCount / (leftTotalCount + rightTotalCount) - - val gain = { -if (level > 0) { - impurity - leftWeight * leftImpurity - rightWeight * rightImpurity -} else { - impurity - leftWeight * leftImpurity - rightWeight * rightImpurity -} - } - val totalCount = leftTotalCount + rightTotalCount + if (totalCount == 0) { +// Return arbitrary prediction. +return new InformationGainStats(0, topImpurity, topImpurity, topImpurity, 0) + } // Sum of count for each label - val leftRightCounts: Array[Double] -= leftCounts.zip(rightCounts) - .map{case (leftCount, rightCount) => leftCount + rightCount} + val leftRightCounts: Array[Double] = leftCounts.zip(rightCounts).map { + case (leftCount, rightCount) => leftCount + rightCount } --- End diff -- It may be more consistent to write this way:
~~~
val leftRightCounts: Array[Double] = leftCounts.zip(rightCounts).map {
  case (leftCount, rightCount) => leftCount + rightCount
}
~~~
[GitHub] spark pull request: [SPARK-2752]spark sql cli should not exit when...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1661#discussion_r15628242 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala --- @@ -288,33 +288,36 @@ private[hive] class SparkSQLCLIDriver extends CliDriver with Logging { out.println(cmd) } - ret = driver.run(cmd).getResponseCode - if (ret != 0) { -driver.close() -return ret - } - - val res = new JArrayList[String]() - - if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_CLI_PRINT_HEADER)) { -// Print the column names. -Option(driver.getSchema.getFieldSchemas).map { fields => - out.println(fields.map(_.getName).mkString("\t")) + try { +ret = driver.run(cmd).getResponseCode +if (ret != 0) { + driver.close() + ãreturn ret } - } +val res = new JArrayList[String]() + +if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_CLI_PRINT_HEADER)) { +ã// Print the column names. +ãOption(driver.getSchema.getFieldSchemas).map { fields => + ãout.println(fields.map(_.getName).mkString("\t")) +ã} + ã} - try { while (!out.checkError() && driver.getResults(res)) { res.foreach(out.println) res.clear() } } catch { -case e:IOException => +case e: IOException => --- End diff -- I think we can simply change `IOException` to `Exception` and remove the next `case e: Exception` clause to make this more concise. Also, the `stringifyException(e)` call can be useful for diagnosing SQL statements/scripts.
[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1680#discussion_r15628219 --- Diff: python/pyspark/daemon.py --- @@ -176,13 +120,30 @@ def handle_sigchld(*args): try: while not should_exit(): try: -# Spark tells us to exit by closing stdin -if os.read(0, 512) == '': -shutdown() -except EnvironmentError as err: -if err.errno != EINTR: -shutdown() +ready_fds = select.select([0, listen_sock], [], [])[0] --- End diff -- select() should have a timeout here; otherwise, after SIGTERM it will not exit until an event happens.
[GitHub] spark pull request: SPARK-2712 - Add a small note to maven doc tha...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1615#issuecomment-50720880 maybe it would make sense to just enhance the existing section and clearly show that you need to run "mvn package" and then "mvn test" with a similar code block to the one in this PR.
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15628197 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -815,20 +822,10 @@ object DecisionTree extends Serializable with Logging { topImpurity: Double): InformationGainStats = { strategy.algo match { case Classification => - var classIndex = 0 - val leftCounts: Array[Double] = new Array[Double](numClasses) - val rightCounts: Array[Double] = new Array[Double](numClasses) - var leftTotalCount = 0.0 - var rightTotalCount = 0.0 - while (classIndex < numClasses) { -val leftClassCount = leftNodeAgg(featureIndex)(splitIndex)(classIndex) -val rightClassCount = rightNodeAgg(featureIndex)(splitIndex)(classIndex) -leftCounts(classIndex) = leftClassCount -leftTotalCount += leftClassCount -rightCounts(classIndex) = rightClassCount -rightTotalCount += rightClassCount -classIndex += 1 - } + val leftCounts: Array[Double] = leftNodeAgg(featureIndex)(splitIndex) + val rightCounts: Array[Double] = rightNodeAgg(featureIndex)(splitIndex) + var leftTotalCount = leftCounts.sum + var rightTotalCount = rightCounts.sum --- End diff -- Those two values could be `val`.
[GitHub] spark pull request: SPARK-2712 - Add a small note to maven doc tha...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1615#issuecomment-50720702 @rxin @javadba - I believe this is covered already in the maven docs: http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15628159 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -612,27 +615,31 @@ object DecisionTree extends Serializable with Logging { agg(aggIndex + labelInt) = agg(aggIndex + labelInt) + 1 } -def updateBinForUnorderedFeature(nodeIndex: Int, featureIndex: Int, arr: Array[Double], -label: Double, agg: Array[Double], rightChildShift: Int) = { +def updateBinForUnorderedFeature( +nodeIndex: Int, +featureIndex: Int, +arr: Array[Double], +label: Double, +agg: Array[Double], +rightChildShift: Int) = { --- End diff -- Your comments in the PR description are very helpful for understanding the indexing of `agg`. Do you mind copying them here?
[GitHub] spark pull request: [SPARK-1812] [WIP] Scala 2.11 support
Github user avati commented on the pull request: https://github.com/apache/spark/pull/1649#issuecomment-50720340 I will split this up into multiple PRs, starting with a small pull request for the akka upgrade to 2.3. However, as Sean mentioned, I guess we are stuck on the fundamental incompatibility with Hadoop 1. Either we will need to forgo Hadoop 1 compatibility, or wait until a -shaded-protobuf version of the akka artifact is released (like akka-2.2.3-shaded-protobuf from org.spark-project.akka), or temporarily forgo compatibility and fix up pom.xml later if/when a shaded-protobuf artifact is released. I would like to hear opinions about this.
[GitHub] spark pull request: Remove unnecessary Code from spark-shell.cmd
Github user techaddict closed the pull request at: https://github.com/apache/spark/pull/666
[GitHub] spark pull request: Remove unnecessary Code from spark-shell.cmd
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/666#issuecomment-50719523 Close this issue.
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627927 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -598,9 +598,12 @@ object DecisionTree extends Serializable with Logging { // Find feature bins for all nodes at a level. val binMappedRDD = input.map(x => findBinsForLevel(x)) -def updateBinForOrderedFeature(arr: Array[Double], agg: Array[Double], nodeIndex: Int, -label: Double, featureIndex: Int) = { - +def updateBinForOrderedFeature( +arr: Array[Double], +agg: Array[Double], +nodeIndex: Int, +label: Double, +featureIndex: Int) = { --- End diff -- Could you help remove `=` since this function returns nothing? (Same for `updateBinForUnorderedFeature`.)
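For context, the Scala distinction the review is pointing at — a sketch, not code from this PR:
~~~
// Procedure syntax: no "=" declares the method as returning Unit.
def updateBin(agg: Array[Double], labelInt: Int) {
  agg(labelInt) += 1
}

// With "=", the result type is inferred from the last expression; here it
// still infers Unit (an assignment's value), but the "=" suggests the
// method computes something, which misleads readers of mutating code.
def updateBinExpr(agg: Array[Double], labelInt: Int) = {
  agg(labelInt) += 1
}
~~~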
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627893 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -48,11 +50,13 @@ object DecisionTreeRunner { case class Params( input: String = null, + dataFormat: String = null, algo: Algo = Classification, numClassesForClassification: Int = 2, --- End diff -- If `numClassesForClassification` is removed, we also need to remove it here.
[GitHub] spark pull request: [SPARK-1740] [PySpark] kill the python worker
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1643#issuecomment-50718956 I will wait for your patch, and think about using PIDs.
[GitHub] spark pull request: SPARK-2098: All Spark processes should support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1256#issuecomment-50718809 QA tests have started for PR 1256. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17564/consoleFull
[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1678#issuecomment-50718818 QA tests have started for PR 1678. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17565/consoleFull
[GitHub] spark pull request: [SPARK-1740] [PySpark] kill the python worker
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1643#issuecomment-50718757 Good question: it's dangerous to mix threads and fork(), since it may cause deadlock in the child process. But in this case, because of the GIL, when fork() happens the monitor thread is blocked, sleeping, or polling; those operations are thread safe, so it should not be a problem.
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627842 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -100,16 +111,57 @@ object DecisionTreeRunner { } def run(params: Params) { + val conf = new SparkConf().setAppName("DecisionTreeRunner") val sc = new SparkContext(conf) // Load training data and cache it. -val examples = MLUtils.loadLabeledPoints(sc, params.input).cache() +val origExamples = params.dataFormat match { + case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache() + case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass = true).cache() +} +// For classification, re-index classes if needed. +val (examples, numClasses) = params.algo match { + case Classification => { +// classCounts: class --> # examples in class +val classCounts = origExamples.map(_.label).countByValue --- End diff -- add `()` to `countByValue` because it is an action (that triggers I/O).
[GitHub] spark pull request: SPARK-2098: All Spark processes should support...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1256#issuecomment-50718557 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627787 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -100,16 +111,57 @@ object DecisionTreeRunner { } def run(params: Params) { + val conf = new SparkConf().setAppName("DecisionTreeRunner") val sc = new SparkContext(conf) // Load training data and cache it. -val examples = MLUtils.loadLabeledPoints(sc, params.input).cache() +val origExamples = params.dataFormat match { + case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache() + case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass = true).cache() +} +// For classification, re-index classes if needed. +val (examples, numClasses) = params.algo match { + case Classification => { +// classCounts: class --> # examples in class +val classCounts = origExamples.map(_.label).countByValue +val numClasses = classCounts.size +// classIndex: class --> index in 0,...,numClasses-1 +val classIndex = { + if (classCounts.keySet != Set[Double](0.0, 1.0)) { +classCounts.keys.toList.sorted.zipWithIndex.toMap + } else { +Map[Double, Int]() + } +} +val examples = { + if (classIndex.isEmpty) { +origExamples + } else { +origExamples.map(lp => LabeledPoint(classIndex(lp.label), lp.features)) + } +} +println(s"numClasses = $numClasses.") +println(s"Per-class example fractions, counts:") +println(s"Class\tFrac\tCount") +classCounts.keys.toList.sorted.foreach(c => { + val frac = classCounts(c) / (0.0 + examples.count()) --- End diff -- `examples.count()` is the same as `classCounts.values.sum`. (This is inside the foreach loop.)
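A sketch of the suggested cleanup (illustrative, combining this comment with the closure-style notes later in this thread): compute the total once, outside the loop.
~~~
// countByValue() returns a local Map, so no extra Spark job is needed
// to get the per-class total; hoist it out of the loop.
val classCounts = origExamples.map(_.label).countByValue()
val totalCount = classCounts.values.sum.toDouble
val sortedClasses = classCounts.keys.toList.sorted // computed once, reused
sortedClasses.foreach { c =>
  println(s"$c\t${classCounts(c) / totalCount}\t${classCounts(c)}")
}
~~~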
[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1678#issuecomment-50718425 Jenkins, retest this please.
[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1635#issuecomment-50718130 QA results for PR 1635:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17558/consoleFull
[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1678#issuecomment-50718100 QA results for PR 1678:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17553/consoleFull
[GitHub] spark pull request: [SPARK-1812] [WIP] Scala 2.11 support
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/1649#issuecomment-50718152 Hey Anand, this looks like a very nice effort! It would be very convenient if you could break this pull request into smaller subtasks; for example, upgrading akka could be one subtask, and so on. That way it is easier for us to review/test and then merge. If, however, you think you will not have the time for that, I am happy to do it for you. We would like to have this pretty soon - breaking up the PR will make reviewing and testing really easy for us.
[GitHub] spark pull request: SPARK-2098: All Spark processes should support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1256#issuecomment-50718143 QA results for PR 1256:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class SparkConf(loadDefaults: Boolean, fileName: Option[String])
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17555/consoleFull
[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1671#issuecomment-50718111 QA results for PR 1671:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class HashingTF(val numFeatures: Int) extends Serializable {
  class IDF {
  class DocumentFrequencyAggregator extends Serializable {
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17557/consoleFull
[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1598#issuecomment-50717628 This feature would slow down normal access (by attribute or position), so I did not put it in. Users can still access fields with special names (such as keywords) by position.
[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/1578#issuecomment-50717700 localBlocksToFetch is an instance of ArrayBuffer, not a concurrent queue, and it is used from only one thread, right?
[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1598#issuecomment-50717656 QA tests have started for PR 1598. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17563/consoleFull
[GitHub] spark pull request: add cacheTable guide
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1681#issuecomment-50717598 QA tests have started for PR 1681. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17562/consoleFull
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627594 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -100,16 +111,57 @@ object DecisionTreeRunner { } def run(params: Params) { + val conf = new SparkConf().setAppName("DecisionTreeRunner") val sc = new SparkContext(conf) // Load training data and cache it. -val examples = MLUtils.loadLabeledPoints(sc, params.input).cache() +val origExamples = params.dataFormat match { + case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache() + case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass = true).cache() +} +// For classification, re-index classes if needed. +val (examples, numClasses) = params.algo match { + case Classification => { +// classCounts: class --> # examples in class +val classCounts = origExamples.map(_.label).countByValue +val numClasses = classCounts.size +// classIndex: class --> index in 0,...,numClasses-1 +val classIndex = { + if (classCounts.keySet != Set[Double](0.0, 1.0)) { +classCounts.keys.toList.sorted.zipWithIndex.toMap + } else { +Map[Double, Int]() + } +} +val examples = { + if (classIndex.isEmpty) { +origExamples + } else { +origExamples.map(lp => LabeledPoint(classIndex(lp.label), lp.features)) + } +} +println(s"numClasses = $numClasses.") +println(s"Per-class example fractions, counts:") +println(s"Class\tFrac\tCount") +classCounts.keys.toList.sorted.foreach(c => { --- End diff -- `classCounts.keys.toList.sorted` is used twice. In Spark, `foreach(c => {` is often written as `foreach { c =>`
[GitHub] spark pull request: [SPARK-2752]spark sql cli should not exit when...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1661#discussion_r15627597 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala --- @@ -288,33 +288,36 @@ private[hive] class SparkSQLCLIDriver extends CliDriver with Logging { out.println(cmd) } - ret = driver.run(cmd).getResponseCode - if (ret != 0) { -driver.close() -return ret - } - - val res = new JArrayList[String]() - - if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_CLI_PRINT_HEADER)) { -// Print the column names. -Option(driver.getSchema.getFieldSchemas).map { fields => - out.println(fields.map(_.getName).mkString("\t")) + try { +ret = driver.run(cmd).getResponseCode +if (ret != 0) { + driver.close() + ãreturn ret --- End diff -- And they cause compilation failure.
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627537 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -100,16 +111,57 @@ object DecisionTreeRunner { } def run(params: Params) { + val conf = new SparkConf().setAppName("DecisionTreeRunner") val sc = new SparkContext(conf) // Load training data and cache it. -val examples = MLUtils.loadLabeledPoints(sc, params.input).cache() +val origExamples = params.dataFormat match { + case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache() + case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass = true).cache() +} +// For classification, re-index classes if needed. +val (examples, numClasses) = params.algo match { + case Classification => { +// classCounts: class --> # examples in class +val classCounts = origExamples.map(_.label).countByValue +val numClasses = classCounts.size +// classIndex: class --> index in 0,...,numClasses-1 +val classIndex = { --- End diff -- Would `classIndexMap` sound better?
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627527 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -100,16 +111,57 @@ object DecisionTreeRunner { } def run(params: Params) { + val conf = new SparkConf().setAppName("DecisionTreeRunner") val sc = new SparkContext(conf) // Load training data and cache it. -val examples = MLUtils.loadLabeledPoints(sc, params.input).cache() +val origExamples = params.dataFormat match { + case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache() + case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass = true).cache() +} +// For classification, re-index classes if needed. +val (examples, numClasses) = params.algo match { + case Classification => { +// classCounts: class --> # examples in class +val classCounts = origExamples.map(_.label).countByValue +val numClasses = classCounts.size +// classIndex: class --> index in 0,...,numClasses-1 +val classIndex = { + if (classCounts.keySet != Set[Double](0.0, 1.0)) { --- End diff -- `[Double]` is not necessary
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627524 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -100,16 +111,57 @@ object DecisionTreeRunner { } def run(params: Params) { + val conf = new SparkConf().setAppName("DecisionTreeRunner") val sc = new SparkContext(conf) // Load training data and cache it. -val examples = MLUtils.loadLabeledPoints(sc, params.input).cache() +val origExamples = params.dataFormat match { + case "dense" => MLUtils.loadLabeledPoints(sc, params.input).cache() + case "libsvm" => MLUtils.loadLibSVMFile(sc, params.input, multiclass = true).cache() --- End diff -- `multiclass` was deprecated (today). Should be safe to remove it.
[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1680#issuecomment-50717026 @aarondav it would be great if you could take a look and review this.
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627514 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -69,25 +73,32 @@ object DecisionTreeRunner { opt[Int]("maxDepth") .text(s"max depth of the tree, default: ${defaultParams.maxDepth}") .action((x, c) => c.copy(maxDepth = x)) - opt[Int]("numClassesForClassification") -.text(s"number of classes for classification, " - + s"default: ${defaultParams.numClassesForClassification}") -.action((x, c) => c.copy(numClassesForClassification = x)) opt[Int]("maxBins") .text(s"max number of bins, default: ${defaultParams.maxBins}") .action((x, c) => c.copy(maxBins = x)) + opt[Double]("fracTest") +.text(s"fraction of data to hold out for testing, default: ${defaultParams.fracTest}") +.action((x, c) => c.copy(fracTest = x)) arg[String]("") .text("input paths to labeled examples in dense format (label,f0 f1 f2 ...)") .required() .action((x, c) => c.copy(input = x)) + arg[String]("") --- End diff -- Maybe we can make data format an optional argument (default to LibSVM). One reason is that the dense format is deprecated in v1.1. We support either LIBSVM format or MLlib's own format `(label,[v0,v1,...])` or `(label,[i0,i1,...],[vi0,vi1,...])`. Also, it is a little weird to have this as the second required argument after input data.
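A sketch of the optional-flag variant, as a fragment to drop into the existing scopt parser; the flag name and the libsvm default are illustrative assumptions, not the committed change:
~~~
// Assumes Params declares dataFormat: String = "libsvm" as its default,
// so omitting the flag selects LIBSVM input.
opt[String]("dataFormat")
  .text("data format: libsvm (default) or dense")
  .action((x, c) => c.copy(dataFormat = x))
~~~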
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627518 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -69,25 +73,32 @@ object DecisionTreeRunner { opt[Int]("maxDepth") .text(s"max depth of the tree, default: ${defaultParams.maxDepth}") .action((x, c) => c.copy(maxDepth = x)) - opt[Int]("numClassesForClassification") -.text(s"number of classes for classification, " - + s"default: ${defaultParams.numClassesForClassification}") -.action((x, c) => c.copy(numClassesForClassification = x)) opt[Int]("maxBins") .text(s"max number of bins, default: ${defaultParams.maxBins}") .action((x, c) => c.copy(maxBins = x)) + opt[Double]("fracTest") +.text(s"fraction of data to hold out for testing, default: ${defaultParams.fracTest}") +.action((x, c) => c.copy(fracTest = x)) arg[String]("") .text("input paths to labeled examples in dense format (label,f0 f1 f2 ...)") .required() .action((x, c) => c.copy(input = x)) + arg[String]("") +.text("data format: dense/libsvm") +.required() +.action((x, c) => c.copy(dataFormat = x)) checkConfig { params => -if (params.algo == Classification && -(params.impurity == Gini || params.impurity == Entropy)) { - success -} else if (params.algo == Regression && params.impurity == Variance) { - success +if (params.fracTest < 0 || params.fracTest > 1) { + failure(s"fracTest ${params.fracTest} value incorrect; should be in [0,1].") --- End diff -- `(0, 1)`
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627517 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -69,25 +73,32 @@ object DecisionTreeRunner { opt[Int]("maxDepth") .text(s"max depth of the tree, default: ${defaultParams.maxDepth}") .action((x, c) => c.copy(maxDepth = x)) - opt[Int]("numClassesForClassification") -.text(s"number of classes for classification, " - + s"default: ${defaultParams.numClassesForClassification}") -.action((x, c) => c.copy(numClassesForClassification = x)) opt[Int]("maxBins") .text(s"max number of bins, default: ${defaultParams.maxBins}") .action((x, c) => c.copy(maxBins = x)) + opt[Double]("fracTest") +.text(s"fraction of data to hold out for testing, default: ${defaultParams.fracTest}") +.action((x, c) => c.copy(fracTest = x)) arg[String]("") .text("input paths to labeled examples in dense format (label,f0 f1 f2 ...)") .required() .action((x, c) => c.copy(input = x)) + arg[String]("") +.text("data format: dense/libsvm") +.required() +.action((x, c) => c.copy(dataFormat = x)) checkConfig { params => -if (params.algo == Classification && -(params.impurity == Gini || params.impurity == Entropy)) { - success -} else if (params.algo == Regression && params.impurity == Variance) { - success +if (params.fracTest < 0 || params.fracTest > 1) { --- End diff -- `... <= 0 || ... >= 1`?
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1673#discussion_r15627507 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -69,25 +73,32 @@ object DecisionTreeRunner { opt[Int]("maxDepth") .text(s"max depth of the tree, default: ${defaultParams.maxDepth}") .action((x, c) => c.copy(maxDepth = x)) - opt[Int]("numClassesForClassification") -.text(s"number of classes for classification, " - + s"default: ${defaultParams.numClassesForClassification}") -.action((x, c) => c.copy(numClassesForClassification = x)) opt[Int]("maxBins") .text(s"max number of bins, default: ${defaultParams.maxBins}") .action((x, c) => c.copy(maxBins = x)) + opt[Double]("fracTest") +.text(s"fraction of data to hold out for testing, default: ${defaultParams.fracTest}") +.action((x, c) => c.copy(fracTest = x)) arg[String]("") .text("input paths to labeled examples in dense format (label,f0 f1 f2 ...)") --- End diff -- If we take LIBSVM format, we should update the doc here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1639 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2497] Included checks for module symbol...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1463 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1673#issuecomment-50716977 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SQL][SPARK-2212]Hash Outer Join
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1147#issuecomment-50716934 QA tests have started for PR 1147. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17560/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15627468 --- Diff: python/pyspark/mllib/regression.py --- @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase): True """ - class LinearRegressionWithSGD(object): @classmethod -def train(cls, data, iterations=100, step=1.0, - miniBatchFraction=1.0, initialWeights=None): -"""Train a linear regression model on the given data.""" +def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None, + intercept=False, miniBatchFraction=1.0, initialWeights=None): +""" +Train a linear regression model on the given data. + +@param data: The training data. +@param iterations:The number of iterations (default: 100). +@param step: The step parameter used in SGD + (default: 1.0). +@param regParam: The regularizer parameter (default: 1.0). +@param regType: The type of regularizer used for training + our model. + Allowed values: "l1" for using L1Updater, + "l2" for using + SquaredL2Updater, + "none" for no regularizer. + (default: None) --- End diff -- It may be better to set the default to `"none"` and map `None` to `"none"` in the implementation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15627480 --- Diff: python/pyspark/mllib/regression.py --- @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase): True """ - class LinearRegressionWithSGD(object): @classmethod -def train(cls, data, iterations=100, step=1.0, - miniBatchFraction=1.0, initialWeights=None): -"""Train a linear regression model on the given data.""" +def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None, + intercept=False, miniBatchFraction=1.0, initialWeights=None): +""" +Train a linear regression model on the given data. + +@param data: The training data. +@param iterations:The number of iterations (default: 100). +@param step: The step parameter used in SGD + (default: 1.0). +@param regParam: The regularizer parameter (default: 1.0). +@param regType: The type of regularizer used for training + our model. + Allowed values: "l1" for using L1Updater, + "l2" for using + SquaredL2Updater, + "none" for no regularizer. + (default: None) +@param intercept: Boolean parameter which indicates the use + or not of the augmented representation for + training data (i.e. whether bias features + are activated or not). +@param miniBatchFraction: Fraction of data to be used for each SGD + iteration. +@param initialWeights:The initial weights (default: None). +""" sc = data.context -train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD( -d._jrdd, iterations, step, miniBatchFraction, i) +if regType is None: +train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD( +d._jrdd, iterations, step, regParam, "none", intercept, miniBatchFraction, i) +elif regType == "l2" or regType == "l1" or regType == "none": +train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD( +d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i) +else: +raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " + + "using the following string values: [l1, l2, none].") return _regression_train_wrapper(sc, train_f, LinearRegressionModel, data, initialWeights) - --- End diff -- ditto: please keep this empty line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15627471 --- Diff: python/pyspark/mllib/regression.py --- @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase): True """ - class LinearRegressionWithSGD(object): @classmethod -def train(cls, data, iterations=100, step=1.0, - miniBatchFraction=1.0, initialWeights=None): -"""Train a linear regression model on the given data.""" +def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None, + intercept=False, miniBatchFraction=1.0, initialWeights=None): +""" +Train a linear regression model on the given data. + +@param data: The training data. +@param iterations:The number of iterations (default: 100). +@param step: The step parameter used in SGD + (default: 1.0). +@param regParam: The regularizer parameter (default: 1.0). +@param regType: The type of regularizer used for training + our model. + Allowed values: "l1" for using L1Updater, + "l2" for using + SquaredL2Updater, + "none" for no regularizer. + (default: None) +@param intercept: Boolean parameter which indicates the use + or not of the augmented representation for + training data (i.e. whether bias features + are activated or not). +@param miniBatchFraction: Fraction of data to be used for each SGD + iteration. +@param initialWeights:The initial weights (default: None). +""" sc = data.context -train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD( --- End diff -- To avoid having the long command twice, you can use
~~~
if regType is None:
    regType = "none"
if regType in {"l2", "l1", "none"}:
    train_f = ...
else:
    raise ...
~~~
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15627451 --- Diff: python/pyspark/mllib/regression.py --- @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase): True """ - class LinearRegressionWithSGD(object): @classmethod -def train(cls, data, iterations=100, step=1.0, - miniBatchFraction=1.0, initialWeights=None): -"""Train a linear regression model on the given data.""" +def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None, --- End diff -- To be safe, we shouldn't change the order of arguments, we can append new arguments `regParam`, `regType`, and `intercept` at the end. So if user used `train(data, 100, 1.0, 1.0, np.zeros(n))`, it still works in the new version. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15627458 --- Diff: python/pyspark/mllib/regression.py --- @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase): True """ - class LinearRegressionWithSGD(object): @classmethod -def train(cls, data, iterations=100, step=1.0, - miniBatchFraction=1.0, initialWeights=None): -"""Train a linear regression model on the given data.""" +def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None, + intercept=False, miniBatchFraction=1.0, initialWeights=None): +""" +Train a linear regression model on the given data. + +@param data: The training data. --- End diff -- It is not necessary to align the doc, especially when limited by 78 characters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15627440 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -247,16 +248,24 @@ class PythonMLLibAPI extends Serializable { dataBytesJRDD: JavaRDD[Array[Byte]], numIterations: Int, stepSize: Double, + regParam: Double, + regType: String, + intercept: Boolean, miniBatchFraction: Double, initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = { +val lrAlg = new LinearRegressionWithSGD() +lrAlg.setIntercept(intercept) +lrAlg.optimizer + .setNumIterations(numIterations) + .setRegParam(regParam) + .setStepSize(stepSize) +if (regType == "l2") + lrAlg.optimizer.setUpdater(new SquaredL2Updater) +else if (regType == "l1") + lrAlg.optimizer.setUpdater(new L1Updater) --- End diff -- It is safer to add
~~~
else if (regType != "none")
  throw new IllegalArgumentException("...")
~~~
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
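Written as an exhaustive match, the suggestion might look like the following sketch (`lrAlg` and `regType` are the names from the diff above; the error message wording is illustrative):

~~~
import org.apache.spark.mllib.optimization.{L1Updater, SquaredL2Updater}

// Unknown regType values now fail fast instead of silently training
// with the default (unregularized) updater.
regType match {
  case "l2"   => lrAlg.optimizer.setUpdater(new SquaredL2Updater)
  case "l1"   => lrAlg.optimizer.setUpdater(new L1Updater)
  case "none" => // keep the default updater, i.e. no regularization
  case other  =>
    throw new IllegalArgumentException(s"Invalid value for regType: $other")
}
~~~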
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15627450 --- Diff: python/pyspark/mllib/regression.py --- @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase): True """ - --- End diff -- We use two empty lines to separate methods in pyspark. (I don't know the exact reason ...) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1598#issuecomment-50716626 Sure, sounds good. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2766: ScalaReflectionSuite throws an IllegalArgumentException in JDK 6
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/1683 SPARK-2766: ScalaReflectionSuite throws an IllegalArgumentException in JDK 6 You can merge this pull request into a Git repository by running: $ git pull https://github.com/witgo/spark SPARK-2766 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1683.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1683 commit ff0ec1fc188a81af5702deb82005578b8027c13b Author: GuoQiang Li Date: 2014-07-31T05:59:38Z ScalaReflectionSuite throws an IllegalArgumentException in JDK 6 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2752]spark sql cli should not exit when...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1661#discussion_r15627250 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala --- @@ -288,33 +288,36 @@ private[hive] class SparkSQLCLIDriver extends CliDriver with Logging { out.println(cmd) } - ret = driver.run(cmd).getResponseCode - if (ret != 0) { -driver.close() -return ret - } - - val res = new JArrayList[String]() - - if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_CLI_PRINT_HEADER)) { -// Print the column names. -Option(driver.getSchema.getFieldSchemas).map { fields => - out.println(fields.map(_.getName).mkString("\t")) + try { +ret = driver.run(cmd).getResponseCode +if (ret != 0) { + driver.close() + ãreturn ret --- End diff -- Hey, line 295, 300 to 304 contain illegal *Chinese* full-width space characters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1671#issuecomment-50715814 QA tests have started for PR 1671. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17557/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1635#issuecomment-50715812 QA tests have started for PR 1635. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17558/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1671#discussion_r15627165 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -0,0 +1,194 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors} +import org.apache.spark.mllib.rdd.RDDFunctions._ +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * Inverse document frequency (IDF). + * The standard formulation is used: `idf = log((m + 1) / (d(t) + 1))`, where `m` is the total + * number of documents and `d(t)` is the number of documents that contain term `t`. + */ +@Experimental +class IDF { + + // TODO: Allow different IDF formulations. + + private var brzIdf: BDV[Double] = _ + + /** + * Computes the inverse document frequency. + * @param dataset an RDD of term frequency vectors + */ + def fit(dataset: RDD[Vector]): this.type = { +brzIdf = dataset.treeAggregate(new IDF.DocumentFrequencyAggregator)( + seqOp = (df, v) => df.add(v), + combOp = (df1, df2) => df1.merge(df2) +).idf() +this + } + + /** + * Computes the inverse document frequency. + * @param dataset a JavaRDD of term frequency vectors + */ + def fit(dataset: JavaRDD[Vector]): this.type = { --- End diff -- Yes. In the Java test suite, I put 'idf.fit(...).transform(...)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
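The chained call mengxr mentions looks roughly like this sketch (assuming the HashingTF and IDF APIs introduced by this PR, and an existing SparkContext `sc`):

~~~
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Term-frequency vectors for two toy documents via the hashing trick.
val documents: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("spark", "mllib", "spark"),
  Seq("hashing", "tf", "idf")))
val tf: RDD[Vector] = new HashingTF().transform(documents)

// fit() returns this.type, so it chains directly into transform().
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
~~~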
[GitHub] spark pull request: [SPARK-2523] [SQL] Hadoop table scan bug fixin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1669#issuecomment-50715565 QA tests have started for PR 1669. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17556/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1635#issuecomment-50715523 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...
Github user sarutak commented on a diff in the pull request: https://github.com/apache/spark/pull/1578#discussion_r15627125 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala --- @@ -199,15 +199,22 @@ object BlockFetcherIterator { // Get the local blocks while remote blocks are being fetched. Note that it's okay to do // these all at once because they will just memory-map some files, so they won't consume // any memory that might exceed our maxBytesInFlight - for (id <- localBlocksToFetch) { -getLocalFromDisk(id, serializer) match { - case Some(iter) => { -// Pass 0 as size since it's not in flight -results.put(new FetchResult(id, 0, () => iter)) -logDebug("Got local block " + id) - } - case None => { -throw new BlockException(id, "Could not get block " + id + " from local machine") + var fetchIndex = 0 + try { +for (id <- localBlocksToFetch) { + + // getLocalFromDisk never return None but throws BlockException + val iter = getLocalFromDisk(id, serializer).get + // Pass 0 as size since it's not in flight + results.put(new FetchResult(id, 0, () => iter)) + fetchIndex += 1 + logDebug("Got local block " + id) +} + } catch { +case e: Exception => { + logError(s"Error occurred while fetching local blocks", e) + for (id <- localBlocksToFetch.drop(fetchIndex)) { +results.put(new FetchResult(id, -1, null)) --- End diff -- Thank you for your comment, @mateiz.

> I wouldn't do drop and such on a ConcurrentQueue, since it might drop stuff other threads were adding. Just do a results.put on the failed block and don't worry about dropping other ones. You can actually move the try/catch into the for loop and add a "return" at the bottom of the catch after adding this failing FetchResult.

But if it returns from getLocalBlocks immediately, the rest of the FetchResults are never put into results, and we wait on results.take() in the next() method forever, right? results is an instance of LinkedBlockingQueue, and its take() method is a blocking method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
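sarutak's deadlock concern hinges on LinkedBlockingQueue semantics; here is a standalone sketch (a toy example, not Spark code) of the behavior:

~~~
import java.util.concurrent.LinkedBlockingQueue

// take() blocks until an element is available, so the consumer hangs unless
// one result is eventually enqueued per expected block. This is why even
// failed local fetches must be put into the queue before returning.
object BlockingTakeDemo extends App {
  val results = new LinkedBlockingQueue[String]()
  val producer = new Thread(new Runnable {
    def run(): Unit = {
      Thread.sleep(100)
      results.put("FetchResult(block-1, failed)") // omit this put and take() below waits forever
    }
  })
  producer.start()
  println(results.take()) // blocks here until the producer's put()
}
~~~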
[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1677#issuecomment-50715402 Added "closes" comment --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SQL][SPARK-2212]Hash Outer Join
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1147#issuecomment-50715224 QA results for PR 1147:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  trait BinaryRepeatableIteratorNode extends BinaryNode {
  case class HashOuterJoin(
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17551/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2098: All Spark processes should support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1256#issuecomment-50715208 QA tests have started for PR 1256. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17555/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Spark 2017
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1682#issuecomment-50715124 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2523] [SQL] Hadoop table scan bug fixin...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1669#issuecomment-50715127 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1677#issuecomment-50715107 QA results for PR 1677:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17552/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1648#discussion_r15627025 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -127,26 +110,15 @@ class HadoopRDD[K, V]( private val createTime = new Date() // Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads. - protected def getJobConf(): JobConf = { -val conf: Configuration = broadcastedConf.value.value -if (conf.isInstanceOf[JobConf]) { - // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it. - conf.asInstanceOf[JobConf] -} else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) { - // getJobConf() has been called previously, so there is already a local cache of the JobConf - // needed by this RDD. - HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf] -} else { - // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the - // local process. The local cache is accessed through HadoopRDD.putCachedMetadata(). - // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects. - // Synchronize to prevent ConcurrentModificationException (Spark-1097, Hadoop-10456). - HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized { -val newJobConf = new JobConf(conf) -initLocalJobConfFuncOpt.map(f => f(newJobConf)) -HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf) -newJobConf - } + protected def createJobConf(): JobConf = { +val conf: Configuration = serializableConf.value --- End diff -- It is because RDD objects are not reused at all. Each task gets its own deserialized copy of the HadoopRDD and the conf. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2497] Included checks for module symbol...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1463#issuecomment-50715047 Thanks @ScrapCodes I think I understand this all now. I'll merge this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1648#discussion_r15626994 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -127,26 +110,15 @@ class HadoopRDD[K, V]( private val createTime = new Date() // Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads. - protected def getJobConf(): JobConf = { -val conf: Configuration = broadcastedConf.value.value -if (conf.isInstanceOf[JobConf]) { - // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it. - conf.asInstanceOf[JobConf] -} else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) { - // getJobConf() has been called previously, so there is already a local cache of the JobConf - // needed by this RDD. - HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf] -} else { - // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the - // local process. The local cache is accessed through HadoopRDD.putCachedMetadata(). - // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects. - // Synchronize to prevent ConcurrentModificationException (Spark-1097, Hadoop-10456). - HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized { -val newJobConf = new JobConf(conf) -initLocalJobConfFuncOpt.map(f => f(newJobConf)) -HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf) -newJobConf - } + protected def createJobConf(): JobConf = { +val conf: Configuration = serializableConf.value --- End diff -- Is this guaranteed to return a new copy of the conf for every partition or something? Because otherwise I'm not sure I see why we can safely remove the lock. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1678#issuecomment-50714957 QA tests have started for PR 1678. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17553/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Spark 2017
GitHub user carlosfuertes opened a pull request: https://github.com/apache/spark/pull/1682 Spark 2017 I address here issues SPARK-2017 and SPARK-2016 by serving the data for the tables as JSON, for further rendering in the Spark UI web interface. The main addition is exposing paths that serve the JSON data:
/stages/stage/tasks/json/?id=nnn
/storage/json
/storage/rdd/workers/json?id=nnn
/storage/rdd/blocks/json?id=nnn
and using JavaScript to build the tables from an AJAX request for that JSON. You can merge this pull request into a Git repository by running: $ git pull https://github.com/carlosfuertes/spark SPARK-2017 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1682.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1682
commit 458cae4ab065a0e173e0f9dd2c075addafabd9a7 Author: carlosfuertes Date: 2014-07-28T05:24:41Z json output for storage/rdd
commit 40a49c36516a261936fd8d591ec864646ec3f1ae Author: carlosfuertes Date: 2014-07-28T06:12:11Z small refactoring to keep code dry
commit 7e6ce37a601f7b58ab25351c366f213a7e159cb7 Author: carlosfuertes Date: 2014-07-29T05:54:18Z storage tab is using completely JSON to render all data
commit 3d5f1ddf210d07821e2ba6177a04f338972e7109 Author: carlosfuertes Date: 2014-07-29T05:55:08Z Merge remote-tracking branch 'upstream/master' into SPARK-2017
commit 469320603e1aca5c10fc59ff3fe3443ad9f4c9ce Author: carlosfuertes Date: 2014-07-29T06:08:56Z made code pass style guidelines
commit f4ebcdbcca0ecefb4f4c990c35810a388c8e7bc0 Author: carlosfuertes Date: 2014-07-29T12:27:41Z added expanding arrays in JSON for html tables into javascript function
commit f20560041468f1c0a59051cc960a94c58ebc138d Author: carlosfuertes Date: 2014-07-30T02:29:56Z Merge remote-tracking branch 'upstream/master' into SPARK-2017
commit 84d8628df35f01bc9ff818e0026a8cda6140bdf4 Author: carlosfuertes Date: 2014-07-30T05:11:21Z adding tasks table under stage Spark UI from JSON. First pass
commit 6e787f60c670db77a2e1c7be3980ca7493b9ede9 Author: carlosfuertes Date: 2014-07-30T06:52:25Z added sorttable custom key to json
commit 11e33836bc89932b1d5a4919394170e030329b57 Author: carlosfuertes Date: 2014-07-30T13:09:30Z fixed sorttable to sort tables again
commit 7b0bde38371fe0e12d4cee52336ead6dff819d37 Author: carlosfuertes Date: 2014-07-31T02:31:47Z Merge remote-tracking branch 'upstream/master' into SPARK-2017
commit 5aa4b283010a1eaaf48d09399ed16bf5aafb4726 Author: carlosfuertes Date: 2014-07-31T05:19:38Z added a proper JSON handling of sorttable_custom key
commit 384ff1a567eadeefa72d28d5fedd9260a1874d3f Author: carlosfuertes Date: 2014-07-31T05:22:27Z deleted commented code
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1639#issuecomment-50714857 Alright, I've merged this. Thanks for the review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2497] Included checks for module symbol...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1463#issuecomment-50714816 Hey @ScrapCodes, sorry, I was misunderstanding what you were saying. You are just trying to reflect on the symbol as both a class and a module. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1679#issuecomment-50714796 Were you able to test the performance characteristics of this versus the old stuff? Was this indeed the main cause of the live listener bus overflowing, or is that still a problem? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2171: Demonstrate and explain how to use...
Github user Artjom-Metro closed the pull request at: https://github.com/apache/spark/pull/1304 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2171: Demonstrate and explain how to use...
Github user Artjom-Metro commented on the pull request: https://github.com/apache/spark/pull/1304#issuecomment-50714666 Okay, I'm sorry for the delay. It took some time to get access to a server; I hope to publish the result within the next 2 weeks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1648#issuecomment-50714647 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1639#issuecomment-50714462 QA results for PR 1639:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17550/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1598#issuecomment-50714300 @mateiz We could also provide a dict interface as a fallback, so users can access these fields just like a dict, e.g. row["pass"]? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2134: Report metrics before application ...
Github user rahulsinghaliitd commented on a diff in the pull request: https://github.com/apache/spark/pull/1076#discussion_r15626682 --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala --- @@ -154,6 +154,8 @@ private[spark] class Master( } override def postStop() { +masterMetricsSystem.report() +applicationMetricsSystem.report() --- End diff -- Staleness will depend on the metrics system configuration (polling period), although the metrics being tracked here are not very critical, especially when the Master/Worker is being killed. How about I leave the calls in just for completeness' sake? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: add cacheTable guide
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1681#issuecomment-50713584 QA results for PR 1681:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17549/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2664. Deal with `--conf` options in spar...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1665#issuecomment-50713560 Looks good Sandy! One small question. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2664. Deal with `--conf` options in spar...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1665#discussion_r15626537 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -184,7 +184,7 @@ object SparkSubmit { OptionAssigner(args.archives, YARN, CLIENT, sysProp = "spark.yarn.dist.archives"), // Yarn cluster only - OptionAssigner(args.name, YARN, CLUSTER, clOption = "--name", sysProp = "spark.app.name"), --- End diff -- Why does this one change? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP] [SPARK-2764] Simplify daemon.py process ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1680#issuecomment-50713364 QA results for PR 1680:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17548/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2340] Resolve event logging and History...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1280 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2758] UnionRDD's UnionPartition should ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1675 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Required AM memory is "amMem", not "args.amMem...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1494 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2134: Report metrics before application ...
Github user rahulsinghaliitd commented on a diff in the pull request: https://github.com/apache/spark/pull/1076#discussion_r15626384 --- Diff: core/src/main/scala/org/apache/spark/metrics/sink/Sink.scala --- @@ -20,4 +20,5 @@ package org.apache.spark.metrics.sink private[spark] trait Sink { def start: Unit def stop: Unit + def report: Unit --- End diff -- Ouch, my apologies. I should have compiled this before pushing the new change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
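Widening the Sink trait with report means every concrete sink fails to compile until it implements the new member. A hypothetical, simplified sink showing the shape of the fix (the class name and reporter choice are illustrative, not from the PR; assumes it lives in Spark's org.apache.spark.metrics.sink package alongside the trait):

~~~
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{ConsoleReporter, MetricRegistry}

// report() does a one-shot flush so metrics can be emitted right before
// shutdown, which is the point of SPARK-2134.
private[spark] class ConsoleSinkSketch(registry: MetricRegistry) extends Sink {
  private val reporter = ConsoleReporter.forRegistry(registry).build()
  override def start: Unit = reporter.start(10, TimeUnit.SECONDS)
  override def stop: Unit = reporter.stop()
  override def report: Unit = reporter.report()
}
~~~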
[GitHub] spark pull request: SPARK-2134: Report metrics before application ...
Github user rahulsinghaliitd commented on a diff in the pull request: https://github.com/apache/spark/pull/1076#discussion_r15626358 --- Diff: core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala --- @@ -91,6 +91,10 @@ private[spark] class MetricsSystem private (val instance: String, sinks.foreach(_.stop) } + def report() { --- End diff -- Ouch, my apologies. I should have compiled this before pushing the new change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1677#issuecomment-50712517 QA tests have started for PR 1677. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17552/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1677#issuecomment-50712252 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---