[GitHub] spark pull request: remove not used test in src/main
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1397#issuecomment-48869513

That one already has a main method. Perhaps best to leave this here since it just starts a connection manager.
[GitHub] spark pull request: remove not used test in src/main
Github user adrian-wang commented on the pull request: https://github.com/apache/spark/pull/1397#issuecomment-48869638

OK, I'll close this. Thank you Reynold!
[GitHub] spark pull request: remove not used test in src/main
Github user adrian-wang closed the pull request at: https://github.com/apache/spark/pull/1397
[GitHub] spark pull request: [SPARK-2467] Revert SparkBuild to publish-loca...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/1398

[SPARK-2467] Revert SparkBuild to publish-local to both .m2 and .ivy2.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ueshin/apache-spark issues/SPARK-2467
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1398.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1398

commit 7f01d587259221c5333880578e6658e6534c52c4
Author: Takuya UESHIN ues...@happy-camper.st
Date: 2014-07-14T06:28:44Z
Revert SparkBuild to publish-local to both .m2 and .ivy2.
[GitHub] spark pull request: [SPARK-2410][SQL] Cherry picked Hive Thrift/JD...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/1399

[SPARK-2410][SQL] Cherry picked Hive Thrift/JDBC server

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc). (Thanks @chenghao-intel for his initial contribution of the Spark SQL CLI.)

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark thriftserver
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1399.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1399

commit b108e5035a68acfe8b5914468e64ddc7f392697a
Author: Cheng Lian lian.cs@gmail.com
Date: 2014-07-14T05:22:25Z
Cherry picked the Hive Thrift server
[GitHub] spark pull request: [SPARK-1946] Submit tasks after (configured ra...
Github user li-zhihui commented on the pull request: https://github.com/apache/spark/pull/900#issuecomment-48869936

@tgravescs Added a commit according to the comments.
[GitHub] spark pull request: [SPARK-2410][SQL] Cherry picked Hive Thrift/JD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1399#issuecomment-48869960

QA tests have started for PR 1399. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16615/consoleFull
[GitHub] spark pull request: [SPARK-2467] Revert SparkBuild to publish-loca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1398#issuecomment-48869958

QA tests have started for PR 1398. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16616/consoleFull
[GitHub] spark pull request: [SPARK-2410][SQL] Cherry picked Hive Thrift/JD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1399#issuecomment-48869966

QA results for PR 1399:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class HiveThriftServer2(hiveContext: HiveContext)
  class SparkSQLCLIDriver extends CliDriver with Logging {
  class SparkSQLCLIService(hiveContext: HiveContext)
  class SparkSQLDriver(val context: HiveContext = SparkSQLEnv.hiveContext)
  class SparkSQLSessionManager(hiveContext: HiveContext)
  class SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16615/consoleFull
[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/1385#discussion_r14866324

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -552,17 +552,10 @@ class SparkContext(config: SparkConf) extends Logging {
       valueClass: Class[V],
       minPartitions: Int = defaultMinPartitions
       ): RDD[(K, V)] = {
-    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
-    val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
-    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
--- End diff --

There is no extra reason for broadcasting the Configuration. I think it is because the old API here instantiates HadoopRDD directly, and HadoopRDD's constructor needs broadcastedConf: Broadcast[SerializableWritable[Configuration]], so the Configuration was broadcast here.
[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/1385#discussion_r14866367

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -128,25 +123,13 @@ class HadoopRDD[K, V](
   // Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads.
   protected def getJobConf(): JobConf = {
-    val conf: Configuration = broadcastedConf.value.value
-    if (conf.isInstanceOf[JobConf]) {
-      // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it.
-      conf.asInstanceOf[JobConf]
-    } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
+    val conf: JobConf = broadcastedConf.value.value
+    if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
--- End diff --

Yeah, there is no need to cache the JobConf if it is already broadcast.
[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1385#discussion_r14866389

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -552,17 +552,10 @@ class SparkContext(config: SparkConf) extends Logging {
       valueClass: Class[V],
       minPartitions: Int = defaultMinPartitions
       ): RDD[(K, V)] = {
-    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
-    val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
-    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
--- End diff --

There is. See my comment below.
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48871282

@andrewor14 the task id is not logged twice. One is the TID, which is used to correlate with the executors (because executors only know about TIDs), and the other is the task index, which is the index of the task within a stage.
[GitHub] spark pull request: [SPARK-2410][SQL] Cherry picked Hive Thrift/JD...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1399#issuecomment-48871338

Hey @pwendell, would you mind helping review my changes to the POM and SBT files? Not 100% sure about the correctness. Thanks!
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1056#discussion_r14866901

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -37,8 +36,15 @@
 import org.apache.spark._
 import org.apache.spark.executor.TaskMetrics
 import org.apache.spark.partial.{ApproximateActionListener, ApproximateEvaluator, PartialResult}
 import org.apache.spark.rdd.RDD
-import org.apache.spark.storage.{BlockId, BlockManager, BlockManagerMaster, RDDBlockId}
-import org.apache.spark.util.{CallSite, SystemClock, Clock, Utils}
+import org.apache.spark.storage._
+import org.apache.spark.util.{SystemClock, Clock, Utils}
+import scala.Some
+import org.apache.spark.FetchFailed
+import org.apache.spark.storage.RDDBlockId
+import org.apache.spark.util.CallSite
+import akka.actor.OneForOneStrategy
+import org.apache.spark.ExceptionFailure
+import org.apache.spark.storage.BlockManagerMessages.BlockManagerHeartbeat
--- End diff --

Oops, my IDE is doing this for some reason.
[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/1385#discussion_r14866988

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -206,17 +202,10 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
       tableDesc: TableDesc,
       path: String,
       inputFormatClass: Class[InputFormat[Writable, Writable]]): RDD[Writable] = {
-
-    val initializeJobConfFunc = HadoopTableReader.initializeLocalJobConfFunc(path, tableDesc) _
-
-    val rdd = new HadoopRDD(
-      sc.sparkContext,
-      _broadcastedHiveConf.asInstanceOf[Broadcast[SerializableWritable[Configuration]]],
-      Some(initializeJobConfFunc),
-      inputFormatClass,
-      classOf[Writable],
-      classOf[Writable],
-      _minSplitsPerRDD)
+    val jobConf = new JobConf(_broadcastedHiveConf.value.value.asInstanceOf[Configuration])
--- End diff --

My implementation broadcasts the conf for each HadoopRDD. Do you mean we should broadcast just once for all HadoopRDDs, right?
[GitHub] spark pull request: Spark REEF Module
GitHub user deadmau opened a pull request: https://github.com/apache/spark/pull/1400

Spark REEF Module

Added a reef module that allows Spark to run on top of [REEF](http://www.reef-project.org).
Modified: SparkContext.scala, pom.xml
Added: reef module
For the REEF-side module, see this [link](https://github.com/cmssnu/spark_on_reef/tree/reef-0.1).

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cmssnu/spark reef-0.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1400.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1400

commit 7515367e361c910ea81bae65e42e32a5a6763a5e
Author: Takuya UESHIN ues...@happy-camper.st
Date: 2014-05-15T18:20:21Z
[SPARK-1845] [SQL] Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.

When I execute `orderBy` or `limit` for a `SchemaRDD` including `ArrayType` or `MapType`, `SparkSqlSerializer` throws one of the following exceptions:

```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon
```

```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.Vector
```

```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap
```

and so on. This is because registrations of serializers for the concrete collection classes are missing in `SparkSqlSerializer`. I believe it should use `AllScalaRegistrar`, which covers a lot of serializers for the concrete classes of `Seq` and `Map` used by `ArrayType` and `MapType`.

Author: Takuya UESHIN ues...@happy-camper.st
Closes #790 from ueshin/issues/SPARK-1845 and squashes the following commits:
d1ed992 [Takuya UESHIN] Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.
(cherry picked from commit db8cc6f28abe4326cea6f53feb604920e4867a27)
Signed-off-by: Reynold Xin r...@apache.org

commit f9eeddccbd42064f5d1234b323ac74bb2a39e0aa
Author: Takuya UESHIN ues...@happy-camper.st
Date: 2014-05-15T18:21:33Z
[SPARK-1819] [SQL] Fix GetField.nullable.

`GetField.nullable` should be `true` not only when `field.nullable` is `true` but also when `child.nullable` is `true`.

Author: Takuya UESHIN ues...@happy-camper.st
Closes #757 from ueshin/issues/SPARK-1819 and squashes the following commits:
8781a11 [Takuya UESHIN] Modify a test to use named parameters.
5bfc77d [Takuya UESHIN] Fix GetField.nullable.
(cherry picked from commit 94c9d6f59859ebc77fae112c2c42c64b7a4d7f83)
Signed-off-by: Reynold Xin r...@apache.org

commit bc9a96e2e97d4a9b4a2075fb026be320b96bd08b
Author: Xiangrui Meng m...@databricks.com
Date: 2014-05-15T18:59:59Z
[SPARK-1741][MLLIB] add predict(JavaRDD) to RegressionModel, ClassificationModel, and KMeans

`model.predict` returns an RDD of a Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users. Added tests for KMeans, LinearRegression, and NaiveBayes. Will update examples after https://github.com/apache/spark/pull/653 gets merged.

cc: @srowen

Author: Xiangrui Meng m...@databricks.com
Closes #670 from mengxr/predict-javardd and squashes the following commits:
b77ccd8 [Xiangrui Meng] Merge branch 'master' into predict-javardd
43caac9 [Xiangrui Meng] add predict(JavaRDD) to RegressionModel, ClassificationModel, and KMeans
(cherry picked from commit d52761d67f42ad4d2ff02d96f0675fb3ab709f38)
Signed-off-by: Patrick Wendell pwend...@gmail.com

commit 35870574a6e33a39c139139c8739a82796af5ebb
Author: Sandy Ryza sa...@cloudera.com
Date: 2014-05-15T23:35:39Z
SPARK-1851. Upgrade Avro dependency to 1.7.6 so Spark can read Avro files

Author: Sandy Ryza sa...@cloudera.com
Closes #795 from sryza/sandy-spark-1851 and squashes the following commits:
79c8227 [Sandy Ryza] SPARK-1851. Upgrade Avro dependency to 1.7.6 so Spark can read Avro files
(cherry picked from commit 08e7606a964e3d1ac1d565f33651ff0035c75044)
Signed-off-by: Patrick Wendell pwend...@gmail.com

commit 22f261a1a3efbd466ca0588cc77beb92fb14b6a3
Author: Stevo Slavić ssla...@gmail.com
Date: 2014-05-15T23:44:14Z
SPARK-1803 Replaced colon in filenames with a dash

This patch replaces the colon in several filenames with a dash to make these filenames Windows compatible.
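The first cherry-picked commit above (SPARK-1845) fixes the Kryo failures by registering serializers for Scala's concrete collection classes. As a minimal sketch of that technique, here is how Twitter chill's `AllScalaRegistrar` (the registrar the commit names) would typically be applied to a Kryo instance; `KryoSetup` is an illustrative name, not Spark code:

```
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.AllScalaRegistrar

object KryoSetup {
  // Build a Kryo instance that can (de)serialize concrete Scala collections
  // such as immutable.List (:: cells), Vector, and HashMap$HashTrieMap,
  // none of which have the no-arg constructors Kryo expects by default.
  def newKryo(): Kryo = {
    val kryo = new Kryo()
    new AllScalaRegistrar().apply(kryo) // registers Seq/Map/etc. implementations
    kryo
  }
}
```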
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1056#discussion_r14867043

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala ---
@@ -52,25 +52,24 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus
   private val akkaTimeout = AkkaUtils.askTimeout(conf)
-  val slaveTimeout = conf.get("spark.storage.blockManagerSlaveTimeoutMs",
-    "" + (BlockManager.getHeartBeatFrequency(conf) * 3)).toLong
+  val slaveTimeout = conf.getLong("spark.storage.blockManagerSlaveTimeoutMs",
+    (conf.getInt("spark.executor.heartbeatInterval", 2000) * 3))
--- End diff --

Any reason that would be better?
[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1385#discussion_r14867102

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -206,17 +202,10 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
       tableDesc: TableDesc,
       path: String,
       inputFormatClass: Class[InputFormat[Writable, Writable]]): RDD[Writable] = {
-
-    val initializeJobConfFunc = HadoopTableReader.initializeLocalJobConfFunc(path, tableDesc) _
-
-    val rdd = new HadoopRDD(
-      sc.sparkContext,
-      _broadcastedHiveConf.asInstanceOf[Broadcast[SerializableWritable[Configuration]]],
-      Some(initializeJobConfFunc),
-      inputFormatClass,
-      classOf[Writable],
-      classOf[Writable],
-      _minSplitsPerRDD)
+    val jobConf = new JobConf(_broadcastedHiveConf.value.value.asInstanceOf[Configuration])
--- End diff --

Yes. With our current implementation, each Hive partition in Spark SQL creates one HadoopRDD. We absolutely cannot afford broadcasting the conf for each HadoopRDD/Hive partition.
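To make the trade-off above concrete, here is a hedged sketch of the share-one-broadcast pattern: the roughly 10 KB Hadoop Configuration is broadcast exactly once and the same handle is reused for every RDD that needs it, instead of paying one broadcast per HadoopRDD/Hive partition. `sc.broadcast` and `SerializableWritable` are the Spark APIs quoted in the diff; the loop and object name are illustrative:

```
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}

object SharedConfBroadcast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local"))

    // Broadcast the ~10 KB Hadoop configuration exactly once...
    val confBroadcast = sc.broadcast(new SerializableWritable(new Configuration()))

    // ...then reuse the same broadcast handle for every RDD/partition that
    // needs it, instead of re-broadcasting per HadoopRDD.
    (1 to 100).foreach { _ =>
      val conf: Configuration = confBroadcast.value.value
      // construct a JobConf / HadoopRDD for this partition from `conf` ...
    }

    sc.stop()
  }
}
```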
[GitHub] spark pull request: Spark REEF Module
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1400#issuecomment-48872835

Can one of the admins verify this patch?
[GitHub] spark pull request: Spark REEF Module
Github user deadmau closed the pull request at: https://github.com/apache/spark/pull/1400
[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1354#issuecomment-48872919

Thanks. Merging this in master.
[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1354
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1056#discussion_r14867152

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala ---
@@ -129,7 +128,7 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus
     case ExpireDeadHosts =>
       expireDeadHosts()
-    case HeartBeat(blockManagerId) =>
+    case BlockManagerHeartbeat(blockManagerId) =>
--- End diff --

The origin of this is still from executors. The same two purposes remain: let the block manager master know that the executor is still alive, and find out whether the executor's block manager should reregister. It just gets passed along inside the generic executor-to-driver heartbeat instead of as its own message.
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/1056#discussion_r14867310

--- Diff: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import akka.actor.Actor
+import org.apache.spark.executor.TaskMetrics
+import org.apache.spark.storage.BlockManagerId
+import org.apache.spark.scheduler.TaskScheduler
+
+case class Heartbeat(executorId: String, taskMetrics: Array[(Long, TaskMetrics)],
+    blockManagerId: BlockManagerId) extends Serializable
+
+class HeartbeatReceiver(scheduler: TaskScheduler) extends Actor {
+  override def receive = {
+    case Heartbeat(executorId, taskMetrics, blockManagerId) =>
+      sender ! scheduler.executorHeartbeat(executorId, taskMetrics, blockManagerId)
--- End diff --

This is my first time dealing with Akka, so let me know if I'm missing something, but I went this route because the BlockManagerMasterActor needs to receive the event, and nobody is able to get a direct reference to it. It's only accessible through an ActorRef. My understanding of the Actor pattern is that Actors should only be communicated with via messages.
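For readers unfamiliar with the pattern in the diff above, here is a minimal, self-contained ask/reply sketch in the Akka style of this era: the caller only ever holds an ActorRef, sends a message, and the actor answers via `sender` — the same shape as `sender ! scheduler.executorHeartbeat(...)`. `PingActor`, `Ping`, and `Pong` are illustrative names, not Spark code:

```
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

case object Ping
case object Pong

class PingActor extends Actor {
  def receive = {
    case Ping => sender ! Pong // the reply goes back to whoever asked
  }
}

object AskExample extends App {
  val system = ActorSystem("demo")
  val ref = system.actorOf(Props[PingActor], "ping") // only an ActorRef is exposed
  implicit val timeout = Timeout(5.seconds)
  val reply = Await.result(ref ? Ping, timeout.duration) // ask: send + await reply
  println(reply) // Pong
  system.shutdown()
}
```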
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/1056#issuecomment-48873857

Thanks for the feedback @andrewor14 . Will upmerge and upload a patch with your changes, except for where I had questions above. Do you know what the easiest way to check the size of an Akka message is? Any pointers on where to look for how to deal with the Akka frame size issue?
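On checking a message's size: one rough approach (a sketch, not Spark's actual mechanism) is to run the message through plain Java serialization and count the bytes. The true on-the-wire size depends on the serializer Akka is configured with, so treat this as an estimate; the frame-size limit in question is Spark's `spark.akka.frameSize` setting:

```
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object MessageSize {
  // Serialize an object graph with Java serialization and count the bytes.
  def sizeOf(msg: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(msg)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    // Example payload: a 1000-element metrics-like array.
    println(sizeOf(Array.fill(1000)(42L)) + " bytes")
  }
}
```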
[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...
Github user lirui-intel commented on the pull request: https://github.com/apache/spark/pull/1212#issuecomment-48873878

Hi @mridulm , I've added some test cases to capture the scheduling behavior of RACK_LOCAL tasks. Let me know if I got anything wrong.
[GitHub] spark pull request: [SPARK-2467] Revert SparkBuild to publish-loca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1398#issuecomment-48875578

QA results for PR 1398:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16616/consoleFull
[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/1385#discussion_r14868164

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -206,17 +202,10 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
       tableDesc: TableDesc,
       path: String,
       inputFormatClass: Class[InputFormat[Writable, Writable]]): RDD[Writable] = {
-
-    val initializeJobConfFunc = HadoopTableReader.initializeLocalJobConfFunc(path, tableDesc) _
-
-    val rdd = new HadoopRDD(
-      sc.sparkContext,
-      _broadcastedHiveConf.asInstanceOf[Broadcast[SerializableWritable[Configuration]]],
-      Some(initializeJobConfFunc),
-      inputFormatClass,
-      classOf[Writable],
-      classOf[Writable],
-      _minSplitsPerRDD)
+    val jobConf = new JobConf(_broadcastedHiveConf.value.value.asInstanceOf[Configuration])
--- End diff --

Got it.
[GitHub] spark pull request: move some test file to match src code
GitHub user adrian-wang opened a pull request: https://github.com/apache/spark/pull/1401

move some test file to match src code

Just move some test suites to the corresponding packages.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/adrian-wang/spark movetestfiles
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1401.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1401

commit d1a680307b690b8658541204d5c9ed5f9abbee07
Author: Daoyuan daoyuan.w...@intel.com
Date: 2014-07-14T08:25:41Z
move some test file to match src code
[GitHub] spark pull request: move some test file to match src code
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1401#issuecomment-48876149

QA tests have started for PR 1401. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16617/consoleFull
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1402

[SPARK-2471] remove runtime scope for jets3t

The assembly jar doesn't include jets3t if we set it to runtime only, but I don't know whether it was set this way for a particular reason.

CC: @srowen @ScrapCodes

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mengxr/spark jets3t
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1402.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1402

commit bfa2d175aedb5db21901d0a71bd5dcf3b9ea0903
Author: Xiangrui Meng m...@databricks.com
Date: 2014-07-14T08:42:15Z
remove runtime scope for jets3t
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48877612

QA tests have started for PR 1402. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16618/consoleFull
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48877650

It's set that way just because it is only used by Hadoop's FileSystem to access S3. Code shouldn't call it directly. Maven should therefore include it in the runtime classpath and assembled jars, but not the compile classpath. I seem to distantly recall that sbt doesn't quite do this. I hope there is an equivalent for sbt as that's the right thing to do. It's not the end of the world to make jets3t appear on the compile classpath. (So I assume the sbt assembly still has to be supported? the Maven one should be OK in this regard)
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48878585

I see. sbt-assembly doesn't pick up runtime only jars by default or maybe it doesn't read pom correctly. @ScrapCodes Do you know whether we can tell sbt-assembly to include runtime jars? It is weird that it doesn't do that by default or maybe it is the sbt-pom-reader's problem.
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48880738

Hi, I just checked that it is correctly included, by doing `show runtime:full-classpath`. Have to check why it is not in the assembly.

Prashant Sharma
[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...
Github user avulanov commented on a diff in the pull request: https://github.com/apache/spark/pull/1155#discussion_r14870864

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala ---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Matrices, Matrix}
+import org.apache.spark.rdd.RDD
+
+import scala.collection.Map
+
+/**
+ * ::Experimental::
+ * Evaluator for multiclass classification.
+ *
+ * @param predictionAndLabels an RDD of (prediction, label) pairs.
+ */
+@Experimental
+class MulticlassMetrics(predictionAndLabels: RDD[(Double, Double)]) {
+
+  private lazy val labelCountByClass: Map[Double, Long] = predictionAndLabels.values.countByValue()
+  private lazy val labelCount: Long = labelCountByClass.values.sum
+  private lazy val tpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (label, if (label == prediction) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val fpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (prediction, if (prediction != label) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val confusions = predictionAndLabels.map {
+    case (prediction, label) => ((prediction, label), 1)
+  }.reduceByKey(_ + _).collectAsMap()
+
+  /**
+   * Returns confusion matrix:
+   * predicted classes are in columns,
+   * they are ordered by class label ascending,
+   * as in labels
+   */
+  lazy val confusionMatrix: Matrix = {
+    val transposedMatrix = Array.ofDim[Double](labels.size, labels.size)
+    for (i <- 0 to labels.size - 1; j <- 0 to labels.size - 1) {
+      transposedMatrix(i)(j) = confusions.getOrElse((labels(i), labels(j)), 0).toDouble
+    }
+    val flatMatrix = transposedMatrix.flatMap(arr => arr)
--- End diff --

I removed the intermediate matrix. Could you clarify what you want me to do with the for loop?
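One possible answer to the question above (a sketch under the diff's definitions of `labels` and `confusions`, not the final MLlib code): fill the flat values array directly with index arithmetic, removing both the intermediate 2-D array and the for-comprehension. Flat index `k = i * n + j` mirrors the row-by-row flatten of the diff's transposed matrix:

```
import scala.collection.Map

object ConfusionFlatten {
  // Equivalent to building transposedMatrix and flatMapping it:
  // slot k holds the count for the pair (labels(k / n), labels(k % n)).
  def confusionValues(
      labels: Array[Double],
      confusions: Map[(Double, Double), Int]): Array[Double] = {
    val n = labels.size
    Array.tabulate(n * n) { k =>
      confusions.getOrElse((labels(k / n), labels(k % n)), 0).toDouble
    }
  }

  def main(args: Array[String]): Unit = {
    val labels = Array(0.0, 1.0)
    val confusions = Map((0.0, 0.0) -> 2, (1.0, 0.0) -> 1, (1.0, 1.0) -> 3)
    println(confusionValues(labels, confusions).mkString(", ")) // 2.0, 0.0, 1.0, 3.0
  }
}
```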
[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/1155#issuecomment-48882671

@mengxr I addressed your comments, except the one above, on which I commented.
[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1155#issuecomment-48882753

QA tests have started for PR 1155. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16619/consoleFull
[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/1403

[WIP][SQL] By default does not run hive compatibility tests

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/witgo/spark hive_compatibility
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1403.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1403

commit a6643165ce2a21cdbf5cad702bc4922308af5088
Author: witgo wi...@qq.com
Date: 2014-07-14T09:53:24Z
The default does not run hive compatibility tests
[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1403#issuecomment-48883582

QA tests have started for PR 1403. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16620/consoleFull
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48883641

It is actually intended that way; the assembly is not meant as a replacement for the runtime classpath. One way to go about it is to not have it as a runtime dependency.
[GitHub] spark pull request: move some test file to match src code
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1401#issuecomment-48883713

QA results for PR 1401:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class BroadcastSuite extends FunSuite with LocalSparkContext {
  class ConnectionManagerSuite extends FunSuite {
  class PipedRDDSuite extends FunSuite with SharedSparkContext {
  class ZippedPartitionsSuite extends FunSuite with SharedSparkContext {
  class AkkaUtilsSuite extends FunSuite with LocalSparkContext {

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16617/consoleFull
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48883754

What is the assembly then? I always took this term to mean all of the runtime dependencies together. How would I make a runnable JAR in SBT in general?
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48884485

The assembly should just include compile scope. Take for example slf4j, where the end user can bring in either log4j or logback. Those are supposed to be on the runtime classpath, and putting them in the assembly jar would kill the purpose of having this pluggability at runtime.
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48885032

The SLF4J binding is a runtime dependency, and it may be one that a down-stream consumer wants to override. But leaving it out entirely yields no runtime binding at all -- no logging. There is an argument for not including any SLF4J runtime dependency in Spark at all, and which requires users to add their own to get any log messages. Debatable -- but a different question. SLF4J is a bit of a funny example though. What about other runtime dependencies? jets3t is a good one. It has to be on the runtime classpath, so needs to go into a runtime assembly JAR. It does not need to appear on the compile classpath, and shouldn't ideally. This is not the same as provided scope, where the user is expected to bring the artifact from the local environment. What's the standard SBT magic for this?
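For reference, the distinction under discussion expressed in sbt's DSL — a sketch with an illustrative version number; whether sbt-assembly then includes the jar is exactly the open question in this thread:

```
// In build.sbt: keep jets3t off the compile classpath but on the runtime
// classpath, mirroring Maven's <scope>runtime</scope>.
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.0" % "runtime"
```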
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48885040

QA results for PR 1402:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16618/consoleFull
[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1155#issuecomment-48890147

QA results for PR 1155:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class MulticlassMetrics(predictionAndLabels: RDD[(Double, Double)]) {
  * (equals to precision for multiclass classifier

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16619/consoleFull
[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1403#issuecomment-48890757

QA results for PR 1403:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16620/consoleFull
[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1403#issuecomment-48894943

QA tests have started for PR 1403. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16621/consoleFull
[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1403#issuecomment-48896247

QA tests have started for PR 1403. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16622/consoleFull
[GitHub] spark pull request: Remove NOTE: SPARK_YARN is deprecated, please...
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/1404

Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/witgo/spark run-tests
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1404.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1404

commit 04dd78e93be9371925424101e3a4722e0b558715
Author: witgo wi...@qq.com
Date: 2014-07-14T14:29:14Z
Remove "NOTE: SPARK_YARN is deprecated, please use -Pyarn flag"
[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1403#issuecomment-48907275

QA results for PR 1403:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16622/consoleFull
[GitHub] spark pull request: Remove NOTE: SPARK_YARN is deprecated, please...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1404#issuecomment-48907508

QA tests have started for PR 1404. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16623/consoleFull
[GitHub] spark pull request: Remove NOTE: SPARK_YARN is deprecated, please...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1404#issuecomment-48920976

QA results for PR 1404:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16623/consoleFull
[GitHub] spark pull request: [SPARK-2154] Schedule next Driver when one com...
GitHub user aarondav opened a pull request: https://github.com/apache/spark/pull/1405

[SPARK-2154] Schedule next Driver when one completes (standalone mode)

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark 2154
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1405.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1405

commit 24e9ef9c1cf0acfd62330c4e311cd47af85ea4a3
Author: Aaron Davidson aa...@databricks.com
Date: 2014-07-14T16:20:26Z
[SPARK-2154] Schedule next Driver when one completes (standalone mode)
[GitHub] spark pull request: [SPARK-2154] Schedule next Driver when one com...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1405#issuecomment-48922358

QA tests have started for PR 1405. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16624/consoleFull
[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/1406 [SPARK-2474][SQL] For a registered table in OverrideCatalog, the Analyzer failed to resolve references in the format of tableName.fieldName Please refer to JIRA (https://issues.apache.org/jira/browse/SPARK-2474) for how to reproduce the problem and my understanding of the root cause. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark SPARK-2474 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1406.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1406 commit a5c2145a57f20b98bd34a7de16b0c697b7913ec9 Author: Yin Huai h...@cse.ohio-state.edu Date: 2014-07-14T16:40:05Z Support sql/console. commit 568de1f92fcad7ab3293941106adf7fd943c5bed Author: Yin Huai h...@cse.ohio-state.edu Date: 2014-07-14T16:40:17Z Wrap the relation in a Subquery named by the table name in OverrideCatalog.lookupRelation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48925260 Task ID is globally unique, index is unique within a stage. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1056#discussion_r14888926
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala ---
@@ -52,25 +52,24 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus

   private val akkaTimeout = AkkaUtils.askTimeout(conf)

-  val slaveTimeout = conf.get("spark.storage.blockManagerSlaveTimeoutMs",
-    "" + (BlockManager.getHeartBeatFrequency(conf) * 3)).toLong
+  val slaveTimeout = conf.getLong("spark.storage.blockManagerSlaveTimeoutMs",
+    (conf.getInt("spark.executor.heartbeatInterval", 2000) * 3))
--- End diff --

No huge reason, just that it avoids nesting the `conf.get`s and makes it easier to see what the final value is (imagine if you had 5 more gets). Though I just noticed that here you can't do that easily because of the `* 3`, so scratch that.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
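[For context, a minimal sketch of the naming pattern described above, using the config keys from this diff; it assumes a `SparkConf` named `conf` in scope and is not the committed code:]

```scala
// Naming the inner lookup keeps the final expression flat and makes
// the resulting value easier to scan than nested conf.get calls.
val heartbeatMs = conf.getInt("spark.executor.heartbeatInterval", 2000)
val slaveTimeout = conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", heartbeatMs * 3)
```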
[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1402#issuecomment-48927532 @ScrapCodes I agree with @srowen that the assembly jar should include runtime libraries except those marked provided. If we cannot find a way to let sbt understand maven runtime scope, we should merge this change because without jets3t all S3 operations are broken. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...
Github user yhuai closed the pull request at: https://github.com/apache/spark/pull/1406 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...
GitHub user yhuai reopened a pull request: https://github.com/apache/spark/pull/1406 [SPARK-2474][SQL] For a registered table in OverrideCatalog, the Analyzer failed to resolve references in the format of tableName.fieldName Please refer to JIRA (https://issues.apache.org/jira/browse/SPARK-2474) for how to reproduce the problem and my understanding of the root cause. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark SPARK-2474 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1406.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1406 commit a5c2145a57f20b98bd34a7de16b0c697b7913ec9 Author: Yin Huai h...@cse.ohio-state.edu Date: 2014-07-14T16:40:05Z Support sql/console. commit c43ad00b8a07c168c56f88f632684c6fec5cd719 Author: Yin Huai h...@cse.ohio-state.edu Date: 2014-07-14T17:17:29Z Wrap the relation in a Subquery named by the table name in OverrideCatalog.lookupRelation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1406#issuecomment-48929018 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1406#issuecomment-48929426 QA tests have started for PR 1406. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16626/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1155#issuecomment-48930159 @avulanov In Scala, `for` is slower than `while`. See https://issues.scala-lang.org/browse/SI-1338 for example. So please replace the for loop with two while loops in your implementation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
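[A sketch of the requested rewrite; the names `n`, `matrix`, and `sum` are illustrative, not from the PR:]

```scala
val n = 3
val matrix = Array.fill(n, n)(1.0)
var sum = 0.0

// A nested for comprehension:
for (i <- 0 until n; j <- 0 until n) {
  sum += matrix(i)(j)
}

// The same traversal as two while loops, which avoids the closure
// allocation and Range.foreach overhead the for version desugars into:
sum = 0.0
var i = 0
while (i < n) {
  var j = 0
  while (j < n) {
    sum += matrix(i)(j)
    j += 1
  }
  i += 1
}
```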
[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1155#discussion_r14890904
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.collection.Map
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Matrices, Matrix}
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for multiclass classification.
+ *
+ * @param predictionAndLabels an RDD of (prediction, label) pairs.
+ */
+@Experimental
+class MulticlassMetrics(predictionAndLabels: RDD[(Double, Double)]) {
+
+  private lazy val labelCountByClass: Map[Double, Long] = predictionAndLabels.values.countByValue()
+  private lazy val labelCount: Long = labelCountByClass.values.sum
+  private lazy val tpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (label, if (label == prediction) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val fpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (prediction, if (prediction != label) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val confusions = predictionAndLabels.map {
+    case (prediction, label) => ((prediction, label), 1)
+  }.reduceByKey(_ + _).collectAsMap()
+
+  /**
+   * Returns confusion matrix:
+   * predicted classes are in columns,
+   * they are ordered by class label ascending,
+   * as in labels
+   */
+  lazy val confusionMatrix: Matrix = {
+    val transposedFlatMatrix = Array.ofDim[Double](labels.size * labels.size)
--- End diff --

Save `labels.size` to `n`? Btw, I'm not sure whether we should use `lazy val` here because the result matrix could be 1000x1000, different from other lazy vals used here.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
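[The first suggestion, sketched; this assumes the surrounding `MulticlassMetrics` class and is not the committed fix:]

```scala
// Hoist labels.size into a local so the n x n intent is explicit,
// and consider an eager val (or a def) if holding a potentially
// large matrix behind a lazy val is a concern.
val n = labels.size
val transposedFlatMatrix = Array.ofDim[Double](n * n)
```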
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48930414 Thanks for reviewing this, everyone. I'm all for commenting and cleaning things up here, but if possible I'd like to merge this in today. There are a couple of people blocked on this, as it's a pretty severe performance bug. How about we just add some TODOs that can be taken care of in a follow-up PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1155#discussion_r14891003
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.collection.Map
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Matrices, Matrix}
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for multiclass classification.
+ *
+ * @param predictionAndLabels an RDD of (prediction, label) pairs.
+ */
+@Experimental
+class MulticlassMetrics(predictionAndLabels: RDD[(Double, Double)]) {
+
+  private lazy val labelCountByClass: Map[Double, Long] = predictionAndLabels.values.countByValue()
+  private lazy val labelCount: Long = labelCountByClass.values.sum
+  private lazy val tpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (label, if (label == prediction) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val fpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (prediction, if (prediction != label) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val confusions = predictionAndLabels.map {
+    case (prediction, label) => ((prediction, label), 1)
--- End diff --

The code style is not consistent with the blocks above. Please move `case (prediction, label) =>` to the line above.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1406#issuecomment-48931344 QA tests have started for PR 1406. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16627/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: move some test file to match src code
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1401#issuecomment-48931698 Merging in master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: move some test file to match src code
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1401 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/1407 SPARK-1215: Clustering: Index out of bounds error Bug fix for JIRA SPARK 1215: Clustering: Index out of bounds error https://issues.apache.org/jira/browse/SPARK-1215 Solution: Print warning, and use duplicate cluster centers so that exactly k centers are returned. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jkbradley/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1407.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1407 commit 97f2104bac2ab864c2a03f9a12a4b936557ae6d6 Author: Joseph Bradley joseph.kurata.brad...@gmail.com Date: 2014-05-20T01:35:53Z added RDD::stratifiedSample method and associated unit tests in RDDSuite. Method is built off of RDD::takeSample method. commit 91e83338820158b96cda492668dbed5fff33f19b Author: Joseph Bradley joseph.kurata.brad...@gmail.com Date: 2014-05-20T01:36:30Z added RDD::stratifiedSample method documentation commit d6f8913b7e370a82138b9c623754b32a59c21cf6 Author: Joseph Bradley joseph.kurata.brad...@gmail.com Date: 2014-05-21T04:58:12Z updated stratifiedSample to be more scalable, keeping data in RDDs instead of collecting to the driver commit 21eead6a412508b536358f4e557e2fab23c9c696 Author: Joseph Bradley joseph.kurata.brad...@gmail.com Date: 2014-05-23T19:51:07Z updated stratifiedSample to use selection-rejection to select samples on each partition in 1 pass, rather than pre-selecting indices commit 91f4b19702bc58a77d28316674eace881a81165f Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-09T21:37:25Z merging with new spark commit c0cb5f0d8c6104e3eb6cfa44820ba00b81bc7262 Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-11T18:43:17Z merging with updated spark commit 7d1b812a720cffdefe78ddb6e641930e7ae4975b Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-11T18:47:41Z removed my coding test updates commit 18e5c8ad740871be92c6d7b73f5d35e25641a734 Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-12T01:12:44Z Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k. Added two related unit tests to KMeansSuite. commit e2bf638c6b3e8cc9cec3362caddb2305109d4c0a Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-14T17:52:33Z Merge remote-tracking branch 'upstream/master' --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
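[A sketch of the fix described in the PR summary; `padCenters` is a hypothetical helper for illustration, not the actual patch:]

```scala
// Warn and duplicate existing centers so that exactly k centers are
// always returned. Assumes at least one center was found.
def padCenters(found: Array[Array[Double]], k: Int): Array[Array[Double]] = {
  if (found.length >= k) {
    found.take(k)
  } else {
    println(s"Warning: found only ${found.length} distinct centers; duplicating to return k = $k")
    Array.tabulate(k)(i => found(i % found.length))
  }
}
```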
[GitHub] spark pull request: [SQL][CORE] SPARK-2102
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1377#issuecomment-48934357 @pwendell, any thoughts on the additional option for kryo? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-48934486 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2154] Schedule next Driver when one com...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1405#issuecomment-48934747 QA results for PR 1405:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16624/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin closed the pull request at: https://github.com/apache/spark/pull/1390 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48935494 @yhuai suggested a much simpler fix -- I benchmarked this and it gave the same performance boost. I am closing this and opening a new PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [PySpark] hijack hash to make hash of None con...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1371#issuecomment-48935684 Jenkins, test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/1361#issuecomment-48935929 @mengxr I added two tests, they check that parameter estimates are accurate, and improve over time. The tests use temporary file writing / file streams, which is clunky, but @tdas will help add dependencies on the streaming test suite so we can use its utilities instead. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1406#issuecomment-48936402 QA results for PR 1406:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16625/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
GitHub user concretevitamin opened a pull request: https://github.com/apache/spark/pull/1408 [SPARK-2443][SQL] Fix slow read from partitioned tables This fix achieves a performance boost comparable to [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by @yhuai. The results below are updated for this fix.

## Benchmarks

Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.

Without the fix:

Type | Non-partitioned | Partitioned (1 part)
--- | --- | ---
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)

With this fix:

Type | Non-partitioned | Partitioned (1 part)
--- | --- | ---
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/concretevitamin/spark slow-read-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1408.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1408 commit d86e437218f99179934ccd9b4d5d89c02b09459d Author: Zongheng Yang zonghen...@gmail.com Date: 2014-07-14T18:03:07Z Move update initialization out of potentially long loop. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
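[A language-level sketch of the hoisting pattern behind this fix; `Deserializer`, `slowScan`, and `fastScan` are illustrative stand-ins, not the actual TableReader code:]

```scala
// Stand-in for Hive's SerDe, which is expensive to instantiate.
class Deserializer { def deserialize(row: Long): String = row.toString }

// Before: per-row setup inside a potentially very long loop.
def slowScan(rows: Iterator[Long]): Iterator[String] =
  rows.map { row =>
    val d = new Deserializer() // re-created for every row: wasteful
    d.deserialize(row)
  }

// After: loop-invariant setup hoisted out, created once per partition.
def fastScan(rows: Iterator[Long]): Iterator[String] = {
  val d = new Deserializer()
  rows.map(row => d.deserialize(row))
}
```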
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48936743 New PR here: #1408 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48936856 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48937213 QA tests have started for PR 1408. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16631/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14894946
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
       // Create local references so that the outer object isn't serialized.
       val tableDesc = _tableDesc
+      val tableSerDeClass = tableDesc.getDeserializerClass
+      val broadcastedHiveConf = _broadcastedHiveConf
       val localDeserializer = partDeserializer

       val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-      hivePartitionRDD.mapPartitions { iter =>
+      hivePartitionRDD.mapPartitions { case iter =>
--- End diff --

I initially thought in a function context, `{ case x => ... }` will be optimized to `{ x => ... }`. I did a `scalac -print` on a simple program to confirm that this is not the case.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
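[For readers following along, the two literal forms in question; a sketch, not from the PR:]

```scala
// Both have the same type, but they desugar differently.
val f: Int => Int = x => x + 1            // plain Function1: apply(x) is just x + 1
val g: Int => Int = { case x => x + 1 }   // pattern-matching function: apply(x) is
                                          //   x match { case x => x + 1 }
// For an irrefutable pattern like a bare variable the match always
// succeeds, so the `case` form only adds an extra match step here.
```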
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-48941276 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-48941686 QA tests have started for PR 1238. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16632/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1406#issuecomment-48941916 QA results for PR 1406:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16626/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14896742
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+
+/**
+ * New correlation algorithms should implement this trait
+ */
+trait Correlation {
+
+  /**
+   * Compute correlation for two datasets.
+   */
+  def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double
+
+  /**
+   * Compute the correlation matrix S, for the input matrix, where S(i, j) is the correlation
+   * between column i and j.
+   */
+  def computeCorrelationMatrix(X: RDD[Vector]): Matrix
+
+  /**
+   * Combine the two input RDD[Double]s into an RDD[Vector] and compute the correlation using the
+   * correlation implementation for RDD[Vector]
+   */
+  def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): Double = {
+    val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter =>
--- End diff --

No can do here since I have a second argument (preservesPartitioning)
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14896912
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+
+/**
+ * New correlation algorithms should implement this trait
+ */
+trait Correlation {
+
+  /**
+   * Compute correlation for two datasets.
+   */
+  def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double
+
+  /**
+   * Compute the correlation matrix S, for the input matrix, where S(i, j) is the correlation
+   * between column i and j.
+   */
+  def computeCorrelationMatrix(X: RDD[Vector]): Matrix
+
+  /**
+   * Combine the two input RDD[Double]s into an RDD[Vector] and compute the correlation using the
+   * correlation implementation for RDD[Vector]
+   */
+  def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): Double = {
+    val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter =>
+      iter.map { case (xi, yi) => new DenseVector(Array(xi, yi)) }
+    }, preservesPartitioning = true)
+    computeCorrelationMatrix(mat)(0, 1)
+  }
+
+}
+
+/**
+ * Delegates computation to the specific correlation object based on the input method name
+ *
+ * Currently supported correlations: pearson, spearman.
+ * After new correlation algorithms are added, please update the documentation here and in
+ * Statistics.scala for the correlation APIs.
+ *
+ * Cases are ignored when doing method matching. We also allow initials, e.g. P for pearson, as
+ * long as initials are unique in the supported set of correlation algorithms. In addition, a
--- End diff --

Does that mean no toLowerCase either?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1406#issuecomment-48945794 QA results for PR 1406:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16627/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14897552
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.rdd.RDD
+
+/**
+ * Compute Pearson correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Pearson correlation can be found at
+ * http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
+ */
+object PearsonCorrelation extends Correlation {
+
+  /**
+   * Compute the Pearson correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute the Pearson correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val rowMatrix = new RowMatrix(X)
+    val cov = rowMatrix.computeCovariance()
+    computeCorrelationMatrixFromCovariance(cov)
+  }
+
+  /**
+   * Compute the pearson correlation matrix from the covariance matrix
+   */
+  def computeCorrelationMatrixFromCovariance(covarianceMatrix: Matrix): Matrix = {
+    val cov = covarianceMatrix.toBreeze.asInstanceOf[BDM[Double]]
+    val n = cov.cols
+
+    // Compute the standard deviation on the diagonals first
+    var i = 0
+    while (i < n) {
+      cov(i, i) = math.sqrt(cov(i, i))
+      i += 1
+    }
+    // or we could put the stddev in its own array to trade space for one less pass over the matrix
+
+    // TODO: use blas.dspr instead to compute the correlation matrix
+    // if the covariance matrix comes in the upper triangular form for free
+
+    // Loop through columns since cov is column major
+    var j = 0
+    var sigma = 0.0
+    while (j < n) {
+      sigma = cov(j, j)
+      i = 0
+      while (i < j) {
+        val covariance = cov(i, j) / (sigma * cov(i, i))
--- End diff --

What do you want returned when cov(i, i) is zero? Double.NaN? 0.0?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-48948316 Jenkins, add to whitelist and test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14898523
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.rdd.RDD
+
+/**
+ * Compute Pearson correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Pearson correlation can be found at
+ * http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
+ */
+object PearsonCorrelation extends Correlation {
+
+  /**
+   * Compute the Pearson correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute the Pearson correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val rowMatrix = new RowMatrix(X)
+    val cov = rowMatrix.computeCovariance()
+    computeCorrelationMatrixFromCovariance(cov)
+  }
+
+  /**
+   * Compute the pearson correlation matrix from the covariance matrix
+   */
+  def computeCorrelationMatrixFromCovariance(covarianceMatrix: Matrix): Matrix = {
+    val cov = covarianceMatrix.toBreeze.asInstanceOf[BDM[Double]]
+    val n = cov.cols
+
+    // Compute the standard deviation on the diagonals first
+    var i = 0
+    while (i < n) {
+      cov(i, i) = math.sqrt(cov(i, i))
+      i += 1
+    }
+    // or we could put the stddev in its own array to trade space for one less pass over the matrix
+
+    // TODO: use blas.dspr instead to compute the correlation matrix
+    // if the covariance matrix comes in the upper triangular form for free
+
+    // Loop through columns since cov is column major
+    var j = 0
+    var sigma = 0.0
+    while (j < n) {
+      sigma = cov(j, j)
+      i = 0
+      while (i < j) {
+        val covariance = cov(i, j) / (sigma * cov(i, i))
--- End diff --

I think the most honest result is NaN. R will return an error for example. You will get that already as the result of 0.0 / 0.0 in the JVM. It's worth documenting!
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
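[A quick REPL-style check of the JVM floating-point behavior referenced above:]

```scala
// Floating-point division by zero does not throw on the JVM.
val zero = 0.0
println(zero / zero)                    // NaN
println((zero / zero).isNaN)            // true
println((zero / zero) == (zero / zero)) // false: NaN is not equal to itself
println(1.0 / zero)                     // Infinity (nonzero numerator)
```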
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-48949060 QA tests have started for PR 1407. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16633/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14898638
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
       // Create local references so that the outer object isn't serialized.
       val tableDesc = _tableDesc
+      val tableSerDeClass = tableDesc.getDeserializerClass
+      val broadcastedHiveConf = _broadcastedHiveConf
       val localDeserializer = partDeserializer

       val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-      hivePartitionRDD.mapPartitions { iter =>
+      hivePartitionRDD.mapPartitions { case iter =>
--- End diff --

oh really? how does the generated bytecode differ?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14899055
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
--- End diff --

I need the exact index of each entry in order to compute accurate ranks in the case of duplicate values. Doesn't seem like zipWithUniqueId allows me to do that, especially when duplicate values fall into different partitions. Suggestions on making it work with zipWithUniqueId?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
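[A sketch of the distinction under discussion; assumes a Spark shell with `sc` in scope, and the exact `zipWithUniqueId` ids depend on how elements land in partitions:]

```scala
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

// zipWithIndex assigns consecutive global positions (needs an extra
// pass to count elements per partition):
rdd.zipWithIndex().collect()    // Array((a,0), (b,1), (c,2), (d,3))

// zipWithUniqueId only guarantees uniqueness: element i of partition k
// gets id i * numPartitions + k, so ids are not positional.
rdd.zipWithUniqueId().collect() // Array((a,0), (b,2), (c,1), (d,3))
```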
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r14899116
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
+    // Attempt to checkpoint the RDD before splitting it into numCols RDD[Double]s to avoid
+    // computing the lineage prefix multiple times.
+    // If checkpoint directory not set, cache the RDD instead.
+    try {
+      indexed.checkpoint()
+    } catch {
+      case e: Exception => indexed.cache()
+    }
+
+    val numCols = X.first.size
+    val ranks = new Array[RDD[(Long, Double)]](numCols)
+
+    // Note: we use a for loop here instead of a while loop with a single index variable
+    // to avoid race condition caused by closure serialization
+    for (k <- 0 until numCols) {
+      val column = indexed.map { case (vector, index) =>
+        (vector(k), index)
+      }
+      ranks(k) = getRanks(column)
+    }
+
+    val ranksMat: RDD[Vector] = makeRankMatrix(ranks)
+    PearsonCorrelation.computeCorrelationMatrix(ranksMat)
+  }
+
+  /**
+   * Compute the ranks for elements in the input RDD, using the average method for ties.
+   *
+   * With the average method, elements with the same value receive the same rank that's computed
+   * by taking the average of their positions in the sorted list.
+   * e.g. ranks([2, 1, 0, 2]) = [3.5, 2.0, 1.0, 3.5]
+   */
+  private def getRanks(indexed: RDD[(Double, Long)]): RDD[(Long, Double)] = {
+    // Get elements' positions in the sorted list for computing average rank for duplicate values
+    val sorted = indexed.sortByKey().zipWithIndex()
+    val groupedByValue = sorted.groupBy(_._1._1)
--- End diff --

Breaking ties is a loose description of what happens.
I actually need all items of the same value in the same partition in order to take the average of their positions in the sorted list. I'm open to suggestions on how to make it work with mapPartitions though. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
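[The average-rank rule from the `getRanks` doc comment, worked through in plain local Scala; `averageRanks` is an illustrative helper, not from the PR:]

```scala
// Ties receive the mean of their 1-based positions in the sorted order.
def averageRanks(xs: Seq[Double]): Seq[Double] = {
  val positionsByValue = xs.zipWithIndex         // remember original slots
    .sortBy(_._1)                                // sort by value
    .zipWithIndex                                // 0-based sorted position
    .groupBy(_._1._1)                            // group duplicates by value
    .mapValues(group => group.map(_._2 + 1.0))   // 1-based positions per value
  val rankByValue = positionsByValue.mapValues(ps => ps.sum / ps.size)
  xs.map(rankByValue)
}

averageRanks(Seq(2, 1, 0, 2)) // List(3.5, 2.0, 1.0, 3.5), matching the doc example
```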
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48951901 QA results for PR 1408:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16631/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---