[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71227419 Woohoo, looks like this is passing tests! The earlier failure was due to a known flaky streaming test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5383][SQL] Multi alias names support
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4182#issuecomment-71230268 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26028/ Test PASSed.
[GitHub] spark pull request: [SPARK-5384][mllib] Vectors.sqdist return inco...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4183#issuecomment-71232329 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26029/ Test PASSed.
[GitHub] spark pull request: [SPARK-5384][mllib] Vectors.sqdist return inco...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4183#issuecomment-71232323 [Test build #26029 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26029/consoleFull) for PR 4183 at commit [`54cbf97`](https://github.com/apache/spark/commit/54cbf97b3b08136ac77d7f2e6265aec9c5206a4b). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4809] Rework Guava library shading.
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/3658#issuecomment-71236254 Ping.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-71237900 [Test build #26031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26031/consoleFull) for PR 3916 at commit [`23aa2a9`](https://github.com/apache/spark/commit/23aa2a9c7a0e39987bc487c51e9ad70ecb972e8f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5384][mllib] Vectors.sqdist return inco...
GitHub user hhbyyh opened a pull request: https://github.com/apache/spark/pull/4183 [SPARK-5384][mllib] Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths JIRA issue: https://issues.apache.org/jira/browse/SPARK-5384 Currently `Vectors.sqdist` returns inconsistent results for sparse/dense vectors when the vectors have different lengths; please refer to the JIRA for a sample. PR scope: unify the sqdist logic for dense/sparse vectors to fix the inconsistency, and also remove the possible sparse-to-dense conversion in the original code. For reviewers: maybe we should first discuss what the correct behavior is. 1. Must vectors passed to sqdist have the same length, as in Breeze? 2. If they can have different lengths, what is the correct result for sqdist? (Should the extra part be included in the calculation?) I'll update the PR with more optimization and additional unit tests afterwards. Thanks. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hhbyyh/spark fixDouble Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4183.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4183 commit 54cbf97b3b08136ac77d7f2e6265aec9c5206a4b Author: Yuhao Yang hhb...@gmail.com Date: 2015-01-24T16:03:37Z fix Vectors.sqdist inconsistence
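To make the two behaviors under discussion concrete, here is a minimal sketch (not MLlib's actual implementation) of option 2 above: the shorter vector's missing tail is treated as zeros, so both vectors are effectively compared over the longer length. All names here are illustrative.

```scala
// Hypothetical sketch of "option 2": squared Euclidean distance between
// two vectors of possibly different lengths, treating the missing tail
// of the shorter vector as zeros. Illustrative only, NOT MLlib's code.
object SqDistSketch {
  def sqdist(a: Array[Double], b: Array[Double]): Double = {
    val n = math.max(a.length, b.length)
    var sum = 0.0
    var i = 0
    while (i < n) {
      val x = if (i < a.length) a(i) else 0.0  // pad shorter vector with 0.0
      val y = if (i < b.length) b(i) else 0.0
      val d = x - y
      sum += d * d
      i += 1
    }
    sum
  }
}
```

Under option 1 (Breeze-style), the same call with mismatched lengths would instead throw an error; the choice is which contract `Vectors.sqdist` should commit to.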
[GitHub] spark pull request: [SPARK-5384][mllib] Vectors.sqdist return inco...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4183#issuecomment-71220720 [Test build #26029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26029/consoleFull) for PR 4183 at commit [`54cbf97`](https://github.com/apache/spark/commit/54cbf97b3b08136ac77d7f2e6265aec9c5206a4b). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71225754 [Test build #26027 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26027/consoleFull) for PR 4155 at commit [`c334255`](https://github.com/apache/spark/commit/c3342552e03d690ac4beea939b5abd13363698c4). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` class OutputCommitCoordinatorActor(outputCommitCoordinator: OutputCommitCoordinator)`
[GitHub] spark pull request: [Minor][streaming][MQTT streaming] some trivia...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4178#issuecomment-71229958 [Test build #26030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26030/consoleFull) for PR 4178 at commit [`66919a3`](https://github.com/apache/spark/commit/66919a34ab1838f0f0dbc2ee76903532fa5117b8). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71225766 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26027/ Test PASSed.
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3884#issuecomment-71225070 [Test build #26026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26026/consoleFull) for PR 3884 at commit [`a943e00`](https://github.com/apache/spark/commit/a943e00fd76d1b84a598fa449b5abd99074c2c62). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3884#issuecomment-71225077 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26026/ Test PASSed.
[GitHub] spark pull request: [SPARK-5383][SQL] Multi alias names support
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4182#issuecomment-71230259 [Test build #26028 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26028/consoleFull) for PR 4182 at commit [`9b7e7c9`](https://github.com/apache/spark/commit/9b7e7c9aa02a2a29eab1c7ba08ee681543904d19). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Alias(child: Expression, names: Seq[String])`
[GitHub] spark pull request: [SPARK-5291][CORE] Add timestamp and reason wh...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4082#issuecomment-71248520 @ksakellis
[GitHub] spark pull request: [Minor][streaming][MQTT streaming] some trivia...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4178#issuecomment-71241536 [Test build #26030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26030/consoleFull) for PR 4178 at commit [`66919a3`](https://github.com/apache/spark/commit/66919a34ab1838f0f0dbc2ee76903532fa5117b8). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/3884#issuecomment-71244412 Scala changes look ok to me; I'm not super familiar with the pyspark internals, but the check on `rdd.py` surprised me because I thought RDDs were actually serialized, at least on the Scala side.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23473037 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable +import scala.concurrent.duration.FiniteDuration + +import akka.actor.{PoisonPill, ActorRef, Actor} --- End diff -- super nit: sort imports (here and elsewhere)
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71252591 I had this (unfounded) notion that tasks knew whether they were speculative or not, and thus the non-speculative ones would be able to avoid this extra hop to the driver and just commit things. But it seems that's not the case (and it sort of makes sense, in case the speculative task finishes first), so I guess this approach is fine. One thing that worries me a bit is that I've been told before that akka actors' `onReceive` methods are single-threaded (meaning they'll never be called concurrently, even for messages coming from different remote endpoints). That can become a bottleneck on really large jobs. If that's really true, we should probably look at decoupling the processing of the message from the `onReceive` method so that multiple executors can be serviced concurrently.
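The decoupling vanzin describes can be sketched without any actor framework: the (conceptually single-threaded) receive path only enqueues work onto a pool, so slow per-message processing no longer serializes requests from different executors. All names below (`CommitRequest`, `DecoupledReceiver`, `handleCommit`) are hypothetical, not Spark's actual API.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical sketch: decouple message handling from a single-threaded
// onReceive. The receive loop only submits work to a fixed pool, so
// handlers for different executors can run concurrently.
case class CommitRequest(stage: Int, task: Long, attempt: Long)

class DecoupledReceiver(poolSize: Int) {
  private val pool = Executors.newFixedThreadPool(poolSize)

  // Called from the single-threaded receive loop; returns immediately.
  def onReceive(msg: CommitRequest)(handleCommit: CommitRequest => Unit): Unit = {
    pool.submit(new Runnable {
      override def run(): Unit = handleCommit(msg) // runs off the receive thread
    })
  }

  def shutdown(): Unit = {
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}
```

The trade-off is that the handler itself must now be thread-safe, which is exactly the synchronization concern raised later in this thread.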
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-71255179 [Test build #26031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26031/consoleFull) for PR 3916 at commit [`23aa2a9`](https://github.com/apache/spark/commit/23aa2a9c7a0e39987bc487c51e9ad70ecb972e8f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71259112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26033/ Test FAILed.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71259100 [Test build #26033 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26033/consoleFull) for PR 4173 at commit [`23b2c2d`](https://github.com/apache/spark/commit/23b2c2d1bfb6e6504e3357af5027af579020b22e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23478673 --- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala --- @@ -106,18 +107,25 @@ class SparkHadoopWriter(@transient jobConf: JobConf) val taCtxt = getTaskContext() val cmtr = getOutputCommitter() if (cmtr.needsTaskCommit(taCtxt)) { - try { -cmtr.commitTask(taCtxt) -logInfo(taID + ": Committed") - } catch { -case e: IOException => { - logError("Error committing the output of task: " + taID.value, e) - cmtr.abortTask(taCtxt) - throw e + val outputCommitCoordinator = SparkEnv.get.outputCommitCoordinator + val conf = SparkEnv.get.conf + val canCommit: Boolean = outputCommitCoordinator.canCommit(jobID, splitID, attemptID) + if (canCommit) { --- End diff -- It would force a new task to recompute everything, but this does highlight that task 2 should throw an error, @JoshRosen?
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71262056 > We do actually need the processing to be single threaded, as trying to coordinate synchronization on the centralized arbitration logic is a bit of a nightmare. I'm not so convinced; you'd only have a conflict if two tasks are concurrently asking to update the state of the same split ID. Otherwise, state updates can happen in parallel. E.g. if you know all the split IDs up front, you can initialize the data structure to hold all the state; when a commit request arrives, you only lock that particular state object, so requests that arrive for other split IDs can be processed in parallel. (If you don't know all the split IDs up front, you can use something simple like `ConcurrentHashMap` or `ConcurrentSkipListMap`, depending on what performance characteristics you want.)
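The per-split locking scheme vanzin proposes can be sketched in a few lines: arbitration state lives in a `ConcurrentHashMap` keyed by split ID, and only the state object for the requested split is locked, so commits for different splits never contend. This is an illustrative sketch, not Spark's actual coordinator; the class and field names are hypothetical.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch of per-split commit arbitration with fine-grained
// locking: each split's state is created lazily via computeIfAbsent, and
// only that state object is synchronized on, so requests for different
// split IDs proceed in parallel.
class SplitCommitArbiter {
  private class SplitState { var authorizedAttempt: Option[Long] = None }
  private val states = new ConcurrentHashMap[Int, SplitState]()

  /** Grant commit permission to the first attempt that asks for a given split. */
  def canCommit(splitId: Int, attemptId: Long): Boolean = {
    val state = states.computeIfAbsent(splitId, _ => new SplitState)
    state.synchronized {
      state.authorizedAttempt match {
        case None    => state.authorizedAttempt = Some(attemptId); true
        case Some(a) => a == attemptId // only the already-authorized attempt may commit
      }
    }
  }
}
```

`computeIfAbsent` guarantees a single `SplitState` per key even under concurrent first requests, which is what makes the per-split lock safe without a global one.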
[GitHub] spark pull request: [SQL] SPARK-5309: Use Dictionary for Binary-S...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4139#issuecomment-71246755 [Test build #26035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26035/consoleFull) for PR 4139 at commit [`f383c15`](https://github.com/apache/spark/commit/f383c15b64ad0d674c09b70dd632f9a93fce44f6). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [Minor][streaming][MQTT streaming] some trivia...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4178#issuecomment-71241547 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26030/ Test PASSed.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23473132 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +package org.apache.spark.scheduler + +import scala.collection.mutable +import scala.concurrent.duration.FiniteDuration + +import akka.actor.{PoisonPill, ActorRef, Actor} + +import org.apache.spark.{SparkConf, Logging} +import org.apache.spark.util.{AkkaUtils, ActorLogReceive} + +private[spark] sealed trait OutputCommitCoordinationMessage extends Serializable + +private[spark] case class StageStarted(stage: Int) extends OutputCommitCoordinationMessage +private[spark] case class StageEnded(stage: Int) extends OutputCommitCoordinationMessage +private[spark] case object StopCoordinator extends OutputCommitCoordinationMessage + +private[spark] case class AskPermissionToCommitOutput( +stage: Int, +task: Long, +taskAttempt: Long) +extends OutputCommitCoordinationMessage + +private[spark] case class TaskCompleted( +stage: Int, +task: Long, +attempt: Long, +successful: Boolean) +extends OutputCommitCoordinationMessage + +/** + * Authority that decides whether tasks can commit output to HDFS. + * + * This lives on the driver, but the actor allows the tasks that commit + * to Hadoop to invoke it. + */ +private[spark] class OutputCommitCoordinator(conf: SparkConf) extends Logging { + + // Initialized by SparkEnv + var coordinatorActor: Option[ActorRef] = None + private val timeout = AkkaUtils.askTimeout(conf) + private val maxAttempts = AkkaUtils.numRetries(conf) + private val retryInterval = AkkaUtils.retryWaitMs(conf) + + private type StageId = Int + private type TaskId = Long + private type TaskAttemptId = Long + + private val authorizedCommittersByStage: + mutable.Map[StageId, mutable.Map[TaskId, TaskAttemptId]] = mutable.HashMap() + + def stageStart(stage: StageId) { +sendToActor(StageStarted(stage)) + } + def stageEnd(stage: StageId) { --- End diff -- super nit: missing an empty line between methods.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23478213 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -808,6 +810,7 @@ class DAGScheduler( // will be posted, which should always come after a corresponding SparkListenerStageSubmitted // event. stage.latestInfo = StageInfo.fromStage(stage, Some(partitionsToCompute.size)) +outputCommitCoordinator.stageStart(stage.id) --- End diff -- I wonder if it wouldn't be better to use a `SparkListener` to reduce coupling. Although that would potentially introduce race conditions in the code (since `LiveListenerBus` fires events on a separate thread).
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23478509 --- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala --- @@ -106,18 +107,25 @@ class SparkHadoopWriter(@transient jobConf: JobConf) val taCtxt = getTaskContext() val cmtr = getOutputCommitter() if (cmtr.needsTaskCommit(taCtxt)) { - try { -cmtr.commitTask(taCtxt) -logInfo(taID + ": Committed") - } catch { -case e: IOException => { - logError("Error committing the output of task: " + taID.value, e) - cmtr.abortTask(taCtxt) - throw e + val outputCommitCoordinator = SparkEnv.get.outputCommitCoordinator + val conf = SparkEnv.get.conf + val canCommit: Boolean = outputCommitCoordinator.canCommit(jobID, splitID, attemptID) + if (canCommit) { --- End diff -- Hmm. I wonder if this can be a problem. Given the following timeline: 1. task 1 starts 2. task 1 asks for permission to commit; it's granted 3. task 1 fails to commit 4. task 2 starts (doing the same work as task 1) 5. task 2 asks for permission to commit; it's denied Wouldn't this code force a new task to be run to recompute everything? Also, wouldn't task 2 actually report itself as successful, and break things, since there is a successful task for that particular split, but it was never committed?
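The race in that timeline can be reduced to a small sketch: if commit authorization is never revoked when the authorized attempt fails, a later attempt for the same split is wrongly denied. A coordinator therefore needs a failure callback that frees the slot. This is an illustrative model only; the class and method names are hypothetical, not Spark's.

```scala
// Hypothetical sketch of the timeline above: authorization for a split
// must be revoked when the authorized attempt fails to commit, so that
// a retry (task 2) can be granted instead of denied forever.
class CommitAuthorizer {
  private var authorized = Map.empty[Int, Long] // splitId -> authorized attemptId

  def canCommit(splitId: Int, attemptId: Long): Boolean = synchronized {
    authorized.get(splitId) match {
      case None    => authorized += splitId -> attemptId; true
      case Some(a) => a == attemptId
    }
  }

  /** Called when a task attempt ends in failure; frees the split's slot. */
  def attemptFailed(splitId: Int, attemptId: Long): Unit = synchronized {
    if (authorized.get(splitId).contains(attemptId)) authorized -= splitId
  }
}
```

Without the `attemptFailed` hook, step 5 of the timeline denies task 2 and the split can never be committed, which is exactly the problem being raised.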
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71261373 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26034/ Test FAILed.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71261364 [Test build #26034 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26034/consoleFull) for PR 4173 at commit [`38df669`](https://github.com/apache/spark/commit/38df6699c77fcaeb505350bcc73c5614814efa5d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23473536

--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -19,12 +19,13 @@ package org.apache.spark.scheduler
 import scala.collection.mutable.{ArrayBuffer, HashSet, HashMap, Map}
 import scala.language.reflectiveCalls
-import scala.util.control.NonFatal
 import org.scalatest.{BeforeAndAfter, FunSuiteLike}
 import org.scalatest.concurrent.Timeouts
 import org.scalatest.time.SpanSugar._
+import org.mockito.Mockito.mock
--- End diff --

super nit: group with `org.scalatest`
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23474074

--- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable
+import scala.concurrent.duration.FiniteDuration
+
+import akka.actor.{PoisonPill, ActorRef, Actor}
+
+import org.apache.spark.{SparkConf, Logging}
+import org.apache.spark.util.{AkkaUtils, ActorLogReceive}
+
+private[spark] sealed trait OutputCommitCoordinationMessage extends Serializable
+
+private[spark] case class StageStarted(stage: Int) extends OutputCommitCoordinationMessage
+private[spark] case class StageEnded(stage: Int) extends OutputCommitCoordinationMessage
+private[spark] case object StopCoordinator extends OutputCommitCoordinationMessage
+
+private[spark] case class AskPermissionToCommitOutput(
+    stage: Int,
+    task: Long,
+    taskAttempt: Long)
+  extends OutputCommitCoordinationMessage
+
+private[spark] case class TaskCompleted(
+    stage: Int,
+    task: Long,
+    attempt: Long,
+    successful: Boolean)
+  extends OutputCommitCoordinationMessage
+
+/**
+ * Authority that decides whether tasks can commit output to HDFS.
+ *
+ * This lives on the driver, but the actor allows the tasks that commit
+ * to Hadoop to invoke it.
+ */
+private[spark] class OutputCommitCoordinator(conf: SparkConf) extends Logging {
+
+  // Initialized by SparkEnv
+  var coordinatorActor: Option[ActorRef] = None
+  private val timeout = AkkaUtils.askTimeout(conf)
+  private val maxAttempts = AkkaUtils.numRetries(conf)
+  private val retryInterval = AkkaUtils.retryWaitMs(conf)
+
+  private type StageId = Int
+  private type TaskId = Long
+  private type TaskAttemptId = Long
+
+  private val authorizedCommittersByStage:
+    mutable.Map[StageId, mutable.Map[TaskId, TaskAttemptId]] = mutable.HashMap()
+
+  def stageStart(stage: StageId) {
+    sendToActor(StageStarted(stage))
+  }
+
+  def stageEnd(stage: StageId) {
+    sendToActor(StageEnded(stage))
+  }
+
+  def canCommit(
+      stage: StageId,
+      task: TaskId,
+      attempt: TaskAttemptId): Boolean = {
+    askActor(AskPermissionToCommitOutput(stage, task, attempt))
+  }
+
+  def taskCompleted(
+      stage: StageId,
+      task: TaskId,
+      attempt: TaskAttemptId,
+      successful: Boolean) {
+    sendToActor(TaskCompleted(stage, task, attempt, successful))
+  }
+
+  def stop() {
--- End diff --

Minor, but I think it's slightly weird that this class mixes methods that should only be called from the driver (such as `stop`) and methods that executors can call safely. Perhaps a check here that this is only being called on the driver side?
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user mccheah commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71253931

I'm also concerned about the performance ramifications of this; we need to run performance benchmarks. However, the only critical path affected is tasks that are explicitly saving to a Hadoop file. When a task completes, the DAGScheduler sends a message to the OutputCommitCoordinator actor, so the DAGScheduler is not blocked by this logic.

We do actually need the processing to be single-threaded, as trying to coordinate synchronization on the centralized arbitration logic is a bit of a nightmare. I mean, we could allow multiple threads to access the internal state of OutputCommitCoordinator and implement appropriate synchronization logic.

I considered an optimization where the driver broadcasts to executors when tasks are being speculated; the executors of the original tasks would then know to check the commit authorization, and skip it for tasks that don't have speculated copies. There are a lot of race conditions that arise from that, though, which further underlines the need to centralize everything.
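The single-threaded design point above (an actor serializing all permission requests, so the shared table needs no locking) can be sketched in a few lines of plain Python. This is an illustrative stand-in for the actor pattern, not Spark's code; all names are hypothetical:

```python
import queue
import threading

# One worker thread owns the authorization table; callers send requests
# through a queue and block on a per-request reply queue (an "ask").
class SingleThreadedCoordinator:
    def __init__(self):
        self._requests = queue.Queue()
        self._authorized = {}  # (stage, partition) -> winning attempt id
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            key, attempt, reply = self._requests.get()
            # Only this thread ever touches _authorized, so no lock is needed.
            winner = self._authorized.setdefault(key, attempt)
            reply.put(winner == attempt)

    def can_commit(self, stage, partition, attempt):
        reply = queue.Queue(maxsize=1)
        self._requests.put(((stage, partition), attempt, reply))
        return reply.get()  # block until the worker answers

coord = SingleThreadedCoordinator()
print(coord.can_commit(0, 0, attempt=1))  # True: first attempt wins
print(coord.can_commit(0, 0, attempt=2))  # False: speculative copy denied
```

Funneling every decision through one thread trades a small amount of latency on the commit path for freedom from the synchronization bugs the comment describes.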
[GitHub] spark pull request: [SQL] SPARK-5309: Use Dictionary for Binary-S...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4139#issuecomment-71261303 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26035/ Test FAILed.
[GitHub] spark pull request: [SQL] SPARK-5309: Use Dictionary for Binary-S...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4139#issuecomment-71261296 [Test build #26035 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26035/consoleFull) for PR 4139 at commit [`f383c15`](https://github.com/apache/spark/commit/f383c15b64ad0d674c09b70dd632f9a93fce44f6). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/3884#discussion_r23470546

--- Diff: python/pyspark/rdd.py ---
@@ -141,6 +141,17 @@ def id(self):
     def __repr__(self):
         return self._jrdd.toString()
+    def __getnewargs__(self):
+        # This method is called when attempting to pickle an RDD, which is always an error:
+        raise Exception(
+            "It appears that you are attempting to broadcast an RDD or reference an RDD from an "
+            "action or transforamtion. RDD transformations and actions can only be invoked by the"
--- End diff --

typo: transforamtion
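For readers unfamiliar with the hook this diff relies on: `__getnewargs__` is consulted by `pickle` (protocol 2 and above) while serializing an object, so raising from it turns any accidental pickling into an immediate, descriptive error. A minimal stand-in class (not the real pyspark RDD) demonstrating the mechanism:

```python
import pickle

# Stand-in for a class that must never be serialized, such as an RDD
# captured in a closure. Raising in __getnewargs__ fails the pickle early
# with a message explaining the mistake.
class UnpicklableResource:
    def __getnewargs__(self):
        raise Exception(
            "It appears that you are attempting to serialize an object "
            "that can only be used on the driver.")

try:
    pickle.dumps(UnpicklableResource())
except Exception as e:
    print("pickling failed:", e)
```

This is the same trick the diff applies to `RDD`: the error surfaces at serialization time, on the driver, instead of as an opaque failure on an executor.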
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71244959 [Test build #26033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26033/consoleFull) for PR 4173 at commit [`23b2c2d`](https://github.com/apache/spark/commit/23b2c2d1bfb6e6504e3357af5027af579020b22e). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71246679 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26032/ Test FAILed.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71246730 [Test build #26034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26034/consoleFull) for PR 4173 at commit [`38df669`](https://github.com/apache/spark/commit/38df6699c77fcaeb505350bcc73c5614814efa5d). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-71255189 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26031/ Test PASSed.
[GitHub] spark pull request: [SPARK-5351][GraphX] Do not use Partitioner.de...
Github user ankurdave commented on the pull request: https://github.com/apache/spark/pull/4136#issuecomment-71265542 @JoshRosen No, it doesn't seem to trigger the Snappy error! After the previous attempted fix (#1763, 9b225ac3072de522b40b46aba6df1f1c231f13ef), the GraphX unit tests (`for i in {1..10}; do sbt/sbt 'graphx/test:test-only org.apache.spark.graphx.*'; done`) would fail 3 out of 10 times, but they always succeed now. I think we can merge this! I'm just going to bisect to see what fixed the error.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23488297 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -0,0 +1,273 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the License); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an AS IS BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. 
+*/ + +package org.apache.spark.sql + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +import com.fasterxml.jackson.core.JsonFactory + +import org.apache.spark.annotation.Experimental +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.sql.catalyst.ScalaReflection +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.{Literal = LiteralExpr} +import org.apache.spark.sql.catalyst.plans.{JoinType, Inner} +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.execution.LogicalRDD +import org.apache.spark.sql.json.JsonRDD +import org.apache.spark.sql.types.{NumericType, StructType} + + +class DataFrame( +val sqlContext: SQLContext, +val baseLogicalPlan: LogicalPlan, +operatorsEnabled: Boolean) + extends DataFrameSpecificApi with RDDApi[Row] { + + def this(sqlContext: Option[SQLContext], plan: Option[LogicalPlan]) = +this(sqlContext.orNull, plan.orNull, sqlContext.isDefined plan.isDefined) + + def this(sqlContext: SQLContext, plan: LogicalPlan) = this(sqlContext, plan, true) + + @transient + protected[sql] lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan) + + @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan match { +// For various commands (like DDL) and queries with side effects, we force query optimization to +// happen right away to let these side effects take place eagerly. 
+case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] |_: WriteToFile = + LogicalRDD(queryExecution.analyzed.output, queryExecution.toRdd)(sqlContext) +case _ = + baseLogicalPlan + } + + private[this] implicit def toDataFrame(logicalPlan: LogicalPlan): DataFrame = { +new DataFrame(sqlContext, logicalPlan, true) + } + + protected[sql] def numericColumns: Seq[Expression] = { +schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map { n = + logicalPlan.resolve(n.name, sqlContext.analyzer.resolver).get +} + } + + protected[sql] def resolve(colName: String): NamedExpression = { +logicalPlan.resolve(colName, sqlContext.analyzer.resolver).getOrElse( + throw new RuntimeException(sCannot resolve column name $colName)) + } + + def toSchemaRDD: DataFrame = this + + override def schema: StructType = queryExecution.analyzed.schema + + override def dtypes: Array[(String, String)] = schema.fields.map { field = +(field.name, field.dataType.toString) + } + + override def columns: Array[String] = schema.fields.map(_.name) + + override def printSchema(): Unit = println(schema.treeString) + + override def show(): Unit = { +??? + } + + override def join(right: DataFrame): DataFrame = { +Join(logicalPlan, right.logicalPlan, joinType = Inner, None) + } + + override def join(right: DataFrame, joinExprs: Column): DataFrame = { +Join(logicalPlan, right.logicalPlan, Inner, Some(joinExprs.expr)) + } + + override def join(right: DataFrame, joinType: String, joinExprs: Column): DataFrame = { +Join(logicalPlan, right.logicalPlan, JoinType(joinType), Some(joinExprs.expr)) + } + + override def sort(colName: String): DataFrame = { +Sort(Seq(SortOrder(apply(colName).expr, Ascending)), global = true, logicalPlan) + } + + @scala.annotation.varargs + override def sort(sortExpr: Column, sortExprs: Column*): DataFrame = { +
[GitHub] spark pull request: [SPARK-5384][mllib] Vectors.sqdist return inco...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4183#issuecomment-71276246 I agree that vectors must have the same length and we should check it. It may not be necessary to change the implementation, though: I saw a couple of performance issues in your code, for example unnecessary index lookups. I would suggest only adding the check in this PR. If you want to update the implementation, let's do it in another PR with a micro-benchmark.
[GitHub] spark pull request: [SPARK-5207] [MLLIB] StandardScalerModel mean ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4140#discussion_r23486231

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -61,20 +61,30 @@ class StandardScaler(withMean: Boolean, withStd: Boolean) extends Logging {
  * :: Experimental ::
  * Represents a StandardScaler model that can transform vectors.
  *
- * @param withMean whether to center the data before scaling
- * @param withStd whether to scale the data to have unit standard deviation
  * @param mean column mean values
  * @param variance column variance values
+ * @param withMean whether to center the data before scaling
+ * @param withStd whether to scale the data to have unit standard deviation
  */
 @Experimental
-class StandardScalerModel private[mllib] (
-    val withMean: Boolean,
-    val withStd: Boolean,
+class StandardScalerModel (
     val mean: Vector,
-    val variance: Vector) extends VectorTransformer {
+    val variance: Vector,
+    private var withMean: Boolean = false,
+    private var withStd: Boolean = true) extends VectorTransformer {
--- End diff --

Also, users will want to know whether `withMean` or `withStd` is used; do we really need to have them as private variables?
[GitHub] spark pull request: [SPARK-5207] [MLLIB] StandardScalerModel mean ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/4140#issuecomment-71281849 For the unit-test part, is it possible not to change too much? Also, it will be easier to debug if the assertions are in the test instead of abstracted out; for example, the `validateConstant` function is not necessary, and it is probably easier to read with all the assert code inline in the test. Having the data as global variables is okay for me. Thanks.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23488167 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -0,0 +1,273 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the License); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an AS IS BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. 
+*/ + +package org.apache.spark.sql + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +import com.fasterxml.jackson.core.JsonFactory + +import org.apache.spark.annotation.Experimental +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.sql.catalyst.ScalaReflection +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.{Literal = LiteralExpr} +import org.apache.spark.sql.catalyst.plans.{JoinType, Inner} +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.execution.LogicalRDD +import org.apache.spark.sql.json.JsonRDD +import org.apache.spark.sql.types.{NumericType, StructType} + + +class DataFrame( +val sqlContext: SQLContext, +val baseLogicalPlan: LogicalPlan, +operatorsEnabled: Boolean) + extends DataFrameSpecificApi with RDDApi[Row] { + + def this(sqlContext: Option[SQLContext], plan: Option[LogicalPlan]) = +this(sqlContext.orNull, plan.orNull, sqlContext.isDefined plan.isDefined) + + def this(sqlContext: SQLContext, plan: LogicalPlan) = this(sqlContext, plan, true) + + @transient + protected[sql] lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan) + + @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan match { +// For various commands (like DDL) and queries with side effects, we force query optimization to +// happen right away to let these side effects take place eagerly. 
+case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] |_: WriteToFile = + LogicalRDD(queryExecution.analyzed.output, queryExecution.toRdd)(sqlContext) +case _ = + baseLogicalPlan + } + + private[this] implicit def toDataFrame(logicalPlan: LogicalPlan): DataFrame = { +new DataFrame(sqlContext, logicalPlan, true) + } + + protected[sql] def numericColumns: Seq[Expression] = { +schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map { n = + logicalPlan.resolve(n.name, sqlContext.analyzer.resolver).get +} + } + + protected[sql] def resolve(colName: String): NamedExpression = { +logicalPlan.resolve(colName, sqlContext.analyzer.resolver).getOrElse( + throw new RuntimeException(sCannot resolve column name $colName)) + } + + def toSchemaRDD: DataFrame = this + + override def schema: StructType = queryExecution.analyzed.schema + + override def dtypes: Array[(String, String)] = schema.fields.map { field = +(field.name, field.dataType.toString) + } + + override def columns: Array[String] = schema.fields.map(_.name) + + override def printSchema(): Unit = println(schema.treeString) + + override def show(): Unit = { +??? + } + + override def join(right: DataFrame): DataFrame = { +Join(logicalPlan, right.logicalPlan, joinType = Inner, None) + } + + override def join(right: DataFrame, joinExprs: Column): DataFrame = { +Join(logicalPlan, right.logicalPlan, Inner, Some(joinExprs.expr)) + } + + override def join(right: DataFrame, joinType: String, joinExprs: Column): DataFrame = { +Join(logicalPlan, right.logicalPlan, JoinType(joinType), Some(joinExprs.expr)) + } + + override def sort(colName: String): DataFrame = { --- End diff -- support sort by multiple columns --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23488501 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -0,0 +1,273 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the License); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an AS IS BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. 
+*/ + +package org.apache.spark.sql + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +import com.fasterxml.jackson.core.JsonFactory + +import org.apache.spark.annotation.Experimental +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.sql.catalyst.ScalaReflection +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.{Literal = LiteralExpr} +import org.apache.spark.sql.catalyst.plans.{JoinType, Inner} +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.execution.LogicalRDD +import org.apache.spark.sql.json.JsonRDD +import org.apache.spark.sql.types.{NumericType, StructType} + + +class DataFrame( +val sqlContext: SQLContext, +val baseLogicalPlan: LogicalPlan, +operatorsEnabled: Boolean) + extends DataFrameSpecificApi with RDDApi[Row] { + + def this(sqlContext: Option[SQLContext], plan: Option[LogicalPlan]) = +this(sqlContext.orNull, plan.orNull, sqlContext.isDefined plan.isDefined) + + def this(sqlContext: SQLContext, plan: LogicalPlan) = this(sqlContext, plan, true) + + @transient + protected[sql] lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan) + + @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan match { +// For various commands (like DDL) and queries with side effects, we force query optimization to +// happen right away to let these side effects take place eagerly. 
+case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] |_: WriteToFile = + LogicalRDD(queryExecution.analyzed.output, queryExecution.toRdd)(sqlContext) +case _ = + baseLogicalPlan + } + + private[this] implicit def toDataFrame(logicalPlan: LogicalPlan): DataFrame = { +new DataFrame(sqlContext, logicalPlan, true) + } + + protected[sql] def numericColumns: Seq[Expression] = { +schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map { n = + logicalPlan.resolve(n.name, sqlContext.analyzer.resolver).get +} + } + + protected[sql] def resolve(colName: String): NamedExpression = { +logicalPlan.resolve(colName, sqlContext.analyzer.resolver).getOrElse( + throw new RuntimeException(sCannot resolve column name $colName)) + } + + def toSchemaRDD: DataFrame = this + + override def schema: StructType = queryExecution.analyzed.schema + + override def dtypes: Array[(String, String)] = schema.fields.map { field = +(field.name, field.dataType.toString) + } + + override def columns: Array[String] = schema.fields.map(_.name) + + override def printSchema(): Unit = println(schema.treeString) + + override def show(): Unit = { +??? + } + + override def join(right: DataFrame): DataFrame = { +Join(logicalPlan, right.logicalPlan, joinType = Inner, None) + } + + override def join(right: DataFrame, joinExprs: Column): DataFrame = { +Join(logicalPlan, right.logicalPlan, Inner, Some(joinExprs.expr)) + } + + override def join(right: DataFrame, joinType: String, joinExprs: Column): DataFrame = { +Join(logicalPlan, right.logicalPlan, JoinType(joinType), Some(joinExprs.expr)) + } + + override def sort(colName: String): DataFrame = { +Sort(Seq(SortOrder(apply(colName).expr, Ascending)), global = true, logicalPlan) + } + + @scala.annotation.varargs + override def sort(sortExpr: Column, sortExprs: Column*): DataFrame = { +
[GitHub] spark pull request: [SPARK-5291][CORE] Add timestamp and reason wh...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4082#issuecomment-71272446 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-71284930 [Test build #26036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26036/consoleFull) for PR 1290 at commit [`d18e9b5`](https://github.com/apache/spark/commit/d18e9b5460019970d5bcbb5a0e816aff5a05bf39). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23488078 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+import com.fasterxml.jackson.core.JsonFactory
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.LogicalRDD
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+
+
+class DataFrame(
+    val sqlContext: SQLContext,
+    val baseLogicalPlan: LogicalPlan,
+    operatorsEnabled: Boolean)
+  extends DataFrameSpecificApi with RDDApi[Row] {
+
+  def this(sqlContext: Option[SQLContext], plan: Option[LogicalPlan]) =
+    this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && plan.isDefined)
+
+  def this(sqlContext: SQLContext, plan: LogicalPlan) = this(sqlContext, plan, true)
+
+  @transient
+  protected[sql] lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan)
+
+  @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan match {
+    // For various commands (like DDL) and queries with side effects, we force query optimization to
+    // happen right away to let these side effects take place eagerly.
+    case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] | _: WriteToFile =>
+      LogicalRDD(queryExecution.analyzed.output, queryExecution.toRdd)(sqlContext)
+    case _ =>
+      baseLogicalPlan
+  }
+
+  private[this] implicit def toDataFrame(logicalPlan: LogicalPlan): DataFrame = {
+    new DataFrame(sqlContext, logicalPlan, true)
+  }
+
+  protected[sql] def numericColumns: Seq[Expression] = {
+    schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map { n =>
+      logicalPlan.resolve(n.name, sqlContext.analyzer.resolver).get
+    }
+  }
+
+  protected[sql] def resolve(colName: String): NamedExpression = {
+    logicalPlan.resolve(colName, sqlContext.analyzer.resolver).getOrElse(
+      throw new RuntimeException(s"Cannot resolve column name $colName"))
+  }
+
+  def toSchemaRDD: DataFrame = this
+
+  override def schema: StructType = queryExecution.analyzed.schema
+
+  override def dtypes: Array[(String, String)] = schema.fields.map { field =>
+    (field.name, field.dataType.toString)
+  }
+
+  override def columns: Array[String] = schema.fields.map(_.name)
+
+  override def printSchema(): Unit = println(schema.treeString)
+
+  override def show(): Unit = {
+    ???
+  }
+
+  override def join(right: DataFrame): DataFrame = {
+    Join(logicalPlan, right.logicalPlan, joinType = Inner, None)
+  }
+
+  override def join(right: DataFrame, joinExprs: Column): DataFrame = {
+    Join(logicalPlan, right.logicalPlan, Inner, Some(joinExprs.expr))
+  }
+
+  override def join(right: DataFrame, joinType: String, joinExprs: Column): DataFrame = {
--- End diff --

It's easier to do in Python/R if putting joinType at the end
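To illustrate the point about argument order, here is a hedged, Spark-free Scala sketch (the `Plan` type and method body are invented for illustration, not the actual DataFrame API): with the optional `joinType` last, a single method with a default value covers both call styles, which maps naturally onto keyword arguments in Python/R wrappers.

```scala
// Illustrative only: a stand-in for a logical plan.
case class Plan(desc: String)

// Optional joinType last means a default argument can supply the common case.
def join(right: Plan, joinExprs: String, joinType: String = "inner"): Plan =
  Plan(s"$joinType join on $joinExprs with ${right.desc}")

// Callers may omit the trailing argument...
val a = join(Plan("t2"), "id")
// ...or override it by position (or by name in Scala/Python/R).
val b = join(Plan("t2"), "id", "outer")
```

With `joinType` in the middle instead, every caller supplying `joinExprs` would also have to spell out the join type, and dynamic-language wrappers could not forward keyword defaults as cleanly.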
[GitHub] spark pull request: [SPARK-5207] [MLLIB] StandardScalerModel mean ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4140#discussion_r23485163 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
@@ -61,20 +61,30 @@ class StandardScaler(withMean: Boolean, withStd: Boolean) extends Logging {
  * :: Experimental ::
  * Represents a StandardScaler model that can transform vectors.
  *
- * @param withMean whether to center the data before scaling
- * @param withStd whether to scale the data to have unit standard deviation
  * @param mean column mean values
  * @param variance column variance values
+ * @param withMean whether to center the data before scaling
+ * @param withStd whether to scale the data to have unit standard deviation
  */
 @Experimental
-class StandardScalerModel private[mllib] (
-    val withMean: Boolean,
-    val withStd: Boolean,
+class StandardScalerModel (
     val mean: Vector,
-    val variance: Vector) extends VectorTransformer {
+    val variance: Vector,
+    private var withMean: Boolean = false,
+    private var withStd: Boolean = true) extends VectorTransformer {
--- End diff --

The default argument is not friendly for Java though; why don't we add another constructor which takes only mean and variance?
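A minimal sketch of the alternative being suggested, using invented names rather than the real MLlib classes: an auxiliary constructor gives Java callers the two-argument form that Scala default arguments cannot, since Java has no way to omit defaulted parameters.

```scala
// Hypothetical stand-in for the MLlib model (Array[Double] instead of Vector).
class ScalerModel(
    val mean: Array[Double],
    val variance: Array[Double],
    val withMean: Boolean,
    val withStd: Boolean) {

  // Java-friendly auxiliary constructor: fills in the common defaults
  // (no centering, unit-variance scaling) without default arguments.
  def this(mean: Array[Double], variance: Array[Double]) =
    this(mean, variance, false, true)
}

val m = new ScalerModel(Array(0.0), Array(1.0))
```

From Java this would be callable as `new ScalerModel(mean, variance)`, whereas a default-argument-only primary constructor would force Java callers to pass all four values.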
[GitHub] spark pull request: [SPARK-5291][CORE] Add timestamp and reason wh...
Github user ksakellis commented on the pull request: https://github.com/apache/spark/pull/4082#issuecomment-71282577 LGTM - nice addition.
[GitHub] spark pull request: SPARK-984 [BUILD] SPARK_TOOLS_JAR not set if m...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/4181 SPARK-984 [BUILD] SPARK_TOOLS_JAR not set if multiple tools jars exists Given the discussion in https://issues.apache.org/jira/browse/SPARK-984, this seems to be the outcome, but I'm not 100% sure if this is still the desired resolution. Simpler than modifying the scripts to deal with multiple tools assemblies if in fact these tools are not run specially this way by `spark-class`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-984 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4181.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4181 commit 83590eae39a7cb3ef13d3060e0f001564c7aed73 Author: Sean Owen so...@cloudera.com Date: 2015-01-23T12:27:35Z Remove SPARK_TOOLS_JAR usages
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3519#issuecomment-71189621 [Test build #26025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26025/consoleFull) for PR 3519 at commit [`12151e6`](https://github.com/apache/spark/commit/12151e6b40e70c5d0a8dde8a6e4d600709eb0f12). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-984 [BUILD] SPARK_TOOLS_JAR not set if m...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4181#issuecomment-71194064 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26024/ Test FAILed.
[GitHub] spark pull request: SPARK-984 [BUILD] SPARK_TOOLS_JAR not set if m...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4181#issuecomment-71194057 [Test build #26024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26024/consoleFull) for PR 4181 at commit [`83590ea`](https://github.com/apache/spark/commit/83590eae39a7cb3ef13d3060e0f001564c7aed73). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-984 [BUILD] SPARK_TOOLS_JAR not set if m...
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/4181
[GitHub] spark pull request: SPARK-984 [BUILD] SPARK_TOOLS_JAR not set if m...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4181#issuecomment-71194299 Ah. This makes MiMa stop working. OK, this isn't an option!
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/4173 [SPARK-5097][WIP] DataFrame as the common abstraction for structured data This is early work in progress. I am submitting the PR mainly because I wanted to get Jenkins to run through the tests so I don't have to do that on my machine. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark df1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4173.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4173 commit 08d82010d974b70ab44715a879785488356b408f Author: Reynold Xin r...@databricks.com Date: 2015-01-22T07:47:19Z Checkpoint: SQL module compiles! commit 3ccf3217d482f9c38d8122d185bb0a041e772d0e Author: Reynold Xin r...@databricks.com Date: 2015-01-22T08:04:32Z SQLContext minor patch. commit 83e872140e75c1b353479f0c7a6ff3501f609646 Author: Reynold Xin r...@databricks.com Date: 2015-01-22T08:17:22Z Fixed test cases in SQL except ParquetIOSuite. commit 9e4a7d063e0cdf9ef83793eeb4808f290130b435 Author: Reynold Xin r...@databricks.com Date: 2015-01-22T08:19:29Z Fixed compilation error. commit fc5acc50f3227ae90f86d6684945b200c96efced Author: Reynold Xin r...@databricks.com Date: 2015-01-22T08:44:59Z Hive module. commit feb43ef0e98d72a1372e4f3d5b1a6c811a8a13bb Author: Reynold Xin r...@databricks.com Date: 2015-01-23T08:02:09Z Made MLlib and examples compile
[GitHub] spark pull request: [SPARK-5351][GraphX] Do not use Partitioner.de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4136#issuecomment-71160721 [Test build #26004 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26004/consoleFull) for PR 4136 at commit [`0a2f32b`](https://github.com/apache/spark/commit/0a2f32b0283b4fe319a23f7f4541d1531ddcbab2). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5214][Test] Add a test to demonstrate E...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/4174 [SPARK-5214][Test] Add a test to demonstrate EventLoop can be stopped in the event loop thread You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-5214-unittest Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4174.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4174 commit f0b18f940ce4b49711b5d74c4bca4a8391241bb7 Author: zsxwing zsxw...@gmail.com Date: 2015-01-23T08:17:23Z Add a test to demonstrate EventLoop can be stopped in the event loop thread
[GitHub] spark pull request: [SPARK-4233] [SQL] WIP:Simplify the UDAF API (...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3247#issuecomment-71162789 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26013/ Test FAILed.
[GitHub] spark pull request: [SPARK-4233] [SQL] WIP:Simplify the UDAF API (...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3247#issuecomment-71162787 [Test build #26013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26013/consoleFull) for PR 3247 at commit [`feb00c8`](https://github.com/apache/spark/commit/feb00c891d3ddebc056345831d2e8a30e46d6ed4). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class UnresolvedFunction(` * `trait AggregateFunction ` * `trait AggregateExpression extends Expression with AggregateFunction ` * `abstract class UnaryAggregateExpression extends UnaryExpression with AggregateExpression ` * `case class Min(` * `case class Average(child: Expression, distinct: Boolean = false)` * `case class Max(child: Expression, distinct: Boolean = false)` * `case class Count(child: Expression)` * `case class CountDistinct(children: Seq[Expression])` * `case class Sum(child: Expression, distinct: Boolean = false)` * `case class First(child: Expression, distinct: Boolean = false)` * `case class Last(child: Expression, distinct: Boolean = false)` * `sealed case class AggregateFunctionBind(` * `sealed class InputBufferSeens(` * `sealed trait Aggregate ` * `sealed trait PreShuffle extends Aggregate ` * `sealed trait PostShuffle extends Aggregate ` * `case class AggregatePreShuffle(` * `case class AggregatePostShuffle(` * `case class DistinctAggregate(`
[GitHub] spark pull request: [SPARK-4233] [SQL] WIP:Simplify the UDAF API (...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3247#issuecomment-71162714 [Test build #26013 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26013/consoleFull) for PR 3247 at commit [`feb00c8`](https://github.com/apache/spark/commit/feb00c891d3ddebc056345831d2e8a30e46d6ed4). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5376][Mesos] MesosExecutor should have ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4170#issuecomment-71166652 [Test build #26015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26015/consoleFull) for PR 4170 at commit [`d714e8b`](https://github.com/apache/spark/commit/d714e8bb6b699e5ec2a315df65cee0f4cf7765e5). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3650][GraphX] There will be an ArrayInd...
GitHub user Leolh opened a pull request: https://github.com/apache/spark/pull/4176 [SPARK-3650][GraphX] There will be an ArrayIndexOutOfBoundsException if ... ...the format of the source file is wrong There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong You can merge this pull request into a Git repository by running: $ git pull https://github.com/Leolh/spark patch-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4176.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4176 commit 23767f1239341146df49dc4d4c4956d7a3b48e0f Author: Leolh leosand...@gmail.com Date: 2015-01-23T09:27:02Z [SPARK-3650][GraphX] There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong
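The failure mode this PR targets, and one defensive alternative, can be sketched as follows. This is an illustration only, not the actual GraphX loader code: indexing into a split line without checking its length throws ArrayIndexOutOfBoundsException on malformed input, while checking the field count first lets the caller skip or report the bad line.

```scala
// Parse one edge-list line of the form "srcId dstId"; return None for
// lines that do not have at least two whitespace-separated fields.
// (Only arity is guarded here; non-numeric fields would still throw.)
def parseEdge(line: String): Option[(Long, Long)] = {
  val parts = line.split("\\s+").filter(_.nonEmpty)
  if (parts.length >= 2) Some((parts(0).toLong, parts(1).toLong))
  else None
}

val ok = parseEdge("1 2")   // well-formed line
val bad = parseEdge("1")    // malformed: would have crashed parts(1)
```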
[GitHub] spark pull request: [SPARK-3298][SQL] Add flag control overwrite r...
Github user OopsOutOfMemory commented on the pull request: https://github.com/apache/spark/pull/4175#issuecomment-71167869 /cc @scwf @chenghao-intel
[GitHub] spark pull request: SPARK-5382: Use SPARK_CONF_DIR in spark-class ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4179#issuecomment-71294854 @andrewor14 since you reviewed the other PR for `SPARK_CONF_DIR`, can you take a quick look at this and #4177 to see if we want to pull it in for 1.2.1?
[GitHub] spark pull request: [SPARK-5351][GraphX] Do not use Partitioner.de...
Github user ankurdave commented on the pull request: https://github.com/apache/spark/pull/4136#issuecomment-71297997 Oh, thanks! Looks like that was the problem all along; stopping the SparkContext fixes the problem. I'm going to merge this with the amended test now.
[GitHub] spark pull request: [SPARK-5351][GraphX] Do not use Partitioner.de...
Github user ankurdave commented on the pull request: https://github.com/apache/spark/pull/4136#issuecomment-71286155 @JoshRosen Actually, it seems the test failures still occur, but only when I add a [unit test](https://github.com/apache/spark/commit/9b225ac3072de522b40b46aba6df1f1c231f13ef#diff-3ade47bc293ef06e43c25f1ac1f6783bR354) that sets spark.default.parallelism. Adding the test causes subsequent tests within the same run to fail with exceptions like

```
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0
```

and

```
java.io.IOException: PARSING_ERROR(2)
```

The exception traces always occur in TorrentBroadcast. It seems like setting spark.default.parallelism is causing some kind of side effect that corrupts broadcasts in later unit tests, which is strange since (1) each unit test should have its own SparkContext and therefore its own temp directory, and (2) I'm only passing spark.default.parallelism to SparkConf/SparkContext, not setting it as a system property.
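The resolution reported earlier in this thread, always stopping the per-test SparkContext, is the standard try/finally resource discipline. A hedged, Spark-free sketch of that pattern (all names here are invented stand-ins, not Spark classes):

```scala
// Stand-in for a context whose leftover state can poison later tests.
class FakeContext(val conf: Map[String, String]) {
  var stopped = false
  def stop(): Unit = stopped = true
}

// Loan pattern: the context is created per test and stopped even if the
// test body throws, so no state leaks into subsequent tests.
def withContext[T](conf: Map[String, String])(body: FakeContext => T): T = {
  val ctx = new FakeContext(conf)
  try body(ctx)
  finally ctx.stop()
}

var observed: FakeContext = null
val parallelism = withContext(Map("spark.default.parallelism" -> "8")) { ctx =>
  observed = ctx
  ctx.conf("spark.default.parallelism").toInt
}
```

The key property is that the cleanup runs on every exit path, which is exactly what a forgotten `sc.stop()` in a test suite fails to guarantee.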
[GitHub] spark pull request: Bug fix for SPARK-5242: ec2/spark_ec2.py lauc...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/4038#issuecomment-71286989 @voukka @nchammas - This high-level goal looks fine to me. However, the function get_hostname is being called on all instances (it's inside a loop) in many cases. I wonder if we can do something more lightweight by exploiting the fact that you typically want to use the same kind of resolution for all machines. What this will mean is that for the very first machine we will try all four options and then just save which field was used -- then the function just picks the appropriate field going forward. Will this solve your use case? Or are there use cases where we need to do this for every instance?
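The probe-once idea above can be sketched as follows. This is a hedged illustration with invented field names and types, not the actual spark_ec2.py code: try the candidate address fields on the first instance, cache whichever one is populated, and reuse that choice for every later instance.

```scala
// Stand-in for an EC2 instance description (field names are invented).
case class Instance(fields: Map[String, String])

val candidateFields =
  Seq("public_dns_name", "public_ip", "private_dns_name", "private_ip")

// Which field resolved on the first instance; reused for all the rest.
var chosenField: Option[String] = None

def hostname(inst: Instance): Option[String] = {
  val field = chosenField.orElse {
    // First call only: probe all candidates and cache the one that works.
    chosenField = candidateFields.find(f => inst.fields.get(f).exists(_.nonEmpty))
    chosenField
  }
  field.flatMap(inst.fields.get)
}

val a = hostname(Instance(Map("public_dns_name" -> "", "public_ip" -> "1.2.3.4")))
val b = hostname(Instance(Map("public_ip" -> "5.6.7.8")))  // no re-probing
```

This trades one probing pass on the first machine for a single map lookup on each subsequent one, matching the assumption that all instances in a cluster resolve the same way.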
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3986#issuecomment-71289144 @nchammas @GenTang - The `logging.basicConfig` seems to have been around since the very beginning [1]. I don't know much about Python so I can't recommend keeping it or removing it. @JoshRosen can comment on that. Other than that this solution looks fine to me. It is unfortunate that we have so many custom sleep calls across the file, but I don't think there is much else we can do given the EC2 API we have right now. [1] https://github.com/mesos/spark/blob/08c50ad1fcf323f62c80dfeb8f1caaf164211e0b/ec2/spark_ec2.py#L538
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-71290964 [Test build #26036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26036/consoleFull) for PR 1290 at commit [`d18e9b5`](https://github.com/apache/spark/commit/d18e9b5460019970d5bcbb5a0e816aff5a05bf39). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class OutputCanvas2D(wd: Int, ht: Int) extends Canvas ` * `class OutputFrame2D( title: String ) extends Frame( title ) ` * `class OutputCanvas3D(wd: Int, ht: Int, shadowFrac: Double) extends Canvas ` * `class OutputFrame3D(title: String, shadowFrac: Double) extends Frame(title) ` * `trait ANNClassifierHelper `
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-71290975 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26036/
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3884#issuecomment-71292584 @vanzin Thanks for looking this over. The Python `RDD` objects themselves are never actually serialized and are used internally in a way that's slightly different than in Scala/Java Spark. In the existing code, any attempt to serialize instances of those Python classes throws an exception in the `__getnewargs__` method, which is why I was able to add new exceptions there. I'm going to fix the spelling error, take one final look over this, and commit it so we can get it into the first 1.2.1 RC. I saw a couple of mailing list questions yesterday that could have been prevented by this patch, which illustrates why I really want to get this into our next maintenance release.
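The `__getnewargs__` trick described here can be sketched as follows; the class and message below are illustrative stand-ins, not PySpark's actual code:

```python
import pickle

class FakeRDD:
    """Illustrative stand-in for PySpark's RDD (not the real class).

    Pickle (protocol 2 and above) consults __getnewargs__ while
    serializing an object, so raising here turns any accidental
    serialization attempt into a descriptive error instead of a
    confusing low-level one.
    """
    def __getnewargs__(self):
        raise RuntimeError(
            "It appears that you are attempting to broadcast an RDD or "
            "reference an RDD from an action or transformation.")

try:
    pickle.dumps(FakeRDD())
except RuntimeError as exc:
    print("serialization blocked:", exc)
```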
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/3884#issuecomment-71298244 Thank you @JoshRosen for working on usability issues like this.
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user ilganeli closed the pull request at: https://github.com/apache/spark/pull/3518
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user ilganeli commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-71299062 Hey @pwendell - not a problem. The solutions are similar but Reynold's has fewer moving parts. I appreciate the recognition.
[GitHub] spark pull request: [SPARK-5351][GraphX] Do not use Partitioner.de...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4136#issuecomment-71293352 @ankurdave The exception from the new unit test sounds suspiciously similar to https://issues.apache.org/jira/browse/SPARK-4133. Your new test creates a new `sc` local variable and never stops it, so if that test runs first then its leaked context will keep running and will interfere with contexts created in the other tests. Because some SparkSQL tests could not pass without it, our unit tests set `spark.driver.allowMultipleContexts=true` to disable the multiple-contexts check, so this might be hard to notice. If you have `unit-tests.log`, though, I'd take a look to see whether there are any warning messages about multiple contexts. I'd check whether those failures still persist after properly cleaning up the SparkContext created in your new test.
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3884#issuecomment-71294206 [Test build #26037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26037/consoleFull) for PR 3884 at commit [`a38774b`](https://github.com/apache/spark/commit/a38774b8892a85184520078a2187e9ce2a190038). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3884#issuecomment-71297060 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26037/
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3884#issuecomment-71297055 [Test build #26037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26037/consoleFull) for PR 3884 at commit [`a38774b`](https://github.com/apache/spark/commit/a38774b8892a85184520078a2187e9ce2a190038).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5351][GraphX] Do not use Partitioner.de...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4136
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3884
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3884#issuecomment-71295041 Alright, I've merged this into `master` (1.3.0) and `branch-1.2` (1.2.1).
[GitHub] spark pull request: [SPARK-5207] [MLLIB] StandardScalerModel mean ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4140#issuecomment-71297633 [Test build #26038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26038/consoleFull) for PR 4140 at commit [`997d2e0`](https://github.com/apache/spark/commit/997d2e0a3bbfd1be6c0a556393bbcfbd18404f77). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5207] [MLLIB] StandardScalerModel mean ...
Github user ogeagla commented on the pull request: https://github.com/apache/spark/pull/4140#issuecomment-71297662 @dbtsai that makes sense. I've changed this back in the latest commit.
[GitHub] spark pull request: [SPARK-5351][GraphX] Do not use Partitioner.de...
Github user ankurdave commented on the pull request: https://github.com/apache/spark/pull/4136#issuecomment-71298562 Merged into master and branch-1.2.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71159715 [Test build #26008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26008/consoleFull) for PR 4173 at commit [`feb43ef`](https://github.com/apache/spark/commit/feb43ef0e98d72a1372e4f3d5b1a6c811a8a13bb). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5374][CORE] abstract RDD's DAG graph it...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4134#issuecomment-71159737 Thanks for doing it. I took a quick look at this. While it does reduce the LOC, I feel the change is not necessary and actually makes the code harder to understand with the closures. Do we really want something like this?
[GitHub] spark pull request: [SPARK-5214][Test] Add a test to demonstrate E...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4174#issuecomment-71161124 [Test build #26009 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26009/consoleFull) for PR 4174 at commit [`7aaa2d7`](https://github.com/apache/spark/commit/7aaa2d73d559ef6f0b2a18f14800727994e39a4e). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71161215 [Test build #26010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26010/consoleFull) for PR 4173 at commit [`1532e1e`](https://github.com/apache/spark/commit/1532e1e97209b200a03e9a093de289228e77a288).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71161217 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26010/
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71162544 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26011/
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r23437961
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -106,7 +106,22 @@ private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) ex
     if (interruptThread && taskThread != null) {
       taskThread.interrupt()
     }
-  }
+  }
+
+  override def hashCode(): Int = {
+    val state = Seq(stageId, partitionId)
+    state.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b)
--- End diff --
Maybe a better way is `(stageId + partitionId) * (stageId + partitionId + 1) / 2 + partitionId`. See http://en.wikipedia.org/wiki/Pairing_function#Cantor_pairing_function
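A quick check (in Python, for illustration only) that the suggested Cantor pairing formula assigns distinct codes to distinct `(stageId, partitionId)` pairs:

```python
def cantor_pair(stage_id, partition_id):
    # Cantor pairing function: a bijection from pairs of non-negative
    # integers to non-negative integers, so distinct (stageId, partitionId)
    # pairs never collide (until the result overflows a JVM Int).
    s = stage_id + partition_id
    return s * (s + 1) // 2 + partition_id

codes = {cantor_pair(s, p) for s in range(100) for p in range(100)}
print(len(codes))  # 10000 distinct codes for 10000 distinct pairs
```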
[GitHub] spark pull request: [SPARK-3298][SQL] Add flag control overwrite r...
GitHub user OopsOutOfMemory opened a pull request: https://github.com/apache/spark/pull/4175 [SPARK-3298][SQL] Add flag control overwrite registerAsTable / registerTempTable https://issues.apache.org/jira/browse/SPARK-3298 Add a flag `allowOverwrite` to control `registerTempTable`. By default it is `true`, meaning registering a table will overwrite any previous table of the same name (like a `var` tempTable). If set to `false`, the registerTempTable command checks whether the table name already exists and, if it does, throws a table-already-exists exception; you should then drop the table first and register it again (like a `final` tempTable). You can merge this pull request into a Git repository by running: $ git pull https://github.com/OopsOutOfMemory/spark register Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4175.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4175
commit 49613a2f9dbd53c189cc54991f778bc55c1ec918 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-23T08:09:38Z initial commit
commit 6fb569451dd0b880f9865a67c2851071dba59fdb Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-23T09:09:00Z refine test suite; correct inconsistency
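A hypothetical sketch of the proposed semantics (the class, method, and message below are illustrative, not Spark's actual catalog API):

```python
class TempTableCatalog:
    """Illustrative model of the proposed registerTempTable behavior;
    this is not Spark's real catalog code."""

    def __init__(self):
        self._tables = {}

    def register_temp_table(self, name, table, allow_overwrite=True):
        # The default (True) matches today's behavior: re-registering a
        # name silently replaces the previous table. With False, an
        # existing name is an error and must be dropped first.
        if not allow_overwrite and name in self._tables:
            raise ValueError(
                f"Temporary table '{name}' already exists; "
                f"drop it before registering it again.")
        self._tables[name] = table

catalog = TempTableCatalog()
catalog.register_temp_table("people", "v1")
catalog.register_temp_table("people", "v2")  # overwrite allowed by default
try:
    catalog.register_temp_table("people", "v3", allow_overwrite=False)
except ValueError as exc:
    print(exc)
```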
[GitHub] spark pull request: [SPARK-5262] [SQL] coalesce should allow NullT...
Github user adrian-wang commented on the pull request: https://github.com/apache/spark/pull/4057#issuecomment-71167347 Yes, I moved my work to FunctionArgumentConversion, and since #4040 is reverted due to conflicts, I added the code together here. So I leave Coalesce() untouched, since we would have the same type in Coalesce for sure. I'll change the title accordingly.
[GitHub] spark pull request: [Minor][streaming][MQTT streaming] some trivia...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4178#issuecomment-71181359 [Test build #26020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26020/consoleFull) for PR 4178 at commit [`5857989`](https://github.com/apache/spark/commit/5857989426db9cc51e34bf09942101750fff60ea).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4158#issuecomment-71181584 @chenghao-intel overall it looks good to me, aside from some small comments.