[GitHub] spark issue #12258: [SPARK-14485][CORE] ignore task finished for executor lo...
Github user zhonghaihua commented on the issue: https://github.com/apache/spark/pull/12258

@vanzin my JIRA username is `iward`. Thanks a lot.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12258: [SPARK-14485][CORE] ignore task finished for executor lo...
Github user zhonghaihua commented on the issue: https://github.com/apache/spark/pull/12258

@vanzin retest this please.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r66000194

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
       }
       taskIdToTaskSetManager.get(tid) match {
         case Some(taskSet) =>
+          var executorId: String = null
           if (TaskState.isFinished(state)) {
             taskIdToTaskSetManager.remove(tid)
             taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
               if (executorIdToTaskCount.contains(execId)) {
                 executorIdToTaskCount(execId) -= 1
               }
             }
           }
           if (state == TaskState.FINISHED) {
+            // In some case, executor has already removed by driver for heartbeats timeout, but
+            // at sometime, before executor killed by cluster, the task of running on this
+            // executor is finished and return task success state to driver. However, this kinds
+            // of task should be ignored, because the task on this executor is already re-queued
+            // by driver. For more details, can check in SPARK-14485.
             taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            if (executorId != null && !executorIdToTaskCount.contains(executorId)) {
+              logInfo(
+                ("Ignoring update with state %s for TID %s because its executor has already " +
--- End diff --

Yeah, thanks a lot.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r65997697

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
       }
       taskIdToTaskSetManager.get(tid) match {
         case Some(taskSet) =>
+          var executorId: String = null
           if (TaskState.isFinished(state)) {
             taskIdToTaskSetManager.remove(tid)
             taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
               if (executorIdToTaskCount.contains(execId)) {
                 executorIdToTaskCount(execId) -= 1
               }
             }
           }
           if (state == TaskState.FINISHED) {
+            // In some case, executor has already removed by driver for heartbeats timeout, but
+            // at sometime, before executor killed by cluster, the task of running on this
+            // executor is finished and return task success state to driver. However, this kinds
+            // of task should be ignored, because the task on this executor is already re-queued
+            // by driver. For more details, can check in SPARK-14485.
             taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            if (executorId != null && !executorIdToTaskCount.contains(executorId)) {
+              logInfo(
+                ("Ignoring update with state %s for TID %s because its executor has already " +
--- End diff --

I see, thanks for your comments, @vanzin. Also, following your point, I think this log https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L371 should also be changed to use string interpolation, right?
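The style point raised in this thread is the difference between positional format strings and Scala's s-interpolator in log messages. The following is a minimal standalone sketch of the contrast (not Spark's actual code; the message text is only illustrative):

```scala
// Contrast the format-string style flagged in review with Scala string
// interpolation, which the reviewer preferred: the interpolated form reads
// inline and cannot get the argument order or arity wrong.
object LogStyleSketch {
  def main(args: Array[String]): Unit = {
    val state = "FINISHED"
    val tid = 42L
    // Older style: positional format string, arguments supplied separately.
    val formatted = ("Ignoring update with state %s for TID %s because its executor " +
      "has already been removed by driver").format(state, tid)
    // Preferred style: s-interpolation embeds the values directly.
    val interpolated = s"Ignoring update with state $state for TID $tid because its " +
      "executor has already been removed by driver"
    assert(formatted == interpolated) // both styles render the same message
    println("ok")
  }
}
```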
[GitHub] spark issue #12258: [SPARK-14485][CORE] ignore task finished for executor lo...
Github user zhonghaihua commented on the issue: https://github.com/apache/spark/pull/12258

@vanzin @andrewor14 Could someone review this patch? Any thoughts or concerns? Thanks a lot.
[GitHub] spark issue #12258: [SPARK-14485][CORE] ignore task finished for executor lo...
Github user zhonghaihua commented on the issue: https://github.com/apache/spark/pull/12258

Hi @vanzin, the code has been updated; could you review it, please?
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r65797802

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
       }
       taskIdToTaskSetManager.get(tid) match {
         case Some(taskSet) =>
+          var executorId: String = null
           if (TaskState.isFinished(state)) {
             taskIdToTaskSetManager.remove(tid)
             taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
               if (executorIdToTaskCount.contains(execId)) {
                 executorIdToTaskCount(execId) -= 1
               }
             }
           }
           if (state == TaskState.FINISHED) {
-            taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            // In some case, executor has already removed by driver for heartbeats timeout, but
+            // at sometime, before executor killed by cluster, the task of running on this
+            // executor is finished and return task success state to driver. However, this kinds
+            // of task should be ignored, because the task on this executor is already re-queued
+            // by driver. For more details, can check in SPARK-14485.
+            if (executorId.ne(null) && !executorIdToTaskCount.contains(executorId)) {
+              taskSet.removeRunningTask(tid)
+              logWarning(
--- End diff --

Yes, you are right. I will change it soon, thanks.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r6579

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
       }
       taskIdToTaskSetManager.get(tid) match {
         case Some(taskSet) =>
+          var executorId: String = null
           if (TaskState.isFinished(state)) {
             taskIdToTaskSetManager.remove(tid)
             taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
               if (executorIdToTaskCount.contains(execId)) {
                 executorIdToTaskCount(execId) -= 1
               }
             }
           }
           if (state == TaskState.FINISHED) {
-            taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            // In some case, executor has already removed by driver for heartbeats timeout, but
+            // at sometime, before executor killed by cluster, the task of running on this
+            // executor is finished and return task success state to driver. However, this kinds
+            // of task should be ignored, because the task on this executor is already re-queued
+            // by driver. For more details, can check in SPARK-14485.
+            if (executorId.ne(null) && !executorIdToTaskCount.contains(executorId)) {
+              taskSet.removeRunningTask(tid)
--- End diff --

I see. I will fix it soon.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r65797773

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
       }
       taskIdToTaskSetManager.get(tid) match {
         case Some(taskSet) =>
+          var executorId: String = null
           if (TaskState.isFinished(state)) {
             taskIdToTaskSetManager.remove(tid)
             taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
               if (executorIdToTaskCount.contains(execId)) {
                 executorIdToTaskCount(execId) -= 1
               }
             }
           }
           if (state == TaskState.FINISHED) {
-            taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            // In some case, executor has already removed by driver for heartbeats timeout, but
+            // at sometime, before executor killed by cluster, the task of running on this
+            // executor is finished and return task success state to driver. However, this kinds
+            // of task should be ignored, because the task on this executor is already re-queued
+            // by driver. For more details, can check in SPARK-14485.
+            if (executorId.ne(null) && !executorIdToTaskCount.contains(executorId)) {
--- End diff --

Thanks for your comments. I will fix it soon.
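The review point in these threads is that the task must always leave the running set, even when its stale result is dropped; only the result handling is skipped for a removed executor. A simplified standalone model of that control flow (stand-ins for Spark's TaskSetManager and TaskResultGetter, not the real classes):

```scala
// Model of the flow after addressing the review: removeRunningTask happens
// unconditionally; enqueueing the result is skipped for a removed executor,
// because the task was already re-queued by the driver.
object FinishedTaskFlowSketch {
  val running = scala.collection.mutable.Set(1L, 2L)     // TIDs still running
  val enqueued = scala.collection.mutable.Buffer[Long]() // results handed off
  val liveExecutors = Set("1")                           // executors the driver still tracks

  def onFinished(tid: Long, executorId: String): Unit = {
    running -= tid // always executed, even for a removed executor
    if (executorId != null && !liveExecutors.contains(executorId)) {
      () // removed executor: ignore the stale result
    } else {
      enqueued += tid // live executor: hand the result to the result getter
    }
  }

  def main(args: Array[String]): Unit = {
    onFinished(1L, "1") // executor still live
    onFinished(2L, "2") // executor already removed
    assert(running.isEmpty)     // both tasks left the running set
    assert(enqueued == Seq(1L)) // only the live executor's result was kept
    println("ok")
  }
}
```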
[GitHub] spark issue #12258: [SPARK-14485][CORE] ignore task finished for executor lo...
Github user zhonghaihua commented on the issue: https://github.com/apache/spark/pull/12258

@srowen @JoshRosen @andrewor14 Could someone verify this PR? Thanks a lot.
[GitHub] spark pull request: [SPARK-14485][CORE] ignore task finished for e...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/12258#issuecomment-208881577

@JoshRosen @andrewor14 Could you verify this PR? Thanks a lot.
[GitHub] spark pull request: [SPARK-14485][CORE] ignore task finished for e...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/12258#issuecomment-207296818

cc @andrewor14
[GitHub] spark pull request: Ignore task finish for executor lost and remov...
GitHub user zhonghaihua opened a pull request: https://github.com/apache/spark/pull/12258

Ignore task finish for executor lost and removed by driver

## What changes were proposed in this pull request?

Currently, when an executor is removed by the driver because of a heartbeat timeout, the driver re-queues the tasks running on that executor and sends a kill command to the cluster manager to kill the executor. However, a task running on that executor may finish and return its result to the driver before the executor is actually killed. In that situation, the driver accepts the task-finished event and ignores the speculative and re-queued copies of the task. But since the executor has already been removed by the driver, the result of the finished task cannot be saved: its BlockManagerId has also been removed from BlockManagerMaster. The output of the stage is therefore incomplete, which later causes a fetch failure. For more details, see [SPARK-14485](https://issues.apache.org/jira/browse/SPARK-14485).

This PR introduces a mechanism to ignore this kind of task-finished event.

## How was this patch tested?

None.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhonghaihua/spark ignoreTaskFinishForExecutorLostAndRemovedByDriver

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12258.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12258

commit d491ecd845e41e96aa94f36655860abb0dc70947
Author: zhonghaihua <793507...@qq.com>
Date: 2016-04-08T06:16:05Z

    ignore task finished when its executor has already removed by driver

commit c85cf990e84419fa10e3e78e2a32d6e015e46435
Author: zhonghaihua <793507...@qq.com>
Date: 2016-04-08T07:40:07Z

    add comment
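The mechanism the PR proposes can be reduced to one predicate: a FINISHED update is ignored when the reporting executor is no longer tracked by the driver. A minimal standalone sketch of that guard, using a simplified stand-in for TaskSchedulerImpl's `executorIdToTaskCount` map (not the real class):

```scala
// Sketch of the SPARK-14485 guard: an executor removed for heartbeat timeout
// has no entry in executorIdToTaskCount, so a success report from it is stale
// (the driver already re-queued the task) and must be ignored.
object StatusUpdateSketch {
  // executorId -> number of running tasks; removed executors have no entry.
  val executorIdToTaskCount = scala.collection.mutable.HashMap[String, Int]()

  def shouldIgnoreFinishedTask(executorId: String): Boolean =
    executorId != null && !executorIdToTaskCount.contains(executorId)

  def main(args: Array[String]): Unit = {
    executorIdToTaskCount("1") = 2
    assert(!shouldIgnoreFinishedTask("1")) // live executor: process the result
    assert(shouldIgnoreFinishedTask("2"))  // removed executor: drop the stale result
    println("ok")
  }
}
```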
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-204715025

Hi @tgravescs, my JIRA id is `Iward`. Thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-204288682

@andrewor14 @tgravescs @vanzin The code and comments have been cleaned up, and the descriptions of this PR and the JIRA have been updated. Please review again. Thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r58042774

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
     new ConcurrentHashMap[ContainerId, java.lang.Boolean])

   @volatile private var numExecutorsRunning = 0
-  // Used to generate a unique ID per executor
-  private var executorIdCounter = 0
+
+  /**
+   * Used to generate a unique ID per executor
+   *
+   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
+   * the id of new executor will start from 1, this will conflict with the executor has
--- End diff --

@tgravescs OK, I will do it soon. Thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r58032934

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
@@ -155,6 +158,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
       // in this block are read when requesting executors
       CoarseGrainedSchedulerBackend.this.synchronized {
         executorDataMap.put(executorId, data)
+        if (currentExecutorIdCounter < Integer.parseInt(executorId)) {
+          currentExecutorIdCounter = Integer.parseInt(executorId)
+        }
--- End diff --

@vanzin Thanks for your comments. I will optimize it.
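The optimization the reviewer asked for in the diff above is presumably to avoid calling `Integer.parseInt` twice on the same string. A hedged standalone sketch of that cleanup (the surrounding class is simplified, not the real CoarseGrainedSchedulerBackend):

```scala
// Sketch of the suggested cleanup: parse the executor id once, then compare
// and assign, keeping currentExecutorIdCounter at the max id seen so far.
object ExecutorIdCounterSketch {
  var currentExecutorIdCounter = 0

  def onRegister(executorId: String): Unit = {
    val id = executorId.toInt // parse once instead of twice
    if (currentExecutorIdCounter < id) {
      currentExecutorIdCounter = id
    }
  }

  def main(args: Array[String]): Unit = {
    Seq("1", "3", "2").foreach(onRegister)
    assert(currentExecutorIdCounter == 3) // tracks the max registered id
    println("ok")
  }
}
```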
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r58009178

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
@@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages {

   case object RetrieveSparkProps extends CoarseGrainedClusterMessage

+  case object RetrieveMaxExecutorId extends CoarseGrainedClusterMessage
--- End diff --

@tgravescs OK, I get your meaning. Thanks a lot. Is the name `RetrieveLastAllocatedExecutorId` OK? @vanzin What's your opinion?
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r57838823

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
     new ConcurrentHashMap[ContainerId, java.lang.Boolean])

   @volatile private var numExecutorsRunning = 0
-  // Used to generate a unique ID per executor
-  private var executorIdCounter = 0
+
+  /**
+   * Used to generate a unique ID per executor
+   *
+   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
+   * the id of new executor will start from 1, this will conflict with the executor has
--- End diff --

I think we can clarify this in the SPARK-12864 issue. @andrewor14 What's your opinion?
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r57838667

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
     new ConcurrentHashMap[ContainerId, java.lang.Boolean])

   @volatile private var numExecutorsRunning = 0
-  // Used to generate a unique ID per executor
-  private var executorIdCounter = 0
+
+  /**
+   * Used to generate a unique ID per executor
+   *
+   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
+   * the id of new executor will start from 1, this will conflict with the executor has
+   * already created before. So, we should initialize the `executorIdCounter` by getting
+   * the max executorId from driver.
+   *
+   * @see SPARK-12864
+   */
+  private var executorIdCounter: Int = {
--- End diff --

@tgravescs Thanks for your comment. Yes, I think in this situation, when the AM comes up, if an executor has already registered, the executor's information must already be in `executorDataMap`. And the executor registration process has a check to make sure we don't allow an executor to register with the same id; you can see it at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L139
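The duplicate-registration check referenced above can be modeled in a few lines. This is a simplified stand-in for the real guard in CoarseGrainedSchedulerBackend (names and return value are illustrative, not Spark's actual API):

```scala
// Model of the guard: an executor id already present in executorDataMap
// cannot register again, so duplicate ids are rejected at registration time.
object RegisterExecutorSketch {
  val executorDataMap = scala.collection.mutable.HashMap[String, String]()

  def register(executorId: String, host: String): Boolean =
    if (executorDataMap.contains(executorId)) {
      false // duplicate id: registration refused
    } else {
      executorDataMap(executorId) = host
      true
    }

  def main(args: Array[String]): Unit = {
    assert(register("1", "hostA"))
    assert(!register("1", "hostB")) // same id cannot register twice
    println("ok")
  }
}
```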
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r57837944

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
@@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages {

   case object RetrieveSparkProps extends CoarseGrainedClusterMessage

+  case object RetrieveMaxExecutorId extends CoarseGrainedClusterMessage
--- End diff --

@tgravescs Thanks for reviewing. This variable holds the max executor id from the previous AM, and it is only retrieved while initializing the new AM. Our intention is to get the max executor id across all executors of the previous AM, so I think `maxExecutorId` is OK. @andrewor14 What's your opinion?
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-201880646

@andrewor14 @tgravescs @vanzin Could you verify this PR? Any thoughts or concerns? Thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-191830328

Hi @andrewor14, any thoughts or concerns about this patch?
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-189095532

Hi @andrewor14, could you review this? Thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r54094456

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
@@ -78,6 +78,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   // Executors that have been lost, but for which we don't yet know the real exit reason.
   protected val executorsPendingLossReason = new HashSet[String]

+  // The num of current max ExecutorId used to re-register appMaster
+  var currentExecutorIdCounter = 0
--- End diff --

Hi @jerryshao, you are right. I am fixing it now. Thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-188730628

retest this please.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r54063643

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
     new ConcurrentHashMap[ContainerId, java.lang.Boolean])

   @volatile private var numExecutorsRunning = 0
-  // Used to generate a unique ID per executor
-  private var executorIdCounter = 0
+
+  /**
+   * Used to generate a unique ID per executor
+   *
+   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
+   * the id of new executor will start from 1, this will conflict with the executor has
+   * already created before. So, we should initialize the `executorIdCounter` by getting
+   * the max executorId from driver.
+   *
+   * @see SPARK-12864
+   */
+  private var executorIdCounter: Int = {
+    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
--- End diff --

I see, thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r54062244

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
     new ConcurrentHashMap[ContainerId, java.lang.Boolean])

   @volatile private var numExecutorsRunning = 0
-  // Used to generate a unique ID per executor
-  private var executorIdCounter = 0
+
+  /**
+   * Used to generate a unique ID per executor
+   *
+   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
+   * the id of new executor will start from 1, this will conflict with the executor has
+   * already created before. So, we should initialize the `executorIdCounter` by getting
+   * the max executorId from driver.
+   *
+   * @see SPARK-12864
+   */
+  private var executorIdCounter: Int = {
+    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
--- End diff --

Hi @jerryshao, I just tested it: it works in yarn-cluster mode. From my understanding, `driverRef` is created by `ApplicationMaster`, so I think the `driverEndpoint` has already started at this point. What's your opinion? Thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r54060168 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator( new ConcurrentHashMap[ContainerId, java.lang.Boolean]) @volatile private var numExecutorsRunning = 0 - // Used to generate a unique ID per executor - private var executorIdCounter = 0 + + /** + * Used to generate a unique ID per executor + * + * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then + * the id of new executor will start from 1, this will conflict with the executor has + * already created before. So, we should initialize the `executorIdCounter` by getting + * the max executorId from driver. + * + * @see SPARK-12864 + */ + private var executorIdCounter: Int = { +driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1 --- End diff -- Yes, but only in yarn-client mode. Is it necessary to test in yarn-cluster mode as well? @jerryshao
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r54058736 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator( new ConcurrentHashMap[ContainerId, java.lang.Boolean]) @volatile private var numExecutorsRunning = 0 - // Used to generate a unique ID per executor - private var executorIdCounter = 0 + + /** + * Used to generate a unique ID per executor + * + * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then + * the id of new executor will start from 1, this will conflict with the executor has + * already created before. So, we should initialize the `executorIdCounter` by getting + * the max executorId from driver. + * + * @see SPARK-12864 + */ + private var executorIdCounter: Int = { +driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1 --- End diff -- Hi @jerryshao , thanks for reviewing. Allocating a new executor won't execute that code; it only requests the executor ID when the `YarnAllocator` is created.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r54058368 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -78,6 +78,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp // Executors that have been lost, but for which we don't yet know the real exit reason. protected val executorsPendingLossReason = new HashSet[String] + // The num of current max ExecutorId used to re-register appMaster + var currentExecutorIdCounter = 0 --- End diff -- Hi @jerryshao , thanks for your comments. The master branch differs from branch-1.5.x here. In master, `CoarseGrainedSchedulerBackend` belongs to the `core` module and `YarnSchedulerBackend` belongs to the `yarn` module, while in branch-1.5.x they are in the same package. So, from my understanding, `protected` is unsuitable here, right?
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r53919614 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -169,6 +172,24 @@ private[yarn] class YarnAllocator( } /** + * Init `executorIdCounter` + */ + def initExecutorIdCounter(): Unit = { +val port = sparkConf.getInt("spark.yarn.am.port", 0) +SparkHadoopUtil.get.runAsSparkUser { () => + val init = RpcEnv.create( +"executorIdCounterInit", +Utils.localHostName, +port, +sparkConf, +new SecurityManager(sparkConf)) + val driver = init.setupEndpointRefByURI(driverUrl) --- End diff -- Hi @jerryshao , thanks for your comments. I see what you mean; I will fix it soon. Thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-186993839 Hi @andrewor14 , the test failure seems to be caused by a `GitException`. Could you retest it? Thanks a lot.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r53579600 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -169,6 +172,24 @@ private[yarn] class YarnAllocator( } /** + * Init `executorIdCounter` + */ + def initExecutorIdCounter(): Unit = { +val port = sparkConf.getInt("spark.yarn.am.port", 0) +SparkHadoopUtil.get.runAsSparkUser { () => + val init = RpcEnv.create( +"executorIdCounterInit", +Utils.localHostName, +port, +sparkConf, +new SecurityManager(sparkConf)) + val driver = init.setupEndpointRefByURI(driverUrl) --- End diff -- Hi @andrewor14 , `driverRef` doesn't work in this case. From my understanding, the `driverRef` whose endpoint name is `YarnScheduler` sends messages to `YarnSchedulerEndpoint` (or gets messages from it), while we need to get the max executor ID from `CoarseGrainedSchedulerBackend.DriverEndpoint`, whose endpoint name is `CoarseGrainedScheduler`. So, I think we need a method to initialize `executorIdCounter`. And as you said, we should add a huge comment related to SPARK-12864 to explain why we need to do this in that method. What's your opinion?
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r53562669 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -155,6 +158,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp // in this block are read when requesting executors CoarseGrainedSchedulerBackend.this.synchronized { executorDataMap.put(executorId, data) +if (currentExecutorIdCounter < Integer.parseInt(executorId)) { + currentExecutorIdCounter = Integer.parseInt(executorId) +} --- End diff -- Thanks for reviewing it. From my understanding, we cannot get the max executor ID from executorDataMap. When the AM fails, all the executors are disconnected and removed; at that point, as the code in `CoarseGrainedSchedulerBackend.removeExecutor` shows, the executor information in executorDataMap is also removed. So the executor information in executorDataMap is not complete. What do you think?
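The diff above tracks the maximum in a separate field precisely because executorDataMap entries disappear on removal. A hypothetical, simplified sketch of that counter-update logic (names invented, not Spark's actual class):

```scala
// Sketch only: keep a running maximum of registered executor IDs in a field
// that is independent of the executor map, so the value survives even after
// executors are removed when the ApplicationMaster fails.
object MaxExecutorIdTracker {
  private var currentExecutorIdCounter = 0

  // Called on each executor registration; Spark executor IDs arrive as strings.
  def onRegister(executorId: String): Unit = synchronized {
    val id = executorId.toInt
    if (currentExecutorIdCounter < id) {
      currentExecutorIdCounter = id
    }
  }

  def max: Int = synchronized { currentExecutorIdCounter }

  def main(args: Array[String]): Unit = {
    Seq("1", "3", "2").foreach(onRegister)
    // "3" stays the recorded max even if that executor is later removed.
    assert(max == 3)
  }
}
```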
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/10794#discussion_r53562543 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala --- @@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages { case object RetrieveSparkProps extends CoarseGrainedClusterMessage + case object RetrieveCurrentExecutorIdCounter extends CoarseGrainedClusterMessage --- End diff -- @andrewor14 Yeah, you are right. I will fix it soon.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-186016863 Hi @andrewor14 , I agree with @jerryshao ; I think that is not related to it.
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-185625228 @andrewor14 @marmbrus @rxin , any thoughts or concerns about this patch?
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-179924619 @andrewor14 Thanks for reviewing it. Could this patch be merged to master?
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-174830247 @marmbrus @liancheng @yhuai Could you verify this patch?
[GitHub] spark pull request: initialize executorIdCounter after Application...
GitHub user zhonghaihua opened a pull request: https://github.com/apache/spark/pull/10794 initialize executorIdCounter after ApplicationMaster killed for max n… Currently, when the number of executor failures reaches `maxNumExecutorFailures`, the `ApplicationMaster` is killed and another one is re-registered. At that point a new instance of `YarnAllocator` is created, but the `executorIdCounter` property in `YarnAllocator` resets to `0`. The IDs of new executors then start from `1`, which conflicts with the executors created before and causes FetchFailedException. This PR introduces a mechanism to initialize `executorIdCounter` after the `ApplicationMaster` is killed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhonghaihua/spark initExecutorIdCounterAfterAMKilled Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10794.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10794 commit 30048ac7ac9fc95edc1b936076415dea335848ef Author: zhonghaihua <793507...@qq.com> Date: 2016-01-17T12:46:44Z initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-172323185 cc @rxin @marmbrus @chenghao-intel @jeanlyn could you give some advice?
[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169231327 @marmbrus Thanks for your suggestions. I think your idea can solve the problem simply, but in some situations it seems unsuitable. For example: `Join(testRelation, testRelation2, Inner, Some(And(EqualTo(a, b), EqualTo(Rand(33) * c, d))))` If `c` is an attribute belonging to `testRelation2`, I think `Rand(33)` is more appropriately pulled out to the right child of the `Join`; if it belongs to `testRelation`, it should be pulled out to the left child. When a nondeterministic expression is used with a table attribute, I think where to pull it out should depend on that attribute. What do you think?
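The side-selection idea discussed above — push a nondeterministic expression into whichever Join child produces the column it references — can be illustrated with a toy helper. This is not Catalyst code; all names are invented for the sketch:

```scala
// Toy illustration: decide which Join child a nondeterministic expression
// belongs to, based on the attributes it references.
object JoinSideChooser {
  sealed trait Side
  case object LeftChild extends Side
  case object RightChild extends Side

  // referenced: attribute names the expression refers to;
  // leftAttrs / rightAttrs: output attributes of each Join child.
  def sideFor(referenced: Set[String],
              leftAttrs: Set[String],
              rightAttrs: Set[String]): Option[Side] =
    if (referenced.subsetOf(leftAttrs)) Some(LeftChild)
    else if (referenced.subsetOf(rightAttrs)) Some(RightChild)
    else None // spans both sides: cannot be pulled into a single child

  def main(args: Array[String]): Unit = {
    // `Rand(33) * c` references only `c`; if `c` comes from the right-hand
    // relation, the expression is pulled into the right child.
    assert(sideFor(Set("c"), Set("a"), Set("b", "c", "d")) == Some(RightChild))
  }
}
```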
[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-168892937 @rxin @yhuai Could you verify this patch?
[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-161969421 @cloud-fan Thanks for your advice. But, as @jeanlyn said, `Repartition` deals with all the data, while this PR only deals with the join keys that cause data skew. In some situations we use this operator to avoid data skew in SQL, so I think maybe we should support this. What do you think?
[GitHub] spark pull request: pull out nondeterministic expressions from Joi...
GitHub user zhonghaihua opened a pull request: https://github.com/apache/spark/pull/10128 pull out nondeterministic expressions from Join Currently, nondeterministic expressions are only allowed in `Project` or `Filter`, and they can only be pulled out of a `UnaryNode`. But in many cases we use nondeterministic expressions on join keys to avoid data skew, for example: ``` select * from tableA a join (select * from tableB) b on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( (-rand() * 1000 ) as string ) else a.brand_code end )) = b.brand_code ``` This PR introduces a mechanism to pull out nondeterministic expressions from `Join`, so we can use nondeterministic expressions in `Join` appropriately. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhonghaihua/spark pulloutJoinNondeterministic Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10128.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10128 commit 6e166578a5c1a1faf260389509663ac8c71ec015 Author: zhonghaihua <793507...@qq.com> Date: 2015-11-30T07:44:49Z pull out nondeterministic expressions from Join
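The SQL in the PR description salts null/empty join keys with a random negative value so the skewed null keys spread across partitions instead of all hashing to one reducer. A plain-Scala illustration of that trick (not part of the PR itself):

```scala
import scala.util.Random

// Sketch of the salting idea: a null/empty key is replaced by a strictly
// negative random number, which can never match a real brand_code on the
// other side of the join, so the rows simply fail to match, as before,
// but are no longer co-located in a single partition.
object SaltNullKeys {
  def saltKey(brandCode: String, rnd: Random): String =
    if (brandCode == null || brandCode.isEmpty)
      (-(rnd.nextInt(999) + 1)).toString // in [-999, -1]: cannot collide
    else brandCode.toUpperCase

  def main(args: Array[String]): Unit = {
    val rnd = new Random(33)
    assert(saltKey("abc", rnd) == "ABC")  // real keys pass through uppercased
    assert(saltKey("", rnd).toInt < 0)    // empty keys get a negative salt
    assert(saltKey(null, rnd).toInt < 0)  // null keys likewise
  }
}
```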
[GitHub] spark pull request: pull out nondeterministic expressions from Joi...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-161590938 @rxin @cloud-fan @chenghao-intel @jeanlyn Could you give some suggestions on this PR?
[GitHub] spark pull request: Join nondeterministic
GitHub user zhonghaihua opened a pull request: https://github.com/apache/spark/pull/10122 Join nondeterministic You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhonghaihua/spark join_nondeterministic Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10122.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10122 commit 6d8ebc801799714d297c83be6935b37e26dc2df7 Author: Xiangrui Meng <m...@databricks.com> Date: 2015-08-26T05:35:49Z [SPARK-10243] [MLLIB] update since versions in mllib.tree Same as #8421 but for `mllib.tree`. cc jkbradley Author: Xiangrui Meng <m...@databricks.com> Closes #8442 from mengxr/SPARK-10236. (cherry picked from commit fb7e12fe2e14af8de4c206ca8096b2e8113bfddc) Signed-off-by: Xiangrui Meng <m...@databricks.com> commit 08d390f457f80ffdc2dfce61ea579d9026047f12 Author: Xiangrui Meng <m...@databricks.com> Date: 2015-08-26T05:49:33Z [SPARK-10235] [MLLIB] update since versions in mllib.regression Same as #8421 but for `mllib.regression`. cc freeman-lab dbtsai Author: Xiangrui Meng <m...@databricks.com> Closes #8426 from mengxr/SPARK-10235 and squashes the following commits: 6cd28e4 [Xiangrui Meng] update since versions in mllib.regression (cherry picked from commit 4657fa1f37d41dd4c7240a960342b68c7c591f48) Signed-off-by: DB Tsai <d...@netflix.com> commit 21a10a86d20ec1a6fea42286b4d2aae9ce7e848d Author: Xiangrui Meng <m...@databricks.com> Date: 2015-08-26T06:45:41Z [SPARK-10236] [MLLIB] update since versions in mllib.feature Same as #8421 but for `mllib.feature`. 
cc dbtsai Author: Xiangrui Meng <m...@databricks.com> Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits: 0e8d658 [Xiangrui Meng] remove unnecessary comment ad70b03 [Xiangrui Meng] update since versions in mllib.feature (cherry picked from commit 321d7759691bed9867b1f0470f12eab2faa50aff) Signed-off-by: DB Tsai <d...@netflix.com> commit 5220db9e352b5d5eae59cead9478ca0a9f73f16b Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-08-26T06:48:16Z [SPARK-9316] [SPARKR] Add support for filtering using `[` (synonym for filter / select) Add support for ``` df[df$name == "Smith", c(1,2)] df[df$age %in% c(19, 30), 1:2] ``` shivaram Author: felixcheung <felixcheun...@hotmail.com> Closes #8394 from felixcheung/rsubset. (cherry picked from commit 75d4773aa50e24972c533e8b48697fde586429eb) Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu> commit b0dde36009ce371824ce3e47e60fa0711d7733bb Author: Xiangrui Meng <m...@databricks.com> Date: 2015-08-26T18:47:05Z [SPARK-9665] [MLLIB] audit MLlib API annotations I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs. cc jkbradley Author: Xiangrui Meng <m...@databricks.com> Closes #8452 from mengxr/SPARK-9665. (cherry picked from commit 6519fd06cc8175c9182ef16cf8a37d7f255eb846) Signed-off-by: Joseph K. Bradley <jos...@databricks.com> commit efbd7af44e855efcbb1fa224e80db24947e2b153 Author: Xiangrui Meng <m...@databricks.com> Date: 2015-08-26T21:02:19Z [SPARK-10241] [MLLIB] update since versions in mllib.recommendation Same as #8421 but for `mllib.recommendation`. cc srowen coderxiang Author: Xiangrui Meng <m...@databricks.com> Closes #8432 from mengxr/SPARK-10241. 
(cherry picked from commit 086d4681df3ebfccfc04188262c10482f44553b0) Signed-off-by: Xiangrui Meng <m...@databricks.com> commit 0bdb800575ae2872e2655983a1be94dcf2e8c36b Author: Davies Liu <dav...@databricks.com> Date: 2015-08-26T23:04:44Z [SPARK-10305] [SQL] fix create DataFrame from Python class cc jkbradley Author: Davies Liu <dav...@databricks.com> Closes #8470 from davies/fix_create_df. (cherry picked from commit d41d6c48207159490c1e1d9cc54015725cfa41b2) Signed-off-by: Davies Liu <davies@gmail.com> commit cef707d2185ca7e0c5635fabe709d5e26915b5bb Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu> Date: 2015-08-27T01:13:07Z [SPARK-10308] [SPARKR] Add %in% to the exported namespace I also checked all the other functions defined in column.R, functions.R and DataFrame.R and everything else looked fine. cc yu-iskw Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu> Closes #8473 from
[GitHub] spark pull request: Join nondeterministic
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10122#issuecomment-161533958 I am so sorry to have created this pull request; it is not on the right branch. I will close it right now. This is my mistake; sorry for the trouble.
[GitHub] spark pull request: Join nondeterministic
Github user zhonghaihua closed the pull request at: https://github.com/apache/spark/pull/10122
[GitHub] spark pull request: [SPARK-11517][SQL]Calc partitions in parallel ...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/9483#issuecomment-156810307 Hi @zhichao-li , thanks for doing this. I had a problem with scanning partitions slowly, and I applied this patch to my Spark version. In my case: * Before applying this patch, it took at least 3 or 4 minutes to scan partitions. * After applying it, this stage takes only about 20 seconds. I am happy to see it takes effect in my case; it solves my problem. And I think it would be better to add a conf to control whether to use this feature?