[GitHub] spark pull request #21527: [SPARK-24519] Make the threshold for highly compr...
Github user hthuynh2 commented on a diff in the pull request: https://github.com/apache/spark/pull/21527#discussion_r218656012

--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -50,7 +50,9 @@ private[spark] sealed trait MapStatus {
 private[spark] object MapStatus {
   def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
-    if (uncompressedSizes.length > 2000) {
+    if (uncompressedSizes.length > Option(SparkEnv.get)
--- End diff --

How about creating a "static" val shuffleMinNumPartsToHighlyCompress for this? Please let me know if this is good for you so I can update it. Thanks.

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21953: [SPARK-24992][Core] spark should randomize yarn local di...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21953 @tgravescs I updated it. Thanks.
[GitHub] spark issue #21936: [SPARK-24981][Core] ShutdownHook timeout causes job to f...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21936 @tgravescs I updated. Thanks.
[GitHub] spark issue #21953: [SPARK-24992][Core] spark should randomize yarn local di...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21953 @tgravescs Can you test this please? Thank you.
[GitHub] spark pull request #21953: [SPARK-24992][Core] spark should randomize yarn l...
GitHub user hthuynh2 opened a pull request: https://github.com/apache/spark/pull/21953

[SPARK-24992][Core] spark should randomize yarn local dir selection

**Description: [SPARK-24992](https://issues.apache.org/jira/browse/SPARK-24992)**
Utils.getLocalDir is used to get the path of a temporary directory. However, it always returns the same directory: the first element of the array localRootDirs. When running on YARN, this means we may always write to one disk, keeping it busy while other disks sit idle. We should randomize the selection to spread out the load.

**What changes were proposed in this pull request?**
This PR randomizes the selection of the local directory inside the method Utils.getLocalDir. This change affects the Utils.fetchFile method, since it relied on Utils.getLocalDir always returning the same directory for caching files. Therefore, a new variable cachedLocalDir is used to cache the first local directory obtained from Utils.getLocalDir. Also, when getting the configured local directories (inside Utils.getConfiguredLocalDirs) in YARN mode, the array of directories is randomized before being returned.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hthuynh2/spark SPARK_24992

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21953.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #21953

commit 3986e75c3c000e7a7e7674be6837d663499f35f1
Author: Hieu Huynh <hieu.huynh@...>
Date: 2018-08-01T22:31:48Z

initial commit
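The approach described above can be sketched as a standalone snippet; this is an illustrative reconstruction, not the actual Utils code (the object and method signatures are assumptions, though cachedLocalDir mirrors the PR description):

```scala
import scala.util.Random

// Sketch: pick a random local dir instead of always localRootDirs(0),
// but cache the first choice so file-fetch caching stays in one place.
object LocalDirSketch {
  @volatile private var cachedLocalDir: String = ""

  def getLocalDir(localRootDirs: Array[String]): String =
    // Randomized selection spreads writes across all configured disks.
    localRootDirs(Random.nextInt(localRootDirs.length))

  def getCachedLocalDir(localRootDirs: Array[String]): String = synchronized {
    if (cachedLocalDir.isEmpty) {
      cachedLocalDir = getLocalDir(localRootDirs)
    }
    cachedLocalDir
  }
}
```

The cache is needed precisely because callers of fetchFile previously depended on the deterministic first-element behavior.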
[GitHub] spark pull request #21936: [SPARK-24981][Core] ShutdownHook timeout causes j...
Github user hthuynh2 commented on a diff in the pull request: https://github.com/apache/spark/pull/21936#discussion_r206918869

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -571,7 +571,12 @@ class SparkContext(config: SparkConf) extends Logging {
     _shutdownHookRef = ShutdownHookManager.addShutdownHook(
       ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
       logInfo("Invoking stop() from shutdown hook")
-      stop()
+      try {
+        stop()
+      } catch {
+        case e: Throwable =>
+          logWarning("Ignoring Exception while stoping SparkContext. Exception: " + e)
--- End diff --

@felixcheung Thanks for the comments. I updated it.
[GitHub] spark issue #21936: [SPARK-24981][Core] ShutdownHook timeout causes job to f...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21936 @tgravescs Can you please have a look at it? Thanks.
[GitHub] spark pull request #21936: [SPARK-24981][Core] ShutdownHook timeout causes j...
GitHub user hthuynh2 opened a pull request: https://github.com/apache/spark/pull/21936

[SPARK-24981][Core] ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program

**Description**
The issue is described in [SPARK-24981](https://issues.apache.org/jira/browse/SPARK-24981).

**How does this PR fix the issue?**
This PR catches the Exception that is thrown while SparkContext.stop() is running (when it is called by the ShutdownHookManager).

**How was this patch tested?**
I tested it manually by adding a delay (60s) inside stop(). This makes the ShutdownHookManager interrupt the thread that is running stop(). The InterruptedException was caught and the job succeeded.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hthuynh2/spark SPARK_24981

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21936.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #21936

commit 101bf2b0488ab495dbdaf7b54d9a57d9827cb833
Author: Hieu Huynh <hieu.huynh@...>
Date: 2018-07-31T18:30:11Z

initial commit
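The catch-around-stop() idea can be sketched in isolation; safeStop and its function parameters here are hypothetical stand-ins for the SparkContext internals, not the real API:

```scala
object ShutdownHookSketch {
  // Wrap stop() so that an interrupt (e.g. from a shutdown-hook timeout)
  // cannot propagate and turn an already-succeeded job into a failure.
  def safeStop(stop: () => Unit, logWarning: String => Unit): Unit = {
    try {
      stop()
    } catch {
      // An InterruptedException here typically means the shutdown hook
      // manager timed out; swallowing it leaves the job outcome untouched.
      case e: Throwable =>
        logWarning("Ignoring exception while stopping SparkContext: " + e)
    }
  }
}
```

Catching Throwable is deliberately broad here because at shutdown time there is nothing useful left to do with the error besides log it.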
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21653 @tgravescs Can you please run the test again, thank you.
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21653 @tgravescs I updated it. Can you please have a look at it when you have time. Thank you.
[GitHub] spark pull request #21729: [SPARK-24755][Core] Executor loss can cause task ...
Github user hthuynh2 commented on a diff in the pull request: https://github.com/apache/spark/pull/21729#discussion_r201412205

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
   // Set the coresponding index of Boolean var when the task killed by other attempt tasks,
--- End diff --

I'll update it. Thanks.
[GitHub] spark pull request #21729: [SPARK-24755][Core] Executor loss can cause task ...
Github user hthuynh2 commented on a diff in the pull request: https://github.com/apache/spark/pull/21729#discussion_r201411743

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
   // Set the coresponding index of Boolean var when the task killed by other attempt tasks,
   // this happened while we set the `spark.speculation` to true. The task killed by others
   // should not resubmit while executor lost.
-  private val killedByOtherAttempt: Array[Boolean] = new Array[Boolean](numTasks)
+  private val killedByOtherAttempt = new HashSet[Long]
--- End diff --

I think we should use ArrayBuffer[Long] instead of Array[Long], because the number of elements can grow as there are more killed attempts. Also, I think there is a downside to using an array-like data structure for this variable: lookup in an array takes linear time, and that operation is used many times when we check whether a task needs to be resubmitted (inside the executorLost method of the TSM). This doesn't matter much if the array is small, but it is still something we might want to consider.
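The lookup-cost trade-off mentioned above can be shown in a small sketch (names are illustrative, not the TaskSetManager fields themselves):

```scala
import scala.collection.mutable.{ArrayBuffer, HashSet}

// Membership tests against an ArrayBuffer scan linearly, while a HashSet
// answers in expected constant time -- relevant because executorLost()
// performs one such check per task on the lost executor.
val killedAsBuffer = ArrayBuffer[Long](101L, 102L, 103L)
val killedAsSet = HashSet[Long](101L, 102L, 103L)

killedAsBuffer.contains(102L) // O(n): walks the buffer element by element
killedAsSet.contains(102L)    // expected O(1): single hash probe
```

For a handful of killed attempts either works; the HashSet simply removes the question from consideration.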
[GitHub] spark pull request #21729: [SPARK-24755][Core] Executor loss can cause task ...
Github user hthuynh2 commented on a diff in the pull request: https://github.com/apache/spark/pull/21729#discussion_r201372921

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
   // Set the coresponding index of Boolean var when the task killed by other attempt tasks,
   // this happened while we set the `spark.speculation` to true. The task killed by others
   // should not resubmit while executor lost.
-  private val killedByOtherAttempt: Array[Boolean] = new Array[Boolean](numTasks)
+  private val killedByOtherAttempt = new HashSet[Long]
--- End diff --

Also, the comment "Set the corresponding index of Boolean var when the task killed ..." is no longer correct. I'm sorry I forgot to update it.
[GitHub] spark pull request #21729: [SPARK-24755][Core] Executor loss can cause task ...
Github user hthuynh2 commented on a diff in the pull request: https://github.com/apache/spark/pull/21729#discussion_r201371752

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
   // Set the coresponding index of Boolean var when the task killed by other attempt tasks,
   // this happened while we set the `spark.speculation` to true. The task killed by others
   // should not resubmit while executor lost.
-  private val killedByOtherAttempt: Array[Boolean] = new Array[Boolean](numTasks)
+  private val killedByOtherAttempt = new HashSet[Long]
--- End diff --

Hi @jiangxb1987, thanks for the comment, but I'm not sure I understand your suggestion correctly. Do you mean: private val killedByOtherAttempt = new Array[Long] ?
[GitHub] spark issue #21729: [SPARK-24755][Core] Executor loss can cause task to not ...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21729 @xuanyuanking Thanks for the comments. I also thought about modifying the UT of SPARK-22074 instead of adding a new UT, but I was afraid it might cause confusion, since they are two different, though closely related, issues. If you feel it is better to combine them, I can change it. Thanks.
[GitHub] spark issue #21729: SPARK-24755 Executor loss can cause task to not be resub...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21729 cc @mridulm @xuanyuanking
[GitHub] spark pull request #21729: SPARK-24755 Executor loss can cause task to not b...
GitHub user hthuynh2 opened a pull request: https://github.com/apache/spark/pull/21729

SPARK-24755 Executor loss can cause task to not be resubmitted

**Description**
As described in [SPARK-24755](https://issues.apache.org/jira/browse/SPARK-24755), when speculation is enabled, there is a scenario where executor loss can cause a task to not be resubmitted. This patch changes the variable killedByOtherAttempt to keep track of the taskIds of tasks that are killed by another attempt. By doing this, we can still prevent resubmitting a task killed by another attempt while resubmitting the successful attempt when an executor is lost.

**How was this patch tested?**
A UT is added based on the UT written by @xuanyuanking, with modifications to simulate the scenario described in SPARK-24755.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hthuynh2/spark SPARK_24755

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21729.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #21729

commit 093e39cf76378821284ef7d771e819afb69930ae
Author: Hieu Huynh <hieu.huynh@...>
Date: 2018-07-08T18:20:26Z

SPARK-24755 Executor loss can cause task to not be resubmitted
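The core of the change can be sketched standalone; the class and method names below are illustrative, not the real TaskSetManager API:

```scala
import scala.collection.mutable.HashSet

// Sketch: track the task IDs of attempts killed by another attempt,
// rather than a per-partition Boolean. On executor loss we then resubmit
// a lost attempt unless that particular attempt was the one killed by a
// successful sibling attempt.
class KilledByOtherAttemptTracker {
  private val killedByOtherAttempt = HashSet[Long]()

  def markKilled(taskId: Long): Unit =
    killedByOtherAttempt += taskId

  // A per-partition Boolean could not distinguish which attempt was
  // killed; keying on taskId restores that distinction.
  def shouldResubmit(lostTaskId: Long): Boolean =
    !killedByOtherAttempt.contains(lostTaskId)
}
```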
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21653 @squito Thanks for the suggestions. I updated it. Could you please have a look at it to see if there is anything else I need to change? Thanks.
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21653 @jiangxb1987 yes, you are correct that it is actually ignored. I don't think it is worth adding a new TaskState, because it would require changes in many places without adding much benefit. Instead, I think we can add some text to the kill message to differentiate it from tasks that are actually killed, and to inform the user.
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21653 I updated it. Thanks.
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21653 cc @tgravescs
[GitHub] spark pull request #21653: [SPARK-13343] speculative tasks that didn't commi...
GitHub user hthuynh2 opened a pull request: https://github.com/apache/spark/pull/21653

[SPARK-13343] speculative tasks that didn't commit shouldn't be marked as success

**Description**
Currently, speculative tasks that didn't commit can show up as success or failure (depending on the timing of the commit). This is a bit confusing, because such a task didn't really succeed in the sense that it didn't write anything. These tasks should be marked as KILLED, or something that makes it more obvious to the user exactly what happened. If a task happens to hit the timing where it gets a commit-denied exception, it shows up as failed and counts against your task failures. It shouldn't count against task failures, since that failure really doesn't matter. MapReduce handles this situation, so perhaps we can look there for a model. (screenshot: https://user-images.githubusercontent.com/15680678/42013170-99db48c2-7a61-11e8-8c7b-ef94c84e36ea.png)

**How can this issue happen?**
When both attempts of a task finish before the driver sends the command to kill one of them, both send the status update FINISHED to the driver. The driver calls TaskSchedulerImpl to handle one successful task at a time. When it handles the first successful task, it sends the command to kill the other copy of the task; however, because that task is already finished, the executor ignores the command. After handling the first attempt, it processes the second one, and although all actions on the result of this task are skipped, this copy of the task is still marked as SUCCESS. As a result, even though this issue does not affect the result of the job, it can confuse users, because both attempts appear to be successful.

**How does this PR fix the issue?**
The simple way to fix this issue is that when the TaskSetManager handles a successful task, it checks whether any other attempt has already succeeded. If so, it calls handleFailedTask with state == KILLED and reason == TaskKilled("another attempt succeeded") to handle this task as being killed.

**How was this patch tested?**
I tested this manually by running applications that previously caused the issue a few times, and observed that the issue does not happen again. Also, I added a unit test in TaskSetManagerSuite to verify that if we call handleSuccessfulTask to handle status updates for two copies of a task, only the one handled first is marked as SUCCESS.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hthuynh2/spark SPARK_13343

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21653.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #21653

commit 8f7d98177816e11659cf79a2b28f96bd4b7173d5
Author: Hieu Huynh <hieu.huynh@...>
Date: 2018-06-28T04:19:14Z

Fixed issue and added unit test
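The first-attempt-wins check can be sketched in miniature; SuccessTracker and its method are hypothetical, standing in for the TaskSetManager bookkeeping rather than reproducing it:

```scala
import scala.collection.mutable.HashSet

// Sketch: when a status update reports success, first check whether another
// attempt of the same task index already succeeded; if so, treat this copy
// as killed instead of recording a second SUCCESS.
class SuccessTracker {
  private val succeededIndexes = HashSet[Int]()

  /** Returns "SUCCESS" for the first successful attempt of a task index,
    * and "KILLED (another attempt succeeded)" for any later attempt. */
  def handleSuccess(index: Int): String =
    if (succeededIndexes.add(index)) "SUCCESS"
    else "KILLED (another attempt succeeded)"
}
```

HashSet.add returns false when the element was already present, which is exactly the "some other attempt got there first" signal the fix relies on.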
[GitHub] spark issue #21527: [SPARK-24519] Make the threshold for highly compressed m...
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21527 @vanzin @squito Thanks for the comments. I updated the PR; please have a look and let me know if anything needs to be changed. Thank you.
[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21527 @tgravescs I updated it. Thanks.
[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21527 I updated it. Thanks.
[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded
Github user hthuynh2 commented on the issue: https://github.com/apache/spark/pull/21527 @tgravescs
[GitHub] spark pull request #21527: Spark branch 1
GitHub user hthuynh2 opened a pull request: https://github.com/apache/spark/pull/21527

Spark branch 1

**Problem**
MapStatus uses a hardcoded value of 2000 partitions to determine whether it should use a highly compressed map status. We should make it configurable.

**What changes were proposed in this pull request?**
I made the hardcoded value mentioned above configurable under the name _SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS_, with a default value of 2000. Users can set it to the value they want via the property _spark.shuffle.minNumPartitionsToHighlyCompress_.

**How was this patch tested?**
I wrote a unit test to make sure that the default value is 2000, and that an _IllegalArgumentException_ is thrown if the user sets it to a non-positive value. The unit test also checks that the highly compressed map status is correctly used when the number of partitions is greater than _SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS_.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hthuynh2/spark spark_branch_1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21527.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #21527

commit 93582bd1ce114368654ff896749c517d979ed23a
Author: Hieu Huynh <hieu.huynh@...>
Date: 2018-06-11T13:47:02Z

Change MapStatus hardcode value to configurable

commit d3f24b501c68f8ef22726d711a887268d02a9fc7
Author: Hieu Huynh <hieu.huynh@...>
Date: 2018-06-11T14:16:25Z

Fixed incorrect name
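The shape of the change can be sketched with a simplified conf (the property name comes from the PR description; the surrounding API here is a deliberately simplified assumption, not Spark's ConfigBuilder machinery):

```scala
// Sketch: read the threshold from configuration with a default of 2000,
// validate that it is positive, and use it to pick the MapStatus variant.
def minPartsToHighlyCompress(conf: Map[String, String]): Int = {
  val v = conf
    .getOrElse("spark.shuffle.minNumPartitionsToHighlyCompress", "2000")
    .toInt
  require(v > 0,
    "spark.shuffle.minNumPartitionsToHighlyCompress must be positive")
  v
}

def chooseMapStatus(numPartitions: Int, threshold: Int): String =
  // Mirrors the original hardcoded check `uncompressedSizes.length > 2000`.
  if (numPartitions > threshold) "HighlyCompressedMapStatus"
  else "CompressedMapStatus"
```

The require call reflects the IllegalArgumentException behavior the unit test above checks for non-positive values.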