[ https://issues.apache.org/jira/browse/SPARK-19226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maheedhar Reddy Chappidi updated SPARK-19226:
---------------------------------------------
Description:

With the exponential [1] increase in executor count, the Reporter thread [2] fails without a proper message:

==
17/01/12 09:33:44 INFO YarnAllocator: Driver requested a total number of 32767 executor(s).
17/01/12 09:33:44 INFO YarnAllocator: Will request 24576 executor containers, each with 2 cores and 5632 MB memory including 512 MB overhead
17/01/12 09:33:44 INFO YarnAllocator: Canceled 0 container requests (locality no longer needed)
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34419 executor(s).
17/01/12 09:33:52 INFO ApplicationMaster: Final app status: FAILED, exitCode: 12, (reason: Exception was thrown 1 time(s) from Reporter thread.)
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34410 executor(s).
17/01/12 09:33:52 INFO YarnAllocator: Driver requested a total number of 34409 executor(s).
17/01/12 09:33:52 INFO ShutdownHookManager: Shutdown hook called
==

We were able to run the workflows by capping the executor count via spark.dynamicAllocation.maxExecutors, which keeps the allocator from escalating its requests (35k->65k); a config sketch follows below. Additionally, I don't see any issues with the ApplicationMaster's container memory/compute.

Is it possible to surface a more specific error reason from the if/else in the Reporter thread? (See the sketch after the references.)

[1] https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
[2] https://github.com/apache/spark/blob/01e14bf303e61a5726f3b1418357a50c1bf8b16f/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L446-L480
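For reference, the workaround amounts to capping dynamic allocation before the allocator can double past what the cluster can grant. A minimal sketch, using the standard Spark 2.x config keys; the cap of 2000 and the app name are illustrative, not recommendations:

==
import org.apache.spark.sql.SparkSession

// Cap dynamic allocation so the driver never escalates to 35k->65k requests.
// The value 2000 is illustrative; size it to what YARN can actually grant.
val spark = SparkSession.builder()
  .appName("capped-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // required by dynamic allocation on YARN
  .config("spark.dynamicAllocation.maxExecutors", "2000")
  .getOrCreate()
==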
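As for surfacing the failure reason: a minimal sketch of what the catch branch in [2] could do instead, assuming it keeps the last Throwable and appends it to the reason string passed to finish(). This is illustrative code, not the actual ApplicationMaster internals; ReporterSketch, runReporter, maxFailures, and this finish() signature are made-up stand-ins for their counterparts in [2]:

==
// Sketch of a reporter-style retry loop that preserves the last failure
// cause in the final status message instead of only the failure count.
object ReporterSketch {
  // Stand-in for the real finish(status, exitCode, reason) in [2].
  def finish(status: String, exitCode: Int, reason: String): Unit =
    println(s"Final app status: $status, exitCode: $exitCode, (reason: $reason)")

  def runReporter(heartbeat: () => Unit, maxFailures: Int, intervalMs: Long): Unit = {
    var failureCount = 0
    var finished = false
    while (!finished) {
      try {
        heartbeat()       // e.g. allocator.allocateResources() in [2]
        failureCount = 0  // a successful heartbeat resets the counter
      } catch {
        case e: Throwable =>
          failureCount += 1
          if (failureCount >= maxFailures) {
            // Append the underlying cause so the final status explains *why*
            // the Reporter thread died, not just how many times it failed.
            finish("FAILED", 12,
              s"Exception was thrown $failureCount time(s) from Reporter thread, " +
              s"last failure: ${e.getClass.getName}: ${e.getMessage}")
            finished = true
          }
      }
      if (!finished) Thread.sleep(intervalMs)
    }
  }
}
==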
> Report failure reason from Reporter Thread
> -------------------------------------------
>
>                 Key: SPARK-19226
>                 URL: https://issues.apache.org/jira/browse/SPARK-19226
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 2.0.2
>       Environment: emr-5.2.1 with Zeppelin 0.6.2/Spark 2.0.2 and 10 r3.xl core nodes
>            Reporter: Maheedhar Reddy Chappidi
>            Priority: Minor
>


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org