[ 
https://issues.apache.org/jira/browse/SPARK-34154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272990#comment-17272990
 ] 

Attila Zsolt Piros commented on SPARK-34154:
--------------------------------------------

In my PR the bug did not surface even after 200 runs.

I think the root cause could be related to the hostname lookup. I checked, 
and it is called from here:

{code:java}
        at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java)
        at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:585)
        at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
        at org.apache.spark.deploy.yarn.SparkRackResolver.coreResolve(SparkRackResolver.scala:75)
        at org.apache.spark.deploy.yarn.SparkRackResolver.resolve(SparkRackResolver.scala:66)
        at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.$anonfun$localityOfRequestedContainers$3(LocalityPreferredContainerPlacementStrategy.scala:142)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.localityOfRequestedContainers(LocalityPreferredContainerPlacementStrategy.scala:138)
        at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.org$apache$spark$deploy$yarn$LocalityPlacementStrategySuite$$runTest(LocalityPlacementStrategySuite.scala:94)
        at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite$$anon$1.run(LocalityPlacementStrategySuite.scala:40)
        at java.lang.Thread.run(Thread.java:748)
{code}
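
To check whether name resolution itself is slow on a given machine, the lookup can be timed in isolation. This is only a rough sketch: it assumes the top frames of the trace end up in the JDK's `InetAddress.getByName`, and the class `HostLookupTimer` and its method names are hypothetical, not part of Hadoop or Spark.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for timing a single hostname lookup outside of the test suite.
final class HostLookupTimer {

    // Resolves the host to an IP address string; returns the input unchanged
    // when it cannot be resolved (assumed fallback behaviour, for illustration).
    static String normalize(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return host;
        }
    }

    // Returns the wall-clock time in milliseconds spent resolving one host.
    static long timeLookupMillis(String host) {
        long start = System.nanoTime();
        normalize(host);
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Calling `HostLookupTimer.timeLookupMillis(...)` with a few of the host names used by the suite would show whether individual lookups take milliseconds or seconds in the failing environment.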

To get more information I will modify my code a bit and make it mergeable. This 
way, when it fails again in somebody's PR, we will at least have the stack trace. 
In addition, it will fail fast: it will run for only 30 seconds instead of hours.
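
The fail-fast idea can be sketched roughly as follows. This is not the code from the PR, just a minimal illustration: the test body runs in a separate thread, and if it has not finished within the timeout, the stack trace of that thread is captured and reported. The names `FailFastRunner`, `runWithTimeout` and `TaskTimedOutException` are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a task in its own thread and fails fast with the
// worker's stack trace if the task is still running after the timeout.
final class FailFastRunner {

    // Thrown on timeout; the message contains the worker thread's stack trace.
    static class TaskTimedOutException extends RuntimeException {
        TaskTimedOutException(String message) {
            super(message);
        }
    }

    static void runWithTimeout(Runnable task, long timeout, TimeUnit unit) {
        Thread worker = new Thread(task, "fail-fast-worker");
        worker.setDaemon(true); // a hung worker must not keep the JVM alive
        worker.start();
        try {
            // join(0) would wait forever, so wait at least one millisecond
            worker.join(Math.max(1, unit.toMillis(timeout)));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the task", e);
        }
        if (worker.isAlive()) {
            StringBuilder sb = new StringBuilder();
            sb.append("Task still running after ").append(timeout).append(' ')
              .append(unit).append("; worker stack trace:\n");
            // Snapshot of where the worker is currently stuck
            for (StackTraceElement frame : worker.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            worker.interrupt(); // best effort; a blocked native call may ignore it
            throw new TaskTimedOutException(sb.toString());
        }
    }
}
```

With a 30-second timeout, a hang would then show up as a normal test failure carrying the stack trace of the stuck thread, rather than as a job that runs until the CI timeout.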


> Flaky Test: LocalityPlacementStrategySuite.handle large number of containers 
> and tasks (SPARK-18750)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34154
>                 URL: https://issues.apache.org/jira/browse/SPARK-34154
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 3.0.2, 3.2.0, 3.1.1
>            Reporter: Dongjoon Hyun
>            Priority: Major
>
> `LocalityPlacementStrategySuite` sometimes hangs, as shown below. We can 
> retrigger the run, but that wastes significant resources because the job 
> hangs until the timeout (6 hours) occurs.
> [https://github.com/apache/spark/runs/1719480243]
> [https://github.com/apache/spark/runs/1724459002]
> [https://github.com/apache/spark/runs/1717958874]
> [https://github.com/apache/spark/runs/1731673955] (branch-3.0)
> {code:java}
> [info] LocalityPlacementStrategySuite:
> 17299[info] *** Test still running after 3 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17300[info] *** Test still running after 8 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17301[info] *** Test still running after 13 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17302[info] *** Test still running after 18 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17303[info] *** Test still running after 23 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17304[info] *** Test still running after 28 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17305[info] *** Test still running after 33 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17306[info] *** Test still running after 38 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17307[info] *** Test still running after 43 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17308[info] *** Test still running after 48 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17309[info] *** Test still running after 53 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17310[info] *** Test still running after 58 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17311[info] *** Test still running after 1 hour, 3 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17312[info] *** Test still running after 1 hour, 8 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17313[info] *** Test still running after 1 hour, 13 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17314[info] *** Test still running after 1 hour, 18 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17315[info] *** Test still running after 1 hour, 23 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17316[info] *** Test still running after 1 hour, 28 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17317[info] *** Test still running after 1 hour, 33 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17318[info] *** Test still running after 1 hour, 38 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17319[info] *** Test still running after 1 hour, 43 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17320[info] *** Test still running after 1 hour, 48 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17321[info] *** Test still running after 1 hour, 53 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17322[info] *** Test still running after 1 hour, 58 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17323[info] *** Test still running after 2 hours, 3 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17324[info] *** Test still running after 2 hours, 8 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17325[info] *** Test still running after 2 hours, 13 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17326[info] *** Test still running after 2 hours, 18 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17327[info] *** Test still running after 2 hours, 23 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17328[info] *** Test still running after 2 hours, 28 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17329[info] *** Test still running after 2 hours, 33 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17330[info] *** Test still running after 2 hours, 38 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17331[info] *** Test still running after 2 hours, 43 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17332[info] *** Test still running after 2 hours, 48 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17333[info] *** Test still running after 2 hours, 53 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17334[info] *** Test still running after 2 hours, 58 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17335[info] *** Test still running after 3 hours, 3 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17336[info] *** Test still running after 3 hours, 8 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17337[info] *** Test still running after 3 hours, 13 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17338[info] *** Test still running after 3 hours, 18 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17339[info] *** Test still running after 3 hours, 23 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17340[info] *** Test still running after 3 hours, 28 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17341[info] *** Test still running after 3 hours, 33 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17342[info] *** Test still running after 3 hours, 38 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17343[info] *** Test still running after 3 hours, 43 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17344[info] *** Test still running after 3 hours, 48 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17345[info] *** Test still running after 3 hours, 53 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17346[info] *** Test still running after 3 hours, 58 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17347[info] *** Test still running after 4 hours, 3 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17348[info] *** Test still running after 4 hours, 8 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17349[info] *** Test still running after 4 hours, 13 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17350[info] *** Test still running after 4 hours, 18 minutes, 6 seconds: 
> suite name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750).  {code}


