[ https://issues.apache.org/jira/browse/SPARK-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-13718:
------------------------------
    Summary: Scheduler "creating" straggler node  (was: Scheduler "creating" struggler node)

> Scheduler "creating" straggler node
> ------------------------------------
>
>                 Key: SPARK-13718
>                 URL: https://issues.apache.org/jira/browse/SPARK-13718
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.3.1
>         Environment: Spark 1.3.1
>                      MapR-FS
>                      Single rack
>                      Standalone mode scheduling
>                      8-node cluster
>                      48 cores & 512 GB RAM per node
>                      Data replication factor of 3
>                      Each node has 4 Spark executors configured with 12 cores and 22 GB of RAM each.
>            Reporter: Ioannis Deligiannis
>            Priority: Critical
>
> *Data:*
> * Assume an even distribution of data across the cluster with a replication factor of 3.
> * In-memory data are partitioned into 128 chunks (384 cores in total, i.e. 3 requests can be executed concurrently(-ish)).
>
> *Action:*
> * The action is a simple sequence of map/filter/reduce.
> * The action operates upon, and returns, a small subset of the data (following the full map over the data).
> * Data are cached once, serialized in memory with Kryo, so calling the action should not hit the disk under normal conditions.
> * Network usage of the action is low, as it returns a small number of aggregated results and does not require excessive shuffling.
> * Under low or moderate load, each action is expected to complete in less than 2 seconds.
>
> *H/W Outlook*
> When the action is called in high numbers, cluster CPU initially gets close to 100% (which is expected and intended).
> After a while, cluster utilization drops significantly, with only one (straggler) node at 100% CPU and with its network fully utilized.
>
> *Diagnosis:*
> 1. Attached a profiler to the driver and executors to monitor GC or I/O issues; everything is normal under low or heavy usage.
> 2. Cluster network usage is very low.
> 3. No issues on the Spark UI, except that tasks begin to move from LOCAL to ANY.
>
> *Cause:*
> 1. Node 'H' is doing marginally more work than the rest (being a little slower and at almost 100% CPU).
> 2. The scheduler hits the default 3000 ms spark.locality.wait and assigns the task to other nodes (in some cases it will assign at NODE level, which means loading from HDD, and then follow the sequence and fall back to ANY).
> 3. One of the nodes, 'X', that accepted the task will eventually try to read the data from node 'H's HDD. This adds HDD and network I/O to node 'H', as well as some extra CPU for I/O.
> 4. The time for 'X' to complete increases ~5x, as it now involves HDD and network I/O.
> 5. Eventually, every node has a task waiting to fetch that specific partition from node 'H', so the cluster is effectively blocked on a single node.
>
> *Proposed Fix*
> I have not worked with Scala enough to propose a code fix, but on a high level, when a task hits the 'spark.locality.wait' timeout, it should provide a 'hint' to the node accepting the task to use as a data source a replica that is not on the node that failed to accept the task in the first place.
>
> *Workaround*
> Playing with 'spark.locality.wait' values, there is a deterministic value, depending on partitions and config, at which the problem ceases to exist.
>
> *PS1*: I don't have enough Scala skills to follow the issue or propose a fix, but I hope this has enough information to make sense.
> *PS2*: Debugging this issue made me realize that there can be a lot of use-cases that trigger this behaviour.
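For context, the workload described in the *Data:* and *Action:* sections corresponds roughly to the following kind of job. This is only a minimal sketch under assumptions, not the reporter's actual code; the record type, input path, partition count wiring and threshold are illustrative.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative record type; the real schema is not given in the report.
case class Record(key: String, value: Double)

object StragglerRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("straggler-repro")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Record]))

    val sc = new SparkContext(conf)

    // 128 in-memory partitions across the cluster, cached once, serialized with Kryo.
    val data = sc.textFile("maprfs:///data/records")   // hypothetical path
      .map { line => val f = line.split(','); Record(f(0), f(1).toDouble) }
      .repartition(128)
      .persist(StorageLevel.MEMORY_ONLY_SER)

    // The repeated action: full map/filter over the cached data,
    // returning a small aggregated result to the driver.
    val threshold = 0.95
    val hits = data
      .filter(_.value > threshold)
      .map(r => (r.key, 1L))
      .reduceByKey(_ + _)
      .collect()

    println(s"matched keys: ${hits.length}")
    sc.stop()
  }
}
{code}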
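The *Workaround* amounts to raising the locality-wait thresholds so the scheduler waits longer before degrading a task from PROCESS_LOCAL/NODE_LOCAL towards ANY. The settings below (on the application's SparkConf, or equivalently in spark-defaults.conf) are the relevant knobs; the 10000 ms values are placeholders, since the report notes the workable value depends on partitioning and cluster configuration.

{code:scala}
import org.apache.spark.SparkConf

// Locality-wait knobs behind the fallback described in the "Cause" section.
// Values are in milliseconds; the 1.3.x default for spark.locality.wait is 3000 ms.
val conf = new SparkConf()
  .set("spark.locality.wait", "10000")          // base wait before relaxing locality
  .set("spark.locality.wait.process", "10000")  // PROCESS_LOCAL -> NODE_LOCAL
  .set("spark.locality.wait.node", "10000")     // NODE_LOCAL -> RACK_LOCAL
  .set("spark.locality.wait.rack", "10000")     // RACK_LOCAL -> ANY
{code}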
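The *Proposed Fix* is essentially a rule for replica selection once locality has timed out. The snippet below is only a toy illustration of that rule, not Spark scheduler code; the object and method names are made up for the example.

{code:scala}
object ReplicaHint {
  // Once locality on `overloadedHost` has timed out, prefer any other replica of
  // the block; fall back to the overloaded host only if it holds the sole replica.
  def pickReplica(replicaHosts: Seq[String], overloadedHost: String): String =
    replicaHosts.find(_ != overloadedHost).getOrElse(overloadedHost)
}

// Example: ReplicaHint.pickReplica(Seq("nodeH", "nodeB", "nodeE"), "nodeH") returns "nodeB".
{code}

With a selection rule like this, a task that could not be scheduled on the hot node 'H' would read one of the other two replicas, instead of every remote task converging on 'H'.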