[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Al M updated SPARK-24474:
-------------------------

Description:

I've observed an issue happening consistently when:
* A job contains a join of two datasets
* One dataset is much larger than the other
* Both datasets require some processing before they are joined

What I have observed is:
* 2 stages are initially active to run processing on the two datasets
** These stages are run in parallel
** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks)
** Spark allocates a similar (though not exactly equal) number of cores to each stage
* First stage completes (for the smaller dataset)
** Now there is only one stage running
** It still has many tasks left (usually > 20k tasks)
** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
** This continues until the second stage completes
* Second stage completes, and third begins (the stage that actually joins the data)
** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200)

Other interesting things about this:
* It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages
* Once all active stages are done, we release all cores to new stages
* I can't reproduce this locally on my machine, only on a cluster with YARN enabled
* It happens when dynamic allocation is enabled, and when it is disabled
* The stage that hangs (referred to as "Second stage" above) has a lower 'Stage Id' than the first one that completes
* This happens with spark.shuffle.service.enabled set to true and false

was: (the previous description, identical apart from the newly added final bullet about spark.shuffle.service.enabled)
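For context, a minimal sketch of the job shape described above. The dataset names, row counts, the 30k vs 2k partition counts, and the md5/flag transformations are illustrative assumptions, not taken from the actual job:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object IdleCoresSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-24474 idle-cores sketch")
      .getOrCreate()
    import spark.implicits._

    // Larger input: many partitions, standing in for the ~30k-task stage.
    val large = spark.range(0L, 3000000000L, 1L, 30000)
      .withColumn("key", $"id" % 1000000)
      .withColumn("payload", md5($"id".cast("string")))   // "some processing"

    // Smaller input: far fewer partitions, standing in for the ~2k-task stage.
    val small = spark.range(0L, 200000000L, 1L, 2000)
      .withColumn("key", $"id" % 1000000)
      .withColumn("flag", $"id" % 2 === 0)                // "some processing"

    // The join shuffles both inputs, so the two map stages above are
    // submitted together; the idle cores were observed after the smaller
    // stage finished but before the larger one did.
    val joined = large.join(small, "key")
    println(joined.count())

    spark.stop()
  }
}
{code}

Running something of this shape on a YARN cluster schedules the two shuffle-map stages concurrently; per the report, once the smaller stage finishes, the remaining stage keeps only roughly its original share of the cores until it completes.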
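The configuration-related observations (dynamic allocation on/off, external shuffle service on/off) can be toggled explicitly when trying to reproduce. A hedged sketch of a driver-side setup for spark-shell on YARN, where the executor sizing (50 executors x 4 cores = 200 cores) is an assumption chosen only to match the numbers quoted above:

{code:scala}
// All values here are assumptions; the report only states that the behaviour
// appears with dynamic allocation and the external shuffle service both
// enabled and disabled.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-24474 config sketch")
  .master("yarn")                                       // reproducible only on YARN per the report
  .config("spark.executor.instances", "50")             // 50 executors x 4 cores = 200 total cores
  .config("spark.executor.cores", "4")
  .config("spark.dynamicAllocation.enabled", "false")   // issue seen with both true and false
  .config("spark.shuffle.service.enabled", "false")     // issue seen with both true and false
  .getOrCreate()
{code}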
> Cores are left idle when there are a lot of stages to run
> ---------------------------------------------------------
>
>                 Key: SPARK-24474
>                 URL: https://issues.apache.org/jira/browse/SPARK-24474
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.2.0
>            Reporter: Al M
>            Priority: Major