[ https://issues.apache.org/jira/browse/SPARK-34063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Calvin Pietersen updated SPARK-34063:
-------------------------------------

Description:

A Spark Streaming application runs at a 60s batch interval and normally processes each batch in around 40s. After ~8600 batches (around 6 days), the application suddenly hits a wall: processing time jumps to 2-2.4 minutes, and the application eventually dies with exit code 137 (SIGKILL, typically an out-of-memory or container kill). This happens consistently every 6 days, regardless of the data.

Looking at the application logs, it appears that when the issue begins, tasks are being completed by the executors, but the driver takes a long time to acknowledge them. I have taken numerous heap dumps of the driver (before it hits the 6-day wall) using *jcmd* and can see that *org.apache.spark.scheduler.AsyncEventQueue* is growing in size, even though the application is still keeping up with batches. I have yet to take a snapshot of the application in the broken state.

> Major slowdown in spark streaming after 6 days
> ----------------------------------------------
>
>                 Key: SPARK-34063
>                 URL: https://issues.apache.org/jira/browse/SPARK-34063
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Calvin Pietersen
>            Priority: Major
>         Attachments: normal-job, slow-job
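For reference, below is a minimal sketch of the reported setup, not the reporter's actual job: a DStream application on a 60s batch interval, with the driver's listener-bus queue capacity raised via spark.scheduler.listenerbus.eventqueue.capacity (a real Spark setting, default 10000, which bounds each AsyncEventQueue). The master, app name, capacity value, and queueStream source are illustrative assumptions.

{code:scala}
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SlowdownRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")   // illustrative; the reported job runs on a cluster
      .setAppName("SlowdownRepro")
      // Each AsyncEventQueue is bounded by this capacity (default 10000).
      // Raising it trades driver heap for fewer dropped listener events.
      .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000")

    // 60s batch interval, matching the reported setup.
    val ssc = new StreamingContext(conf, Seconds(60))

    // Stand-in source so the job has at least one output operation;
    // the real application's sources and transformations are unknown.
    val queue = mutable.Queue.empty[RDD[Int]]
    val stream = ssc.queueStream(queue)
    stream.foreachRDD(rdd => println(s"batch count: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}

When an AsyncEventQueue reaches its capacity, Spark drops further events and logs a warning like "Dropping event from queue ...", so a queue that keeps growing in heap dumps suggests a listener that cannot drain events as fast as the scheduler posts them.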
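One way to confirm a driver-side bottleneck without waiting for the 6-day wall is a trivial SparkListener that counts the events it actually processes, with heap dumps taken alongside via jcmd (e.g. jcmd <driver-pid> GC.heap_dump /tmp/driver.hprof). The listener below is a hypothetical sketch; the class name and logging threshold are illustrative. It can be registered with --conf spark.extraListeners=<fully.qualified.ClassName>.

{code:scala}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical diagnostic listener: counts task-end events as they reach
// the listener bus. If this count lags far behind the tasks the executors
// have finished, the bus (not the executors) is the bottleneck.
class TaskEndCounter extends SparkListener {
  private val count = new AtomicLong(0)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val n = count.incrementAndGet()
    if (n % 10000 == 0) {
      println(s"TaskEndCounter: processed $n task-end events")
    }
  }
}
{code}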