[ https://issues.apache.org/jira/browse/SPARK-34063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Calvin Pietersen updated SPARK-34063:
-------------------------------------
    Description: 
The Spark Streaming application runs at a 60s batch interval.
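
For context, a minimal sketch of the kind of setup described here (the app name, master, source, and processing are placeholders, not the actual application):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative only: the real application's source and logic differ.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .setMaster("local[2]") // for local testing; a receiver needs >= 2 cores

val ssc = new StreamingContext(conf, Seconds(60)) // 60-second batch interval

val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
lines.count().print() // placeholder output operation

ssc.start()
ssc.awaitTermination()
{code}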

The application runs fine, processing each batch in about 40 seconds. After 
~8600 batches (around 6 days), it suddenly hits a wall: processing time jumps 
to 2-2.4 minutes per batch, and the application eventually dies with exit code 
137 (128 + SIGKILL, which typically means the driver was killed by the OS OOM 
killer or the cluster manager for exceeding its memory limit). This happens 
consistently every 6 days, regardless of the data.

Looking at the application logs, it seems that when the issue begins, tasks are 
being completed by executors, but the driver takes a long time to acknowledge 
them. I have taken numerous heap dumps of the driver (before it hits the 6-day 
wall) using *jcmd* and can see that 
*org.apache.spark.scheduler.AsyncEventQueue* keeps growing even though the 
application is keeping up with batches. I have yet to capture a snapshot of the 
application in the broken state.
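
For reference, a heap dump of the driver can be taken with a plain jcmd 
invocation of the form {{jcmd <driver-pid> GC.heap_dump /tmp/driver.hprof}}. 
Below is a minimal sketch of the relevant configuration knob; 
*spark.scheduler.listenerbus.eventqueue.capacity* is an existing Spark setting 
(default 10000), though the value shown is only illustrative. Once a queue 
reaches capacity, *AsyncEventQueue* drops further events and logs an error 
about a slow listener, so sustained heap growth may indicate a listener that 
cannot drain events fast enough.

{code:scala}
import org.apache.spark.SparkConf

// Sketch under assumptions: 10000 is the documented default, not a proposed fix.
// Each listener-bus queue (e.g. appStatus, executorManagement, eventLog, shared)
// is a bounded AsyncEventQueue; when one fills up, new events are dropped and
// Spark logs that a listener is too slow to keep up.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "10000")
{code}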

> Major slowdown in spark streaming after 6 days
> ----------------------------------------------
>
>                 Key: SPARK-34063
>                 URL: https://issues.apache.org/jira/browse/SPARK-34063
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Calvin Pietersen
>            Priority: Major
>         Attachments: normal-job, slow-job
>
>
> The Spark Streaming application runs at a 60s batch interval.
> The application runs fine, processing each batch in about 40 seconds. After 
> ~8600 batches (around 6 days), it suddenly hits a wall: processing time 
> jumps to 2-2.4 minutes per batch, and the application eventually dies with 
> exit code 137. This happens consistently every 6 days, regardless of the data.
> Looking at the application logs, it seems that when the issue begins, tasks 
> are being completed by executors, but the driver takes a long time to 
> acknowledge them. I have taken numerous heap dumps of the driver (before it 
> hits the 6-day wall) using *jcmd* and can see that 
> *org.apache.spark.scheduler.AsyncEventQueue* keeps growing even though the 
> application is keeping up with batches. I have yet to capture a snapshot of 
> the application in the broken state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
