[ 
https://issues.apache.org/jira/browse/SPARK-46702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohamad Haidar updated SPARK-46702:
-----------------------------------
    Description: 
h3. Description:
 * We have a Spark cluster deployed on a Kubernetes (k8s) cluster with one driver and multiple executors (120).
 * The batch duration is configured to 30 seconds.
 * The Spark cluster reads from a 120-partition Kafka topic and writes to an hourly index in Elasticsearch (a job sketch is given after this list).
 * ES has 30 data nodes, with 1 shard per data node for each index.
 * The configuration of the driver StatefulSet (STS) is in the Appendix.
 * The driver is observed restarting periodically. The restarts do not always occur, but when they do, they happen every 10 minutes.
 * The restart frequency increases as the throughput increases.
 * When the restarts happen, we see OptionalDataException; the attached 
“logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log”
 is the log from a run that ended with a driver restart.
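
The following is a minimal sketch of a job with the shape described above (Kafka -> Spark -> hourly Elasticsearch index), assuming Structured Streaming with the elasticsearch-spark connector; the broker address, topic name, ES nodes, index pattern, and checkpoint path are placeholders, not the values from our deployment.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

object StreamingCoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("events-streaming-core-sketch")
      .getOrCreate()

    // 120-partition topic: each micro-batch gets one input partition per Kafka partition.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")  // placeholder
      .option("subscribe", "events-topic")              // placeholder
      .load()
      .select(
        col("value").cast("string").as("payload"),
        col("timestamp").as("ts"))

    // 30-second micro-batches written to an hourly index; the {ts|...} pattern is the
    // elasticsearch-spark dynamic-resource syntax that derives the index name from event time.
    val query = events.writeStream
      .format("es")
      .option("es.nodes", "elasticsearch:9200")             // placeholder
      .option("checkpointLocation", "/checkpoints/events")  // placeholder
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start("events-{ts|yyyy.MM.dd.HH}")

    query.awaitTermination()
  }
}
{code}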

h3. Analysis:
 # We ran a test at 250K records/second, and processing was good, taking between 15 and 20 seconds per batch.
 # We were able to avoid all the restarts simply by disabling the liveness checks.
 # This resulted in NO RESTARTS of Streaming Core. We tried the above with two scenarios (a configuration sketch follows this list):

 * Speculation disabled --> After 10 to 20 minutes the batch duration increased to minutes and processing eventually became very slow. The main error observed during this period is {*}The executor with id 7 exited with exit code 50 (Uncaught exception){*}. Logs at WARN and TRACE level were collected:

 * {*}WARN{*}: Logs attached 
“cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log”
 * {*}TRACE{*}: Logs attached “cveshv-events-streaming-TRACE (2).zip”

 * Speculation enabled --> The batch duration increased to minutes (large lag) only after around 2 hours; the related log is 
“cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log”.
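
For reference, the following is a sketch of how the setting that differs between the two scenarios above is typically toggled; the values shown are illustrative defaults, not our production configuration. Exit code 50 corresponds to Spark's UNCAUGHT_EXCEPTION exit code.

{code:scala}
import org.apache.spark.SparkConf

// Scenario 1: speculation disabled. For scenario 2, set spark.speculation to "true".
val conf = new SparkConf()
  .set("spark.speculation", "false")
  // When speculation is enabled, these control how aggressively slow tasks are re-launched:
  // .set("spark.speculation.interval", "100ms")
  // .set("spark.speculation.quantile", "0.75")
  // .set("spark.speculation.multiplier", "1.5")
{code}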

h3. Conclusion:
 * The liveness check is failing and is thus causing the restarts.
 * The logs indicate that there are some unhandled exceptions in the executors.
 * The issue may lie elsewhere as well; below is the liveness check that was disabled and that was initially causing the restarts every 10 minutes after 3 failed checks.

 
!image-2024-01-12-10-44-45-717.png!
h3. Next Action:
 * Please help us identify the root cause of the issue. We have tried many configurations, with two different Spark versions (3.4 and 3.5), and we are not able to avoid the issue.

 

Appendix:

 

!image-2024-01-12-10-45-18-905.png!

!image-2024-01-12-10-45-30-398.png!

!image-2024-01-12-10-45-40-397.png!

!image-2024-01-12-10-45-50-427.png!



> Spark Cluster Crashing
> ----------------------
>
>                 Key: SPARK-46702
>                 URL: https://issues.apache.org/jira/browse/SPARK-46702
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Docker
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Mohamad Haidar
>            Priority: Blocker
>         Attachments: CV62A4~1.LOG, cveshv-events-streaming-TRACE (2).zip, 
> cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log,
>  image-2024-01-12-10-44-45-717.png, image-2024-01-12-10-45-18-905.png, 
> image-2024-01-12-10-45-30-398.png, image-2024-01-12-10-45-40-397.png, 
> image-2024-01-12-10-45-50-427.png, 
> logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log
>


