[ https://issues.apache.org/jira/browse/SPARK-46702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohamad Haidar updated SPARK-46702:
-----------------------------------
    Description: 
h3. Description:
 # We have a Spark cluster installed on a Kubernetes cluster, with one driver and multiple executors (120).
 # The batch duration is configured to 30 seconds.
 # The Spark cluster reads from a 120-partition Kafka topic and writes to an hourly index in Elasticsearch (see the sketch after this list).
 # Elasticsearch has 30 data nodes, with 1 shard per data node for each index.
 # The configuration of the driver STS (StatefulSet) is in the Appendix.
 # The driver is observed restarting periodically. The restarts do not necessarily occur every 10 minutes, but when they do happen, they recur every 10 minutes.
 # The restart frequency increases as the throughput increases.
 # When the restarts are happening, we see OptionalDataException; the attached “logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log” is the log resulting in a restart of the driver.
h3. Analysis:
 # We ran a test at 250K records/second, and processing was healthy, taking between 15 and 20 seconds per batch.
 # We were able to avoid all the restarts simply by disabling the liveness checks.
 # This resulted in NO RESTARTS of Streaming Core. We tried the above with two scenarios (a configuration sketch of the difference follows this list):

 * Speculation disabled --> After 10 to 20 minutes the batch duration increased to minutes and processing eventually became very slow. The main error logs observed during this period say {*}The executor with id 7 exited with exit code 50 (Uncaught exception).{*} Logs at WARN level and TRACE level were collected:

 * {*}WARN{*}: Logs attached 
“cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log”
 * {*}TRACE{*}: Logs attached “”

 * Speculation enabled --> The batch duration increased to minutes (a large lag) only after around 2 hours. The related logs are “cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log”.
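To make the two scenarios concrete, here is a hypothetical sketch of the speculation settings that differ between them, expressed as SparkSession configuration; the property names are standard Spark settings, but the exact values in our actual deployment may differ.

{code:python}
# Hypothetical sketch: the two Analysis scenarios differ only in task speculation.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("speculation-scenarios")            # placeholder app name
         # Scenario 1: speculation disabled (Spark's default behaviour)
         .config("spark.speculation", "false")
         # Scenario 2: speculation enabled; Spark re-launches tasks that run
         # much slower than the other tasks in the same stage.
         # .config("spark.speculation", "true")
         # .config("spark.speculation.multiplier", "1.5")   # Spark default
         # .config("spark.speculation.quantile", "0.75")    # Spark default
         .getOrCreate())
{code}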

h3. Conclusion:
 * The liveness check is failing and thus causing the restarts.
 * The logs indicate that there are some unhandled exceptions on the executors.
 * The issue may lie somewhere else as well. Below is the liveness check that was disabled; it was initially causing the restarts every 10 minutes, after 3 failed occurrences.

{code:bash}
# Application ID, extracted from the JSON returned by the Spark REST API on the driver.
spark_application_id=$(curl localhost:4040/api/v1/applications | jq '.[0] | ."id"')

# Strip the first and last characters, which are the (") quotes, to get the plain application ID.
spark_application_id_formatted=$(echo $spark_application_id | sed 's/^.//' | sed 's/.$//')

# Number of failed jobs reported by the Spark REST API for this application.
spark_failed_job_queue_length=$(curl "localhost:4040/api/v1/applications/$spark_application_id_formatted/jobs?status=failed" | jq length)

# Exit 0 (success) if no Spark jobs have failed, otherwise exit 1 (failure).
if [ "$spark_failed_job_queue_length" -eq "0" ]; then
    exit 0
else
    exit 1
fi
{code}
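This script is presumably wired up as a Kubernetes exec liveness probe on the driver pod (the 3-failure / 10-minute pattern above would then correspond to the probe's failure threshold and check interval; the exact probe settings are an assumption on our side and are not shown here). Note that {{/jobs?status=failed}} reports all failed jobs retained by the Spark UI for the lifetime of the application, so once a single job has failed the check keeps failing on every probe until the driver restarts.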
h3. Next Action:
 * Please help us identify the root cause (RC) of the issue. We have tried many configurations, with two different Spark versions (3.4 and 3.5), and we are not able to avoid the issue.

 

Appendix:

 

!image-2024-01-12-10-37-50-117.png!

!image-2024-01-12-10-38-22-245.png!

!image-2024-01-12-10-38-15-835.png!

!image-2024-01-12-10-38-34-247.png!

> Spark Cluster Crashing
> ----------------------
>
>                 Key: SPARK-46702
>                 URL: https://issues.apache.org/jira/browse/SPARK-46702
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Docker
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Mohamad Haidar
>            Priority: Blocker
>         Attachments: CV62A4~1.LOG, 
> cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log,
>  
> logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
