Hi guys,

I'm running Spark applications on Kubernetes. According to the Spark
documentation
https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
Spark needs a distributed file system to store its checkpoint data so that, in
case of failure, it can recover from the checkpoint directory.

My question is: can the driver and executors have separate checkpoint
locations? I'm asking because the driver and executor pods may be scheduled on
different nodes, so a shared checkpoint location would require the
ReadWriteMany access mode. Since I only have a storage class that supports
ReadWriteOnce, I'm trying to find a workaround.
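For reference, the shared option I'm trying to avoid would need a PVC roughly
like the sketch below (the storage class name "nfs-client" and the size are
just placeholders for an RWX-capable class, which I don't have):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: checkpoint-shared-pvc
  spec:
    accessModes:
      - ReadWriteMany   # needed so driver and executor pods on different nodes can mount it
    storageClassName: nfs-client
    resources:
      requests:
        storage: 5Gi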


The Spark Streaming guide mentions "Failed driver can be restarted from
checkpoint information", and that when an executor fails, "Tasks and receivers
restarted by Spark automatically, no config needed".

Given this, I tried configuring a checkpoint location only for the driver pod.
It immediately failed with the exception below:

2020-05-15T14:20:17.142 [stream execution thread for baseline_windower [id =
b14190ed-0fb2-4d0e-82d3-b3a3bf101712, runId =
f246774e-5a20-4bfb-b997-4d06c344bb0f]] ERROR
org.apache.spark.sql.execution.streaming.MicroBatchExecution - Query
baseline_windower [id = b14190ed-0fb2-4d0e-82d3-b3a3bf101712, runId =
f246774e-5a20-4bfb-b997-4d06c344bb0f] terminated with error
org.apache.spark.SparkException: Writing job aborted.
        at
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:92)
        at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
...
        at
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 11.0 failed 4 times, most recent failure: Lost task
0.3 in stage 11.0 (TID 420, 10.233.124.162, executor 1):
java.io.IOException: mkdir of
file:/opt/window-data/baseline_windower/state/0/0 failed


So I tried giving separate checkpoint locations to the driver and executors.

In the Spark application Helm chart I have a checkpoint location configuration:

spec:
   sparkConf:
     "spark.sql.streaming.checkpointLocation": "file:///opt/checkpoint-data"

I created two checkpoint PVCs (sketched below, after the mounts) and mounted
one volume each for the driver and executor pods:

  volumes:
    - name: checkpoint-driver-volume
      persistentVolumeClaim:
        claimName: checkpoint-driver-pvc
    - name: checkpoint-executor-volume
      persistentVolumeClaim:
        claimName: checkpoint-executor-pvc

driver:
    volumeMounts:
      - name: checkpoint-driver-volume
        mountPath: "/opt/checkpoint-data"
...
executor:
    volumeMounts:
      - name: checkpoint-executor-volume
        mountPath: "/opt/checkpoint-data"
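The two PVCs themselves are plain ReadWriteOnce claims, roughly like this
(storage class name and size are placeholders for my setup):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: checkpoint-driver-pvc
  spec:
    accessModes:
      - ReadWriteOnce   # my storage class only supports RWO
    storageClassName: standard   # placeholder
    resources:
      requests:
        storage: 5Gi

  # checkpoint-executor-pvc is identical apart from the name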

After deployment it seems to be working. I tried restarting the driver pod and
it did recover from the checkpoint directory. But I'm not sure whether this is
actually supported by design.

Thanks,
wzhan


