HeartSaVioR edited a comment on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846329353


   I'm not sure about the scenario of using a PVC as the checkpoint location; 
to me, at least, that sounds beyond what checkpointing in Structured Streaming 
supports.
   
   The Structured Streaming guide clearly describes the requirement for the 
checkpoint location, as follows:
   
   > Checkpoint location: For some output sinks where the end-to-end 
fault-tolerance can be guaranteed, specify the location where the system will 
write all the checkpoint information. This should be a directory in an 
HDFS-compatible fault-tolerant file system. The semantics of checkpointing is 
discussed in more detail in the next section.
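   
   For reference, a minimal PySpark sketch of how that requirement shows up in 
user code. The helper name `start_with_checkpoint` and its arguments are 
hypothetical and only for illustration; the `writeStream`, `format`, `option` 
and `start` calls are the regular DataStreamWriter API.

```python
def start_with_checkpoint(df, checkpoint_dir, output_dir):
    """Start a file-sink streaming query whose checkpoint lives in checkpoint_dir.

    df is assumed to be a streaming DataFrame. checkpoint_dir is assumed to
    point at a remote, fault-tolerant, HDFS-compatible file system, as the
    guide requires.
    """
    return (
        df.writeStream
        .format("parquet")
        .option("path", output_dir)
        # All checkpoint information (offsets, commits, state) goes here.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

   A call would look like `start_with_checkpoint(df, "hdfs://namenode/ckpt/q1", 
"hdfs://namenode/out/q1")`, where both paths are made-up examples.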
   
   I know we allow custom checkpoint manager implementations to deal with 
non-HDFS-compatible file systems (like object stores, which don't provide 
"atomic rename"), but they still deal with a "remote", "fault-tolerant" file 
system, and they don't require the Spark scheduler to schedule a specific task 
on a specific executor based on the availability of the checkpoint.
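   
   To make the "atomic rename" point concrete, here is a minimal sketch of the 
write-to-temp-then-rename pattern that such file systems make possible. This is 
not Spark's checkpoint manager code; the function name `commit_checkpoint` and 
the JSON layout are assumptions for illustration, and it runs against a local 
directory rather than a remote file system.

```python
import json
import os
import tempfile


def commit_checkpoint(checkpoint_dir: str, batch_id: int, metadata: dict) -> str:
    """Write one batch's metadata so readers see the whole file or nothing."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, str(batch_id))

    # Write to a temporary file in the same directory, so the rename below
    # stays within a single file system.
    fd, tmp_path = tempfile.mkstemp(
        dir=checkpoint_dir, prefix=f".{batch_id}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metadata, f)
            f.flush()
            os.fsync(f.fileno())
        # The rename is the commit point: the final name appears all at once.
        # Stores without an atomic rename cannot offer this step directly.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temporary file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

   Everything here stays inside the writer; nothing about it needs the 
scheduler to know where the file physically lives.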
   
   In other words, only the checkpoint manager handles the complexity of 
checkpointing on the file system, and nothing else does. It sounds like that 
would no longer hold true if we want to support PVC-based checkpoints. Please 
correct me if I'm missing something.
   
   I'm fairly new to cloud/k8s, but from common sense I'd guess the actual 
storage behind a PVC would still have to be some sort of network storage in 
order to be resilient to "node down". I'm wondering how much benefit the PVC 
approach gives compared to the existing approach of directly using a remote 
fault-tolerant file system. The benefits should be clear enough to justify the 
additional complexity.

