viirya commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846696394


   > My major point is about the characteristic of the checkpoint location.
   > 
   > We require checkpoint location to be "fault-tolerant" including hardware 
failures (local storage doesn't make sense here), and provide "high 
availability" by itself so that Spark can delegate such complexity to the 
checkpoint location. For sure, such requirement leads underlying file system to 
be heavy and non-trivial to maintain, but IMHO that's not an enough reason to 
take the complexity back to Spark, because:
   
   I think one major issue users face is that they have no choice. As for 
"fault-tolerant": if we treat a PVC as an abstraction over storage, it can be 
fault-tolerant whenever the storage class supports that feature, and such 
storage classes do exist. Again, this is about user choice: users can choose 
among different storage classes for a PVC. How often can a fault occur, and how 
serious would it be for the streaming app? Not to mention that K8S also 
supports volume snapshots. Users can choose a storage class with weaker or 
stronger guarantees to meet their requirements.
   
   For example, for a streaming app where a fault is not too serious an issue, 
might local storage plus occasional snapshots, or local storage with RAID, be 
good enough?
   
   In industry, it is sometimes not easy to get whatever file system the users 
want to use, e.g. object stores in Azure or GCS or others. Adopting any backend 
file system requires organizational change, hiring, system engineering team 
support, policy changes, etc.
   
   > I'd interpret the reasons as two folds:
   > 
   > A. Majority of real-world workloads are working well with current 
technology
   > B. Some workloads don't work well, but no strong demand on this as 
possible issues are tolerable
   
   I don't want to guess here, but there may be another possibility: they have 
moved to another streaming engine that supports their workloads easily.
   
   > I'd be happy to see the overall system design and the result of POC. Let's 
continue the talk about PVC once we get the details.
   
   Sure. This stage-level scheduling is a different direction from my original 
proposal. I need to take some time to revise it. I will post updates 
elsewhere, e.g. in a new JIRA.
   
   
   

