Creds on AWS are typically resolved through roles assigned to K8s pods (for example with KIAM [1]).
[1] https://github.com/uswitch/kiam On Tue, Feb 25, 2020 at 3:36 AM Yang Wang <danrtsey...@gmail.com> wrote: > Hi M Singh, > > > Mans - If we use the session based deployment option for K8 - I thought >> K8 will automatically restarts any failed TM or JM. >> In the case of failed TM - the job will probably recover, but in the case >> of failed JM - perhaps we need to resubmit all jobs. >> Let me know if I have misunderstood anything. > > > Since you are starting JM/TM with K8s deployment, when they failed new > JM/TM will be created. If you do not set the high > availability configuration, your jobs could recover when TM failed. > However, they could not recover when JM failed. With HA > configured, the jobs could always be recovered and you do not need to > re-submit again. > > > Mans - Is there any safe way of a passing creds ? > > > Yes, you are right, Using configmap to pass the credentials is not safe. > On K8s, i think you could use secrets instead[1]. > > > Mans - Does a task manager failure cause the job to fail ? My >> understanding is the JM failure are catastrophic while TM failures are >> recoverable. > > > What i mean is the job failed, and it could be restarted by your > configured restart strategy[2]. > > > Mans - So if we are saving checkpoint in S3 then there is no need for >> disks - should we use emptyDir ? > > > Yes, if you are saving the checkpoint in S3 and also set the > `high-availability.storageDir` to S3. Then you do not need persistent > volume. Since > the local directory is only used for local cache, so you could directly > use the overlay filesystem or empryDir(better io performance). > > > [1]. > https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/ > [2]. > https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#fault-tolerance > > M Singh <mans2si...@yahoo.com> 于2020年2月25日周二 上午5:53写道: > >> Thanks Wang for your detailed answers. >> >> From what I understand the native_kubernetes also leans towards creating >> a session and submitting a job to it. >> >> Regarding other recommendations, please my inline comments and advice. >> >> On Sunday, February 23, 2020, 10:01:10 PM EST, Yang Wang < >> danrtsey...@gmail.com> wrote: >> >> >> Hi Singh, >> >> Glad to hear that you are looking to run Flink on the Kubernetes. I am >> trying to answer your question based on my limited knowledge and >> others could correct me and add some more supplements. >> >> I think the biggest difference between session cluster and per-job cluster >> on Kubernetesis the isolation. Since for per-job, a dedicated Flink >> cluster >> will be started for the only one job and no any other jobs could be >> submitted. >> Once the job is finished, then the Flink cluster will be >> destroyed immediately. >> The second point is one-step submission. You do not need to start a Flink >> cluster first and then submit a job to the existing session. >> >> > Are there any benefits with regards to >> 1. Configuring the jobs >> No matter you are using the per-job cluster or submitting to the existing >> session cluster, they share the configuration mechanism. You do not have >> to change any codes and configurations. >> >> 2. Scaling the taskmanager >> Since you are using the Standalone cluster on Kubernetes, it do not >> provide >> an active resourcemanager. You need to use external tools to monitor and >> scale up the taskmanagers. The active integration is still evolving and >> you >> could have a taste[1]. >> >> Mans - If we use the session based deployment option for K8 - I thought >> K8 will automatically restarts any failed TM or JM. >> In the case of failed TM - the job will probably recover, but in the case >> of failed JM - perhaps we need to resubmit all jobs. >> Let me know if I have misunderstood anything. >> >> 3. Restarting jobs >> For the session cluster, you could directly cancel the job and re-submit. >> And >> for per-job cluster, when the job is canceled, you need to start a new >> per-job >> cluster from the latest savepoint. >> >> 4. Managing the flink jobs >> The rest api and flink command line could be used to managing the >> jobs(e.g. >> flink cancel, etc.). I think there is no difference for session and >> per-job here. >> >> 5. Passing credentials (in case of AWS, etc) >> I am not sure how do you provide your credentials. If you put them in >> the >> config map and then mount into the jobmanager/taskmanager pod, then both >> session and per-job could support this way. >> >> Mans - Is there any safe way of a passing creds ? >> >> 6. Fault tolerence and recovery of jobs from failure >> For session cluster, if one taskmanager crashed, then all the jobs which >> have tasks >> on this taskmanager will failed. >> Both session and per-job could be configured with high availability and >> recover >> from the latest checkpoint. >> >> Mans - Does a task manager failure cause the job to fail ? My >> understanding is the JM failure are catastrophic while TM failures are >> recoverable. >> >> > Is there any need for specifying volume for the pods? >> No, you do not need to specify a volume for pod. All the data in the pod >> local directory is temporary. When a pod crashed and relaunched, the >> taskmanager will retrieve the checkpoint from zookeeper + S3 and resume >> from the latest checkpoint. >> >> Mans - So if we are saving checkpoint in S3 then there is no need for >> disks - should we use emptyDir ? >> >> >> [1]. >> https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/native_kubernetes.html >> >> M Singh <mans2si...@yahoo.com> 于2020年2月23日周日 上午2:28写道: >> >> Hey Folks: >> >> I am trying to figure out the options for running Flink on Kubernetes and >> am trying to find out the pros and cons of running in Flink Session vs >> Flink Cluster mode ( >> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#flink-session-cluster-on-kubernetes >> ). >> >> I understand that in job mode there is no need to submit the job since it >> is part of the job image. But what are other the pros and cons of this >> approach vs session mode where a job manager is deployed and flink jobs can >> be submitted it ? Are there any benefits with regards to: >> >> 1. Configuring the jobs >> 2. Scaling the taskmanager >> 3. Restarting jobs >> 4. Managing the flink jobs >> 5. Passing credentials (in case of AWS, etc) >> 6. Fault tolerence and recovery of jobs from failure >> >> Also, we will be keeping the checkpoints for the jobs on S3. Is there >> any need for specifying volume for the pods ? If volume is required do we >> need provisioned volume and what are the recommended >> alternatives/considerations especially with AWS. >> >> If there are any other considerations, please let me know. >> >> Thanks for your advice. >> >> >> >> >>