surahman edited a comment on pull request #3741: URL: https://github.com/apache/incubator-heron/pull/3741#issuecomment-984255367
### This PR combines all the functionality to customize the Heron execution environment in Kubernetes. The documentation at this point in the PR can be found [here](https://github.com/apache/incubator-heron/blob/ae2a6606e51e2d122d0df819fe29d3ca4077d22c/website2/docs/schedulers-k8s-execution-environment.md). - [x] Separation of the Heron Manager and Executors: - Two `StatefulSet`s per topology. - `Manager` named `[topology-name]-manager` with a single replica. - `Executor` named `[topology-name]-executors` with multiple replicas. - Both `StatefulSet`s, the headless `Service`, and all `Persistent Volume Claims` which are generated for the topology are removed on termination. - `restart` will restart the `Manager` and all `Executor`s for a topology. - `addContainers` will only add to the replica count for `Executor`s. - `removeContainers` will only decrement the replica count for `Executor`s. - `patchStatefulSetReplicas` will only patch the `StatefulSet` for `Executor`s. - [x] Loading of Pod Templates via `heron.kubernetes.[executor | manager].pod.template=[ConfigMap].[Pod Template Name]`. - [x] Configure Persistent Volumes via `heron.kubernetes.[executor | manager].volumes.persistentVolumeClaim.[Volume Name].[Option]=[Value]` - [x] Configure resources for the manager and/or executor via `heron.kubernetes.[executor | manager].[limits | requests].[Option]=[Value]` I have completed some deployment testing and this PR is now available for review and broader testing. <details><summary>Submit command</summary> ```bash ~/bin/heron submit kubernetes ~/.heron/examples/heron-api-examples.jar \ org.apache.heron.examples.api.AckingTopology acking \ --verbose \ --config-property heron.kubernetes.executor.pod.template=pod-templ-executor.pod-template-executor.yaml \ --config-property heron.kubernetes.manager.pod.template=pod-templ-manager.pod-template-manager.yaml \ --config-property heron.kubernetes.manager.limits.cpu=2 \ --config-property heron.kubernetes.manager.limits.memory=3 \ --config-property heron.kubernetes.manager.requests.cpu=1 \ --config-property heron.kubernetes.manager.requests.memory=2 \ --config-property heron.kubernetes.executor.limits.cpu=5 \ --config-property heron.kubernetes.executor.limits.memory=6 \ --config-property heron.kubernetes.executor.requests.cpu=2 \ --config-property heron.kubernetes.executor.requests.memory=1 \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.claimName=OnDemand \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.sizeLimit=256Gi \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.volumeMode=Block \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.path=path/to/mount/dynamic/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-dynamic-volume.subPath=sub/path/to/mount/dynamic/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.claimName=OnDemand \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.storageClassName=storage-class-name \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.sizeLimit=512Gi \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.volumeMode=Block \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.path=path/to/mount/static/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-static-volume.subPath=sub/path/to/mount/static/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.claimName=requested-claim-by-user \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.path=path/to/mount/shared/volume \ --config-property heron.kubernetes.executor.volumes.persistentVolumeClaim.executor-shared-volume.subPath=sub/path/to/mount/shared/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.claimName=OnDemand \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.sizeLimit=256Gi \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.volumeMode=Block \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.path=path/to/mount/dynamic/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-dynamic-volume.subPath=sub/path/to/mount/dynamic/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.claimName=OnDemand \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.storageClassName=storage-class-name \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.accessModes=ReadWriteOnce,ReadOnlyMany \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.sizeLimit=512Gi \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.volumeMode=Block \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.path=path/to/mount/static/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-static-volume.subPath=sub/path/to/mount/static/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.claimName=requested-claim-by-user \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.path=path/to/mount/shared/volume \ --config-property heron.kubernetes.manager.volumes.persistentVolumeClaim.manager-shared-volume.subPath=sub/path/to/mount/shared/volume ``` </details> <details><summary>Manager StatefulSet</summary> ```yaml apiVersion: apps/v1 kind: StatefulSet metadata: creationTimestamp: "2021-12-02T00:12:20Z" generation: 1 labels: app: heron topology: acking name: acking-manager namespace: default resourceVersion: "1216" uid: c823bb62-c798-46e2-8f7c-ec7f66a663ac spec: podManagementPolicy: Parallel replicas: 1 revisionHistoryLimit: 10 selector: matchLabels: app: heron topology: acking serviceName: acking template: metadata: annotations: prometheus.io/port: "8080" prometheus.io/scrape: "true" creationTimestamp: null labels: app: heron topology: acking spec: containers: - command: - sh - -c - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-1634139749345622293.tar.gz . && SHARD_ID=${POD_NAME##*-} && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking92ff5e65-2f7c-42c1-b8f3-aa3d9e3847d6 --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009' env: - name: HOST valueFrom: fieldRef: apiVersion: v1 fieldPath: status.podIP - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: var_one_manager value: variable one on manager - name: var_three_manager value: variable three on manager - name: var_two_manager value: variable two on manager image: apache/heron:testbuild imagePullPolicy: IfNotPresent name: manager ports: - containerPort: 6001 name: server protocol: TCP - containerPort: 6002 name: tmanager-ctl protocol: TCP - containerPort: 6003 name: tmanager-stats protocol: TCP - containerPort: 6004 name: shell-port protocol: TCP - containerPort: 6005 name: metrics-mgr protocol: TCP - containerPort: 6006 name: scheduler protocol: TCP - containerPort: 6007 name: metrics-cache-m protocol: TCP - containerPort: 6008 name: metrics-cache-s protocol: TCP - containerPort: 6009 name: ckptmgr protocol: TCP - containerPort: 7775 name: tcp-port-kept protocol: TCP - containerPort: 7776 name: udp-port-kept protocol: UDP resources: limits: cpu: "2" memory: 3Mi requests: cpu: "1" memory: 2Mi securityContext: allowPrivilegeEscalation: false terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: path/to/mount/dynamic/volume name: manager-dynamic-volume subPath: sub/path/to/mount/dynamic/volume - mountPath: path/to/mount/shared/volume name: manager-shared-volume subPath: sub/path/to/mount/shared/volume - mountPath: path/to/mount/static/volume name: manager-static-volume subPath: sub/path/to/mount/static/volume - mountPath: /shared_volume/manager name: shared-volume-manager - image: alpine imagePullPolicy: Always name: manager-sidecar-container resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /shared_volume/manager name: shared-volume-manager dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 0 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 10 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 10 volumes: - name: manager-shared-volume persistentVolumeClaim: claimName: requested-claim-by-user - emptyDir: {} name: shared-volume-manager updateStrategy: rollingUpdate: partition: 0 type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: manager-static-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 512Gi storageClassName: storage-class-name volumeMode: Block status: phase: Pending - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: manager-dynamic-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 256Gi volumeMode: Block status: phase: Pending status: collisionCount: 0 currentReplicas: 1 currentRevision: acking-manager-7596cff587 observedGeneration: 1 replicas: 1 updateRevision: acking-manager-7596cff587 updatedReplicas: 1 ``` </details> <details><summary>Executor StatefulSet</summary> ```yaml apiVersion: apps/v1 kind: StatefulSet metadata: creationTimestamp: "2021-12-02T00:12:20Z" generation: 1 labels: app: heron topology: acking name: acking-executors namespace: default resourceVersion: "1211" uid: 3ec133e2-591e-4864-b054-478021b8062d spec: podManagementPolicy: Parallel replicas: 2 revisionHistoryLimit: 10 selector: matchLabels: app: heron topology: acking serviceName: acking template: metadata: annotations: prometheus.io/port: "8080" prometheus.io/scrape: "true" creationTimestamp: null labels: app: heron topology: acking spec: containers: - command: - sh - -c - './heron-core/bin/heron-downloader-config kubernetes && ./heron-core/bin/heron-downloader distributedlog://zookeeper:2181/heronbkdl/acking-saad-tag-0-1634139749345622293.tar.gz . && SHARD_ID=$((${POD_NAME##*-} + 1)) && echo shardId=${SHARD_ID} && ./heron-core/bin/heron-executor --topology-name=acking --topology-id=acking92ff5e65-2f7c-42c1-b8f3-aa3d9e3847d6 --topology-defn-file=acking.defn --state-manager-connection=zookeeper:2181 --state-manager-root=/heron --state-manager-config-file=./heron-conf/statemgr.yaml --tmanager-binary=./heron-core/bin/heron-tmanager --stmgr-binary=./heron-core/bin/heron-stmgr --metrics-manager-classpath=./heron-core/lib/metricsmgr/* --instance-jvm-opts="LVhYOitIZWFwRHVtcE9uT3V0T2ZNZW1vcnlFcnJvcg(61)(61)" --classpath=heron-api-examples.jar --heron-internals-config-file=./heron-conf/heron_internals.yaml --override-config-file=./heron-conf/override.yaml --component-ram-map=exclaim1:1073741824,word:1073741824 --component-jvm-opts="" --pkg-type=jar --topology-binary-file=heron-api-examples.jar --heron-java-home=$JAVA_HOME --heron-shell-binary=./heron-core/bin/heron-shell --cluster=kubernetes --role=saad --environment=default --instance-classpath=./heron-core/lib/instance/* --metrics-sinks-config-file=./heron-conf/metrics_sinks.yaml --scheduler-classpath=./heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/* --python-instance-binary=./heron-core/bin/heron-python-instance --cpp-instance-binary=./heron-core/bin/heron-cpp-instance --metricscache-manager-classpath=./heron-core/lib/metricscachemgr/* --metricscache-manager-mode=disabled --is-stateful=false --checkpoint-manager-classpath=./heron-core/lib/ckptmgr/*:./heron-core/lib/statefulstorage/*: --stateful-config-file=./heron-conf/stateful.yaml --checkpoint-manager-ram=1073741824 --health-manager-mode=disabled --health-manager-classpath=./heron-core/lib/healthmgr/* --shard=$SHARD_ID --server-port=6001 --tmanager-controller-port=6002 --tmanager-stats-port=6003 --shell-port=6004 --metrics-manager-port=6005 --scheduler-port=6006 --metricscache-manager-server-port=6007 --metricscache-manager-stats-port=6008 --checkpoint-manager-port=6009' env: - name: HOST valueFrom: fieldRef: apiVersion: v1 fieldPath: status.podIP - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: var_one value: variable one - name: var_three value: variable three - name: var_two value: variable two image: apache/heron:testbuild imagePullPolicy: IfNotPresent name: executor ports: - containerPort: 5555 name: tcp-port-kept protocol: TCP - containerPort: 5556 name: udp-port-kept protocol: UDP - containerPort: 6001 name: server protocol: TCP - containerPort: 6002 name: tmanager-ctl protocol: TCP - containerPort: 6003 name: tmanager-stats protocol: TCP - containerPort: 6004 name: shell-port protocol: TCP - containerPort: 6005 name: metrics-mgr protocol: TCP - containerPort: 6006 name: scheduler protocol: TCP - containerPort: 6007 name: metrics-cache-m protocol: TCP - containerPort: 6008 name: metrics-cache-s protocol: TCP - containerPort: 6009 name: ckptmgr protocol: TCP resources: limits: cpu: "5" memory: 6Mi requests: cpu: "2" memory: 1Mi securityContext: allowPrivilegeEscalation: false terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: path/to/mount/dynamic/volume name: executor-dynamic-volume subPath: sub/path/to/mount/dynamic/volume - mountPath: path/to/mount/shared/volume name: executor-shared-volume subPath: sub/path/to/mount/shared/volume - mountPath: path/to/mount/static/volume name: executor-static-volume subPath: sub/path/to/mount/static/volume - mountPath: /shared_volume name: shared-volume - image: alpine imagePullPolicy: Always name: sidecar-container resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /shared_volume name: shared-volume dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 0 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 10 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 10 volumes: - name: executor-shared-volume persistentVolumeClaim: claimName: requested-claim-by-user - emptyDir: {} name: shared-volume updateStrategy: rollingUpdate: partition: 0 type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: executor-dynamic-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 256Gi volumeMode: Block status: phase: Pending - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null labels: onDemand: "true" topology: acking name: executor-static-volume spec: accessModes: - ReadWriteOnce - ReadOnlyMany resources: requests: storage: 512Gi storageClassName: storage-class-name volumeMode: Block status: phase: Pending status: collisionCount: 0 currentReplicas: 2 currentRevision: acking-executors-68f9654bd9 observedGeneration: 1 replicas: 2 updateRevision: acking-executors-68f9654bd9 updatedReplicas: 2 ``` </details> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
