Elias, I would also love to be able to deploy Samza on Kubernetes with dynamic task management. Thanks for sharing this. It may be a good interim solution.
Roger

On Sun, Nov 29, 2015 at 11:18 AM, Elias Levy <fearsome.lucid...@gmail.com> wrote:

> I've been exploring Samza for stream processing as well as Kubernetes as a
> container orchestration system, and I wanted to be able to use one with the
> other. The prospect of having to run YARN either alongside or on top of
> Kubernetes did not appeal to me, so I developed a KubernetesJob
> implementation of SamzaJob.
>
> You can find the details at https://github.com/eliaslevy/samza_kubernetes,
> but in summary KubernetesJob executes and generates a serialized JobModel.
> Instead of interacting with Kubernetes directly to create the
> SamzaContainers (as the YarnJob's SamzaApplicationMaster may do with the
> YARN RM), it outputs a config YAML file that can be used to create the
> SamzaContainers in Kubernetes by using Replication Controllers. For this you
> need to package your job as a Docker image. See the README at the above
> repo for details.
>
> A few observations:
>
> It would be useful if SamzaContainer accepted the JobModel via an
> environment variable. Right now it expects a URL to download it from. I
> get around this by using an entry point script that copies the model from
> an environment variable into a file, then passes a file URL to
> SamzaContainer.
>
> SamzaContainer doesn't allow you to configure the JMX port. It selects a
> port at random from the ephemeral range, as it expects to execute in YARN,
> where a static port could result in a conflict. This is not the case in
> Kubernetes, where each Pod (i.e. SamzaContainer) is given its own IP
> address.
>
> This implementation doesn't provide a Samza dashboard, which in the YARN
> implementation is hosted in the Application Master. There didn't seem to
> be much value provided by the dashboard that is not already provided by the
> Kubernetes tools for monitoring pods.
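The entry point workaround described above (copy the model out of an environment variable into a file, then hand SamzaContainer a file URL) could look roughly like the sketch below. It is only an illustration: the variable name SAMZA_JOB_MODEL, the file name job-model.json, and the helper write_model_from_env are all hypothetical, and the actual script in the repo may work differently.

```python
import os
from pathlib import Path

# Hypothetical variable name; the original post does not say which one is used.
MODEL_ENV_VAR = "SAMZA_JOB_MODEL"


def write_model_from_env(target_dir, env=os.environ):
    """Copy the serialized JobModel from the environment into a file.

    Returns a file:// URL pointing at the written file, which an entry
    point script could then pass to the container as the model location.
    """
    model = env.get(MODEL_ENV_VAR)
    if model is None:
        raise RuntimeError(f"{MODEL_ENV_VAR} is not set")
    # as_uri() requires an absolute path, so resolve the directory first.
    path = Path(target_dir).resolve() / "job-model.json"  # hypothetical file name
    path.write_text(model, encoding="utf-8")
    return path.as_uri()
```

An entry point would call this once at startup and then launch the container process with the returned URL; how that URL is passed to SamzaContainer is not shown here.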
>
> I've successfully executed the hello-samza jobs in Kubernetes:
>
> $ kubectl get po
> NAME                       READY   STATUS    RESTARTS   AGE
> kafka-1-jjh8n              1/1     Running   0          2d
> kafka-2-buycp              1/1     Running   0          2d
> kafka-3-tghkp              1/1     Running   0          2d
> wikipedia-feed-0-4its2     1/1     Running   0          1d
> wikipedia-parser-0-l0onv   1/1     Running   0          17h
> wikipedia-parser-1-crrxh   1/1     Running   0          17h
> wikipedia-parser-2-1c5nn   1/1     Running   0          17h
> wikipedia-stats-0-3gaiu    1/1     Running   0          16h
> wikipedia-stats-1-j5qlk    1/1     Running   0          16h
> wikipedia-stats-2-2laos    1/1     Running   0          16h
> zookeeper-1-1sb4a          1/1     Running   0          2d
> zookeeper-2-dndk7          1/1     Running   0          2d
> zookeeper-3-46n09          1/1     Running   0          2d
>
> Finally, accessing services within the Kubernetes cluster from the outside
> is quite cumbersome unless one uses an external load balancer. This makes
> it difficult to bootstrap a job, as SamzaJob must connect to Zookeeper and
> Kafka to find out the number of partitions on the topics it will subscribe
> to, so it can assign them statically among the number of containers
> requested.
>
> Ideally Samza would operate along the lines of the Kafka high-level
> consumer, whose instances dynamically coordinate to allocate work among
> the members of a consumer group. This would do away with the need to
> execute SamzaJob a priori to generate the JobModel to pass to the
> SamzaContainers. It would also allow for dynamically changing the number
> of containers without having to shut down the job.
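To make the static assignment step concrete, here is a minimal sketch of dividing a known number of partitions among a fixed number of containers. It assumes a simple round-robin scheme and a hypothetical helper name, assign_partitions; it is not Samza's actual grouping logic, only an illustration of why the partition count has to be known before the containers start.

```python
def assign_partitions(num_partitions, num_containers):
    """Statically spread partition ids over container ids, round-robin.

    Both counts must be known up front, which is why the job has to
    query Kafka for the partition count before containers are created.
    """
    if num_containers <= 0:
        raise ValueError("num_containers must be positive")
    assignment = {container: [] for container in range(num_containers)}
    for partition in range(num_partitions):
        assignment[partition % num_containers].append(partition)
    return assignment
```

With 5 partitions and 2 containers, container 0 gets partitions 0, 2 and 4, and container 1 gets partitions 1 and 3. Changing the container count means recomputing the whole mapping, which is the restart cost the dynamic, consumer-group style of coordination would avoid.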