Awesome. Thanks.

On Sun, Nov 29, 2015 at 3:25 PM, Elias Levy <fearsome.lucid...@gmail.com> wrote:
> Roger,
>
> You are welcome. If you want to experiment, you can use my hello samza
> <https://hub.docker.com/r/elevy/hello-samza/> Docker image.
>
> On Sun, Nov 29, 2015 at 12:19 PM, Roger Hoover <roger.hoo...@gmail.com>
> wrote:
>
> > Elias,
> >
> > I would also love to be able to deploy Samza on Kubernetes with dynamic
> > task management. Thanks for sharing this. It may be a good interim
> > solution.
> >
> > Roger
> >
> > On Sun, Nov 29, 2015 at 11:18 AM, Elias Levy
> > <fearsome.lucid...@gmail.com> wrote:
> >
> > > I've been exploring Samza for stream processing as well as Kubernetes
> > > as a container orchestration system, and I wanted to be able to use
> > > one with the other. The prospect of having to execute YARN either
> > > alongside or on top of Kubernetes did not appeal to me, so I
> > > developed a KubernetesJob implementation of SamzaJob.
> > >
> > > You can find the details at
> > > https://github.com/eliaslevy/samza_kubernetes, but in summary
> > > KubernetesJob executes and generates a serialized JobModel. Instead
> > > of interacting with Kubernetes directly to create the SamzaContainers
> > > (as the YarnJob's SamzaApplicationMaster may do with the YARN RM), it
> > > outputs a config YAML file that can be used to create the
> > > SamzaContainers in Kubernetes by using Resource Controllers. For this
> > > you need to package your job as a Docker image. You can read the
> > > README at the above repo for details.
> > >
> > > A few observations:
> > >
> > > It would be useful if SamzaContainer accepted the JobModel via an
> > > environment variable. Right now it expects a URL to download it from.
> > > I get around this by using an entry point script that copies the
> > > model from an environment variable into a file, then passes a file
> > > URL to SamzaContainer.
> > >
> > > SamzaContainer doesn't allow you to configure the JMX port. It
> > > selects a port at random from the ephemeral range, as it expects to
> > > execute in YARN, where a static port could result in a conflict. This
> > > is not the case in Kubernetes, where each Pod (i.e. SamzaContainer)
> > > is given its own IP address.
> > >
> > > This implementation doesn't provide a Samza dashboard, which in the
> > > YARN implementation is hosted in the Application Master. There didn't
> > > seem to be much value provided by the dashboard that is not already
> > > provided by the Kubernetes tools for monitoring pods.
> > >
> > > I've successfully executed the hello-samza jobs in Kubernetes:
> > >
> > > $ kubectl get po
> > > NAME                       READY   STATUS    RESTARTS   AGE
> > > kafka-1-jjh8n              1/1     Running   0          2d
> > > kafka-2-buycp              1/1     Running   0          2d
> > > kafka-3-tghkp              1/1     Running   0          2d
> > > wikipedia-feed-0-4its2     1/1     Running   0          1d
> > > wikipedia-parser-0-l0onv   1/1     Running   0          17h
> > > wikipedia-parser-1-crrxh   1/1     Running   0          17h
> > > wikipedia-parser-2-1c5nn   1/1     Running   0          17h
> > > wikipedia-stats-0-3gaiu    1/1     Running   0          16h
> > > wikipedia-stats-1-j5qlk    1/1     Running   0          16h
> > > wikipedia-stats-2-2laos    1/1     Running   0          16h
> > > zookeeper-1-1sb4a          1/1     Running   0          2d
> > > zookeeper-2-dndk7          1/1     Running   0          2d
> > > zookeeper-3-46n09          1/1     Running   0          2d
> > >
> > > Finally, accessing services within the Kubernetes cluster from the
> > > outside is quite cumbersome unless one uses an external load
> > > balancer. This makes it difficult to bootstrap a job, as SamzaJob
> > > must connect to Zookeeper and Kafka to find out the number of
> > > partitions on the topics it will subscribe to, so it can assign them
> > > statically among the number of containers requested.
> > >
> > > Ideally Samza would operate along the lines of the Kafka high-level
> > > consumer, which dynamically coordinates to allocate work among
> > > members of a consumer group. This would do away with the need to
> > > execute SamzaJob a priori to generate the JobModel to pass to the
> > > SamzaContainers. It would also allow for dynamically changing the
> > > number of containers without having to shut down the job.
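
For anyone who wants to try the entry point workaround Elias describes above (copy the JobModel from an environment variable into a file, then hand SamzaContainer a file URL), here is a minimal sketch of that step in Python. It is not the script from the samza_kubernetes repo. The variable name SAMZA_JOB_MODEL, the file name job-model.json, and the function name are all made up for illustration, and the final step of launching SamzaContainer with the resulting URL is left out because the thread does not say how the URL is passed.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical variable name; the thread does not say which one the image uses.
MODEL_ENV_VAR = "SAMZA_JOB_MODEL"


def job_model_to_file_url(environ=os.environ, env_var=MODEL_ENV_VAR, directory=None):
    """Write the serialized JobModel held in `env_var` to a file and
    return a file:// URL that points at that file."""
    model = environ.get(env_var)
    if not model:
        raise RuntimeError(f"{env_var} is not set or is empty")

    # Path.as_uri() only accepts absolute paths, so resolve the directory.
    target_dir = Path(directory or tempfile.mkdtemp(prefix="samza-")).resolve()
    target = target_dir / "job-model.json"  # file name is an assumption

    # The model is written verbatim; nothing here parses or validates it.
    target.write_text(model, encoding="utf-8")
    return target.as_uri()
```

A real entry point would call job_model_to_file_url() with no arguments, so that it reads the pod's environment, and then start SamzaContainer with the returned URL in place of the usual download URL.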