Hey Elias-

This is awesome work. Would you be interested in opening JIRAs for the changes you need so we can start to process them?
Thanks,
Jakob

On 30 November 2015 at 12:18, Roger Hoover <roger.hoo...@gmail.com> wrote:

> Awesome. Thanks.
>
> On Sun, Nov 29, 2015 at 3:25 PM, Elias Levy <fearsome.lucid...@gmail.com>
> wrote:
>
>> Roger,
>>
>> You are welcome. If you want to experiment, you can use my hello samza
>> <https://hub.docker.com/r/elevy/hello-samza/> Docker image.
>>
>> On Sun, Nov 29, 2015 at 12:19 PM, Roger Hoover <roger.hoo...@gmail.com>
>> wrote:
>>
>> > Elias,
>> >
>> > I would also love to be able to deploy Samza on Kubernetes with dynamic
>> > task management. Thanks for sharing this. It may be a good interim
>> > solution.
>> >
>> > Roger
>> >
>> > On Sun, Nov 29, 2015 at 11:18 AM, Elias Levy
>> > <fearsome.lucid...@gmail.com> wrote:
>> >
>> > > I've been exploring Samza for stream processing as well as Kubernetes
>> > > as a container orchestration system, and I wanted to be able to use
>> > > one with the other. The prospect of having to execute YARN either
>> > > alongside or on top of Kubernetes did not appeal to me, so I developed
>> > > a KubernetesJob implementation of SamzaJob.
>> > >
>> > > You can find the details at
>> > > https://github.com/eliaslevy/samza_kubernetes, but in summary
>> > > KubernetesJob executes and generates a serialized JobModel. Instead of
>> > > interacting with Kubernetes directly to create the SamzaContainers (as
>> > > the YarnJob's SamzaApplicationMaster may do with the YARN RM), it
>> > > outputs a config YAML file that can be used to create the
>> > > SamzaContainers in Kubernetes by using Replication Controllers. For
>> > > this you need to package your job as a Docker image. You can read the
>> > > README at the above repo for details.
>> > >
>> > > A few observations:
>> > >
>> > > It would be useful if SamzaContainer accepted the JobModel via an
>> > > environment variable. Right now it expects a URL to download it from.
>> > > I get around this by using an entry point script that copies the model
>> > > from an environment variable into a file, then passes a file URL to
>> > > SamzaContainer.
>> > >
>> > > SamzaContainer doesn't allow you to configure the JMX port. It selects
>> > > a port at random from the ephemeral range, as it expects to execute in
>> > > YARN, where a static port could result in a conflict. This is not the
>> > > case in Kubernetes, where each Pod (i.e. SamzaContainer) is given its
>> > > own IP address.
>> > >
>> > > This implementation doesn't provide a Samza dashboard, which in the
>> > > YARN implementation is hosted in the Application Master. There didn't
>> > > seem to be much value provided by the dashboard that is not already
>> > > provided by the Kubernetes tools for monitoring pods.
>> > >
>> > > I've successfully executed the hello-samza jobs in Kubernetes:
>> > >
>> > > $ kubectl get po
>> > > NAME                       READY   STATUS    RESTARTS   AGE
>> > > kafka-1-jjh8n              1/1     Running   0          2d
>> > > kafka-2-buycp              1/1     Running   0          2d
>> > > kafka-3-tghkp              1/1     Running   0          2d
>> > > wikipedia-feed-0-4its2     1/1     Running   0          1d
>> > > wikipedia-parser-0-l0onv   1/1     Running   0          17h
>> > > wikipedia-parser-1-crrxh   1/1     Running   0          17h
>> > > wikipedia-parser-2-1c5nn   1/1     Running   0          17h
>> > > wikipedia-stats-0-3gaiu    1/1     Running   0          16h
>> > > wikipedia-stats-1-j5qlk    1/1     Running   0          16h
>> > > wikipedia-stats-2-2laos    1/1     Running   0          16h
>> > > zookeeper-1-1sb4a          1/1     Running   0          2d
>> > > zookeeper-2-dndk7          1/1     Running   0          2d
>> > > zookeeper-3-46n09          1/1     Running   0          2d
>> > >
>> > > Finally, accessing services within the Kubernetes cluster from the
>> > > outside is quite cumbersome unless one uses an external load balancer.
>> > > This makes it difficult to bootstrap a job, as SamzaJob must connect
>> > > to Zookeeper and Kafka to find out the number of partitions on the
>> > > topics it will subscribe to, so it can assign them statically among
>> > > the number of containers requested.
>> > >
>> > > Ideally Samza would operate along the lines of the Kafka high-level
>> > > consumer, in which the members of a consumer group coordinate
>> > > dynamically to allocate work among themselves. This would do away with
>> > > the need to execute SamzaJob a priori to generate the JobModel to pass
>> > > to the SamzaContainers. It would also allow for dynamically changing
>> > > the number of containers without having to shut down the job.
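For anyone wanting to try the entry point workaround described in the original message (copy the job model from an environment variable into a file, then hand SamzaContainer a file URL), here is a minimal sketch in Python. It is not the script from the repo. The variable name JOB_MODEL, the file name job-model.json, and the function name are all hypothetical, and the step that actually launches SamzaContainer with the resulting URL is left out.

```python
import os
import tempfile
from pathlib import Path


def materialize_job_model(env_var="JOB_MODEL", directory=None, environ=None):
    """Write the job model held in an environment variable to a file.

    env_var and the output file name are hypothetical; adjust them to
    whatever your image and Kubernetes config actually use.
    Returns a file:// URL pointing at the written file.
    """
    environ = os.environ if environ is None else environ
    model = environ.get(env_var)
    if not model:
        raise RuntimeError(f"environment variable {env_var} is not set or empty")

    # Use the given directory, or fall back to a fresh temporary one.
    target_dir = Path(directory or tempfile.mkdtemp(prefix="job-model-")).resolve()
    target_dir.mkdir(parents=True, exist_ok=True)

    path = target_dir / "job-model.json"
    path.write_text(model, encoding="utf-8")

    # as_uri() needs an absolute path, which resolve() above guarantees.
    return path.as_uri()
```

An entry point would call materialize_job_model() once at start-up and pass the returned URL to the container process in whatever way your launch script expects.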