Elias,

I would also love to be able to deploy Samza on Kubernetes with dynamic
task management.  Thanks for sharing this.  It may be a good interim
solution.

Roger

On Sun, Nov 29, 2015 at 11:18 AM, Elias Levy <fearsome.lucid...@gmail.com>
wrote:

> I've been exploring Samza for stream processing as well as Kubernetes as a
> container orchestration system and I wanted to be able to use one with the
> other.  The prospect of having to execute YARN either along side or on top
> of Kubernetes did not appeal to me, so I developed a KubernetesJob
> implementation of SamzaJob.
>
> You can find the details at https://github.com/eliaslevy/samza_kubernetes,
> but in summary KubernetesJob executes and generates a serialized JobModel.
> Instead of interacting with Kubernetes directly to create the
> SamzaContainers (as the YarnJob's SamzaApplicationMaster may do with the
> YARN RM), it output a config YAML file that can be used to create the
> SamzaContainers in Kubernetes by using Resource Controllers.  For this you
> require to package your job as a Docker image.  You can reach the README at
> the above repo for details.
>
> A few observations:
>
> It would be useful if SamzaContainer accepted the JobModel via an
> environment variable.  Right not it expects a URL to download it from.  I
> get around this by using a entry point script that copies the model from an
> environment variable into a file, then passes a file URL to SamzaContainer.
>
> SamzaContainer doesn't allow you to configure the JMX port.  It selects a
> port at random from the ephemeral range as it expects to execute in YARN
> where a static port could result in a conflict.  This is not the case in
> Kubernetes where each Pod (i.e. SamzaContainer) is given its own IP
> address.
>
> This implementation doesn't provide a Samza dashboard, which in the YARN
> implementation is hosted in the Application Master.  There didn't seem to
> be much value provided by the dashboard that is not already provided by the
> Kubernetes tools for monitoring pods.
>
> I've successfully executed the hello-samza jobs in Kubernetes:
>
> $ kubectl get po
> NAME                       READY     STATUS    RESTARTS   AGE
> kafka-1-jjh8n              1/1       Running   0          2d
> kafka-2-buycp              1/1       Running   0          2d
> kafka-3-tghkp              1/1       Running   0          2d
> wikipedia-feed-0-4its2     1/1       Running   0          1d
> wikipedia-parser-0-l0onv   1/1       Running   0          17h
> wikipedia-parser-1-crrxh   1/1       Running   0          17h
> wikipedia-parser-2-1c5nn   1/1       Running   0          17h
> wikipedia-stats-0-3gaiu    1/1       Running   0          16h
> wikipedia-stats-1-j5qlk    1/1       Running   0          16h
> wikipedia-stats-2-2laos    1/1       Running   0          16h
> zookeeper-1-1sb4a          1/1       Running   0          2d
> zookeeper-2-dndk7          1/1       Running   0          2d
> zookeeper-3-46n09          1/1       Running   0          2d
>
>
> Finally, accessing services within the Kubernetes cluster from the outside
> is quite cumbersome unless one uses an external load balancer.  This makes
> it difficult to bootstrap a job, as SamzaJob must connect to Zookeeper and
> Kafka to find out the number of partitions on the topics it will subscribe
> to, so it can assign them statically among the number of containers
> requested.
>
> Ideally Samza would operate along the lines of the Kafka high-level
> consumer, which dynamically coordinate to allocate work among members of a
> consumer group.  This would do away with the new to execute SamzaJob a
> priori to generate the JobModel to pass to the SamzaContainers.  It would
> also allow for dynamically changing the number of containers without having
> the shutdown the job.
>

Reply via email to