I've been exploring Samza for stream processing as well as Kubernetes as a
container orchestration system and I wanted to be able to use one with the
other.  The prospect of having to run YARN either alongside or on top of
Kubernetes did not appeal to me, so I developed a KubernetesJob
implementation of SamzaJob.

You can find the details at https://github.com/eliaslevy/samza_kubernetes,
but in summary: when run, KubernetesJob generates a serialized JobModel.
Instead of interacting with Kubernetes directly to create the
SamzaContainers (as YarnJob's SamzaApplicationMaster does with the YARN
RM), it outputs a config YAML file that can be used to create the
SamzaContainers in Kubernetes using Replication Controllers.  For this you
need to package your job as a Docker image.  See the README in the above
repo for details.
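
To make that concrete, here is a minimal sketch of what creating one of the
SamzaContainers could look like.  The Replication Controller below is
illustrative only: the names, labels, image, and the SAMZA_JOB_MODEL
variable are placeholders, not the literal output of KubernetesJob.

$ kubectl create -f - <<'EOF'
apiVersion: v1
kind: ReplicationController
metadata:
  name: wikipedia-feed-0
spec:
  replicas: 1
  selector:
    job: wikipedia-feed
    container: "0"
  template:
    metadata:
      labels:
        job: wikipedia-feed
        container: "0"
    spec:
      containers:
      - name: samza
        # Your job packaged as a Docker image (placeholder name)
        image: my-registry/hello-samza:latest
        env:
        - name: SAMZA_CONTAINER_ID
          value: "0"
        - name: SAMZA_JOB_MODEL
          value: "<serialized JobModel JSON>"
EOF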

A few observations:

It would be useful if SamzaContainer accepted the JobModel via an
environment variable.  Right now it expects a URL to download it from.  I
get around this with an entry point script that copies the model from an
environment variable into a file and then passes a file URL to
SamzaContainer.
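
A minimal sketch of such an entry point, assuming the model arrives in a
SAMZA_JOB_MODEL variable and that the container picks up the model URL from
SAMZA_COORDINATOR_URL (both names may differ from what the repo actually
uses):

#!/bin/sh
# Write the serialized JobModel from the environment to a file and hand
# the container a file:// URL instead of an http:// one.
echo "$SAMZA_JOB_MODEL" > /tmp/job-model.json
export SAMZA_COORDINATOR_URL="file:///tmp/job-model.json"
exec bin/run-container.sh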

SamzaContainer doesn't allow you to configure the JMX port.  It selects a
port at random from the ephemeral range, because it expects to run under
YARN, where a static port could result in a conflict.  That is not a
concern in Kubernetes, where each Pod (i.e. each SamzaContainer) is given
its own IP address.
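
One possible workaround, which I have not verified, is to side-step Samza's
JMX server and enable the JVM's built-in JMX agent on a fixed port with the
standard system properties (assuming the launch script picks up JAVA_OPTS):

JAVA_OPTS="$JAVA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"

With one SamzaContainer per Pod and each Pod on its own IP, a fixed port
like 9999 cannot conflict with another container.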

This implementation doesn't provide a Samza dashboard, which in the YARN
implementation is hosted in the Application Master.  The dashboard didn't
seem to provide much value beyond what the Kubernetes tools for monitoring
pods already offer.
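
The usual pod tooling covers the same ground; for example, using one of the
pods from the run shown below:

$ kubectl logs wikipedia-feed-0-4its2          # container log output
$ kubectl describe po wikipedia-feed-0-4its2   # state, restarts, events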

I've successfully executed the hello-samza jobs in Kubernetes:

$ kubectl get po
NAME                       READY     STATUS    RESTARTS   AGE
kafka-1-jjh8n              1/1       Running   0          2d
kafka-2-buycp              1/1       Running   0          2d
kafka-3-tghkp              1/1       Running   0          2d
wikipedia-feed-0-4its2     1/1       Running   0          1d
wikipedia-parser-0-l0onv   1/1       Running   0          17h
wikipedia-parser-1-crrxh   1/1       Running   0          17h
wikipedia-parser-2-1c5nn   1/1       Running   0          17h
wikipedia-stats-0-3gaiu    1/1       Running   0          16h
wikipedia-stats-1-j5qlk    1/1       Running   0          16h
wikipedia-stats-2-2laos    1/1       Running   0          16h
zookeeper-1-1sb4a          1/1       Running   0          2d
zookeeper-2-dndk7          1/1       Running   0          2d
zookeeper-3-46n09          1/1       Running   0          2d


Finally, accessing services within the Kubernetes cluster from the outside
is quite cumbersome unless one uses an external load balancer.  This makes
it difficult to bootstrap a job, as SamzaJob must connect to Zookeeper and
Kafka to find out the number of partitions of the topics it will subscribe
to, so that it can assign them statically across the requested number of
containers.
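
To illustrate, the job's Kafka system configuration ends up pointing at
in-cluster endpoints that only resolve inside the cluster (the host names
below assume Services named kafka and zookeeper, which may differ from your
setup):

systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=zookeeper:2181
systems.kafka.producer.bootstrap.servers=kafka:9092

Submitting the job from outside the cluster therefore requires exposing
both of those services externally, e.g. via an external load balancer.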

Ideally Samza would operate along the lines of the Kafka high-level
consumer, whose members dynamically coordinate to allocate work within a
consumer group.  That would do away with the need to execute SamzaJob a
priori to generate the JobModel passed to the SamzaContainers.  It would
also allow the number of containers to change dynamically without having
to shut down the job.
