Re: Executing Samza jobs natively in Kubernetes

Jakob Homan Mon, 30 Nov 2015 12:33:07 -0800

Hey Elias-
  This is awesome work.  Would you be interested in opening JIRAs for the
changes you need so we can start to process them?


Thanks,
Jakob

On 30 November 2015 at 12:18, Roger Hoover <[email protected]> wrote:
> Awesome.  Thanks.
>
> On Sun, Nov 29, 2015 at 3:25 PM, Elias Levy <[email protected]>
> wrote:
>
>> Roger,
>>
>> You are welcome.  If you want to experiment, you can use my hello-samza
>> <https://hub.docker.com/r/elevy/hello-samza/> Docker image.
>>
>> On Sun, Nov 29, 2015 at 12:19 PM, Roger Hoover <[email protected]>
>> wrote:
>>
>> > Elias,
>> >
>> > I would also love to be able to deploy Samza on Kubernetes with dynamic
>> > task management.  Thanks for sharing this.  It may be a good interim
>> > solution.
>> >
>> > Roger
>> >
>> > On Sun, Nov 29, 2015 at 11:18 AM, Elias Levy <[email protected]>
>> > wrote:
>> >
>> > > I've been exploring Samza for stream processing as well as Kubernetes
>> > > as a container orchestration system, and I wanted to be able to use
>> > > one with the other.  The prospect of having to execute YARN either
>> > > alongside or on top of Kubernetes did not appeal to me, so I developed
>> > > a KubernetesJob implementation of SamzaJob.
>> > >
>> > > You can find the details at
>> > > https://github.com/eliaslevy/samza_kubernetes, but in summary
>> > > KubernetesJob executes and generates a serialized JobModel.  Instead
>> > > of interacting with Kubernetes directly to create the SamzaContainers
>> > > (as the YarnJob's SamzaApplicationMaster may do with the YARN RM), it
>> > > outputs a config YAML file that can be used to create the
>> > > SamzaContainers in Kubernetes by using Replication Controllers.  For
>> > > this you need to package your job as a Docker image.  You can read the
>> > > README at the above repo for details.
>> > >
>> > > A few observations:
>> > >
>> > > It would be useful if SamzaContainer accepted the JobModel via an
>> > > environment variable.  Right now it expects a URL to download it
>> > > from.  I get around this by using an entry point script that copies
>> > > the model from an environment variable into a file, then passes a
>> > > file URL to SamzaContainer.
>> > >
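
The entry point workaround described above might look roughly like the
following.  This is only a sketch of the idea, not the script from the repo:
the environment variable name SAMZA_JOB_MODEL, the file name job-model.json
and the function write_job_model are all assumptions made for illustration.

```python
import os
import pathlib
import tempfile


def write_job_model(env_var="SAMZA_JOB_MODEL", directory=None):
    """Copy a serialized JobModel from an environment variable into a file.

    Returns a file: URL for the written file.  The variable name and the
    file name are hypothetical; the original script may use different ones.
    """
    model = os.environ[env_var]  # raises KeyError if the variable is unset
    target_dir = pathlib.Path(directory or tempfile.mkdtemp())
    path = target_dir / "job-model.json"
    path.write_text(model, encoding="utf-8")
    # as_uri() needs an absolute path, so resolve it first.
    return path.resolve().as_uri()
```

The returned URL is what the entry point would then hand to SamzaContainer
in place of the HTTP URL it normally downloads the model from.
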
>> > > SamzaContainer doesn't allow you to configure the JMX port.  It
>> > > selects a port at random from the ephemeral range as it expects to
>> > > execute in YARN where a static port could result in a conflict.  This
>> > > is not the case in Kubernetes where each Pod (i.e. SamzaContainer) is
>> > > given its own IP address.
>> > >
>> > > This implementation doesn't provide a Samza dashboard, which in the
>> > > YARN implementation is hosted in the Application Master.  There didn't
>> > > seem to be much value provided by the dashboard that is not already
>> > > provided by the Kubernetes tools for monitoring pods.
>> > >
>> > > I've successfully executed the hello-samza jobs in Kubernetes:
>> > >
>> > > $ kubectl get po
>> > > NAME                       READY     STATUS    RESTARTS   AGE
>> > > kafka-1-jjh8n              1/1       Running   0          2d
>> > > kafka-2-buycp              1/1       Running   0          2d
>> > > kafka-3-tghkp              1/1       Running   0          2d
>> > > wikipedia-feed-0-4its2     1/1       Running   0          1d
>> > > wikipedia-parser-0-l0onv   1/1       Running   0          17h
>> > > wikipedia-parser-1-crrxh   1/1       Running   0          17h
>> > > wikipedia-parser-2-1c5nn   1/1       Running   0          17h
>> > > wikipedia-stats-0-3gaiu    1/1       Running   0          16h
>> > > wikipedia-stats-1-j5qlk    1/1       Running   0          16h
>> > > wikipedia-stats-2-2laos    1/1       Running   0          16h
>> > > zookeeper-1-1sb4a          1/1       Running   0          2d
>> > > zookeeper-2-dndk7          1/1       Running   0          2d
>> > > zookeeper-3-46n09          1/1       Running   0          2d
>> > >
>> > >
>> > > Finally, accessing services within the Kubernetes cluster from the
>> > > outside is quite cumbersome unless one uses an external load balancer.
>> > > This makes it difficult to bootstrap a job, as SamzaJob must connect
>> > > to Zookeeper and Kafka to find out the number of partitions on the
>> > > topics it will subscribe to, so it can assign them statically among
>> > > the number of containers requested.
>> > >
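
To make the static assignment concrete, here is a minimal sketch of one
simple scheme, round-robin by partition number.  It is an illustration only:
the function assign_partitions is hypothetical, and Samza's actual grouping
logic may differ.

```python
def assign_partitions(partition_count, container_count):
    """Statically spread partition ids over containers, round-robin.

    Returns a dict mapping container id -> list of partition ids.
    """
    if container_count <= 0:
        raise ValueError("container_count must be positive")
    assignment = {container: [] for container in range(container_count)}
    for partition in range(partition_count):
        assignment[partition % container_count].append(partition)
    return assignment


# Five partitions over three containers.
print(assign_partitions(5, 3))  # {0: [0, 3], 1: [1, 4], 2: [2]}
```

Because the mapping is computed once, up front, the partition count has to
be known before any container starts, which is why the bootstrap step needs
access to Zookeeper and Kafka.
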
>> > > Ideally Samza would operate along the lines of the Kafka high-level
>> > > consumer, whose instances dynamically coordinate to allocate work
>> > > among the members of a consumer group.  This would do away with the
>> > > need to execute SamzaJob a priori to generate the JobModel to pass to
>> > > the SamzaContainers.  It would also allow for dynamically changing the
>> > > number of containers without having to shut down the job.
>> > >
>> >
>>
