> rpc error: code = Unimplemented desc = Method not found: org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
This is a known issue: https://issues.apache.org/jira/browse/BEAM-10762

On Fri, Aug 28, 2020 at 4:57 PM Eugene Kirpichov <ekirpic...@gmail.com> wrote:

> P.S. Ironic how back in 2018 I was TL-ing the portable runners effort for
> a few months on the Google side, and now I need community help to get it
> to work at all.
> Still, it's pretty miraculous how far Beam's portability has come since
> then, even if it has a steep learning curve.
>
> On Fri, Aug 28, 2020 at 4:54 PM Eugene Kirpichov <ekirpic...@gmail.com> wrote:
>
>> Hi Sam,
>>
>> You're a wizard - this got me *way* farther than my previous attempts.
>> Here's a PR https://github.com/sambvfx/beam-flink-k8s/pull/1 with a
>> couple of changes I had to make.
>>
>> I also had to make some additional changes that don't make sense to
>> share, but here they are for the record:
>> - Because I'm running on Kubernetes Engine rather than minikube, I had
>> to put the docker-flink image on GCR, changing flink.yaml from
>> "image: docker-flink:1.10" to "image: us.gcr.io/$PROJECT_ID/docker-flink:1.10".
>> Of course, I also had to build and push the container.
>> - Because I'm running a custom container based on an unreleased version
>> of Beam, I had to push my custom container to GCR too, and change your
>> instructions to use that image name instead of the default one.
>> - To fetch the containers from GCR, I had to log into Docker inside the
>> Flink nodes, specifically inside the taskmanager container, using
>> something like:
>> kubectl exec pod/flink-taskmanager-blahblah -c taskmanager -- \
>>   docker login -u oauth2accesstoken --password $(gcloud auth print-access-token)
>> - Again, because I'm using an unreleased Beam SDK (due to a bug whose
>> fix will be released in 2.24), I also had to build a custom Flink job
>> server jar and point to it via --flink_job_server_jar.
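The workaround flags described above can be collected into one place. A minimal sketch of assembling them for a Beam Python pipeline - the Flink address, image name, and jar path below are illustrative placeholders, not values from this thread:

```python
def make_flink_pipeline_args(flink_master, sdk_image, job_server_jar):
    """Assemble Beam Python pipeline options for a portable Flink run with
    a custom SDK container image and a locally built job server jar."""
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        # DOCKER environment: taskworkers launch the SDK harness image via Docker
        "--environment_type=DOCKER",
        f"--environment_config={sdk_image}",
        # custom-built job server jar (needed here because of the unreleased SDK)
        f"--flink_job_server_jar={job_server_jar}",
    ]

args = make_flink_pipeline_args(
    "10.0.0.1:8081",                                   # placeholder address
    "us.gcr.io/my-project/beam_python3.7_sdk:custom",  # placeholder image
    "/tmp/beam-runners-flink-job-server.jar",          # placeholder path
)
```

These would typically be passed to `PipelineOptions(args)` when constructing the pipeline.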
>>
>> In the end, I got my pipeline to start: it created the uber jar (about
>> 240MB in size) and took a few minutes to transmit it to Flink (which is
>> a long time, but it'll do for a prototype). The Flink UI displayed the
>> pipeline and was able to *start* the worker container - however, it
>> quickly failed with the following error:
>>
>> 2020/08/28 15:49:09 Initializing python harness: /opt/apache/beam/boot --id=1-1 --provision_endpoint=localhost:45111
>> 2020/08/28 15:49:09 Failed to retrieve staged files: failed to get manifest
>> caused by:
>> rpc error: code = Unimplemented desc = Method not found: org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
>> (followed by a bunch of other garbage)
>>
>> I'm assuming this might be because I got tangled up in my custom images
>> related to the unreleased Beam SDK, and that it should be fixed when
>> running on clean Beam 2.24.
>>
>> Thank you again!
>>
>> On Fri, Aug 28, 2020 at 10:21 AM Eugene Kirpichov <ekirpic...@gmail.com> wrote:
>>
>>> Holy shit, thanks Sam, this is more help than I could have asked for!!
>>> I'll give this a shot later today and report back.
>>>
>>> On Thu, Aug 27, 2020 at 10:27 PM Sam Bourne <samb...@gmail.com> wrote:
>>>
>>>> Hi Eugene!
>>>>
>>>> I'm struggling to find complete documentation on how to do this. There
>>>> seems to be lots of conflicting or incomplete information: several
>>>> ways to deploy Flink, several ways to get Beam working with it,
>>>> bizarre StackOverflow questions, and no documentation explaining a
>>>> complete working example.
>>>>
>>>> This *is* possible, and I went through all the same frustrations of
>>>> sparse and confusing documentation. I'm glossing over a lot of
>>>> details, but the key thing was setting up the Flink taskworker(s) to
>>>> run Docker. This requires running docker-in-docker, as the taskworker
>>>> itself is a docker container in k8s.
>>>>
>>>> First, create a custom Flink container with Docker installed:
>>>>
>>>> # docker-flink Dockerfile
>>>> FROM flink:1.10
>>>> # install docker
>>>> RUN apt-get ...
>>>>
>>>> Then set up the taskmanager deployment to use a sidecar
>>>> docker-in-docker service. This dind service is where the Python SDK
>>>> harness container actually runs:
>>>>
>>>> kind: Deployment
>>>> ...
>>>> containers:
>>>> - name: docker
>>>>   image: docker:19.03.5-dind
>>>>   ...
>>>> - name: taskmanager
>>>>   image: myregistry:5000/docker-flink:1.10
>>>>   env:
>>>>   - name: DOCKER_HOST
>>>>     value: tcp://localhost:2375
>>>> ...
>>>>
>>>> I quickly threw all these pieces together in a repo here:
>>>> https://github.com/sambvfx/beam-flink-k8s
>>>>
>>>> I added a working (via minikube) step-by-step guide to the README to
>>>> prove to myself that I didn't miss anything, but feel free to submit
>>>> PRs if you want to add anything useful.
>>>>
>>>> The documents you linked are very informative. It would be great to
>>>> aggregate all this into digestible documentation. Let me know if you
>>>> have any further questions!
>>>>
>>>> Cheers,
>>>> Sam
>>>>
>>>> On Thu, Aug 27, 2020 at 10:25 AM Eugene Kirpichov <ekirpic...@gmail.com> wrote:
>>>>
>>>>> Hi Kyle,
>>>>>
>>>>> Thanks for the response!
>>>>>
>>>>> On Wed, Aug 26, 2020 at 5:28 PM Kyle Weaver <kcwea...@google.com> wrote:
>>>>>
>>>>>> > - With the Flink operator, I was able to submit a Beam job, but
>>>>>> > hit the issue that I need Docker installed on my Flink nodes. I
>>>>>> > haven't yet tried changing the operator's yaml files to add Docker
>>>>>> > inside them.
>>>>>>
>>>>>> Running Beam workers via Docker on the Flink nodes is not
>>>>>> recommended (and probably not even possible), since the Flink nodes
>>>>>> are themselves already running inside Docker containers. Running
>>>>>> workers as sidecars avoids that problem.
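The sidecar arrangement Kyle describes is normally driven from the pipeline side with Beam's EXTERNAL environment, which tells the runner to connect to an already-running worker pool instead of spawning Docker containers itself. A hedged sketch of those options - the addresses and port are assumptions for illustration:

```python
def make_sidecar_pipeline_args(flink_master, worker_pool="localhost:50000"):
    """Pipeline options for running against a Beam worker-pool sidecar
    rather than having the taskmanager launch SDK containers via Docker."""
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        # EXTERNAL: the runner connects to a pre-started SDK worker pool
        "--environment_type=EXTERNAL",
        f"--environment_config={worker_pool}",
    ]

sidecar_args = make_sidecar_pipeline_args("10.0.0.1:8081")  # placeholder
```

The trade-off discussed below follows from this: the sidecar's container image is baked into the cluster spec, so every submitted job shares the same SDK environment.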
>>>>>> For example:
>>>>>> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_flink_cluster.yaml#L17-L20
>>>>>>
>>>>> The main problem with the sidecar approach is that I can't use the
>>>>> Flink cluster as a "service" for anybody to submit their jobs with
>>>>> custom containers - the container version is fixed. Do I understand
>>>>> that correctly?
>>>>> The Docker-in-Docker approach seems viable, and is mentioned in the
>>>>> Beam Flink K8s design doc
>>>>> <https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.dtj1gnks47dq>.
>>>>>
>>>>>> > I also haven't tried this
>>>>>> > <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
>>>>>> > yet because it implies submitting jobs using "kubectl apply",
>>>>>> > which is weird - why not just submit it through the Flink job server?
>>>>>>
>>>>>> I'm guessing it goes through k8s for monitoring purposes. I see no
>>>>>> reason it shouldn't be possible to submit to the job server directly
>>>>>> through Python, network permitting, though I haven't tried this.
>>>>>>
>>>>>> On Wed, Aug 26, 2020 at 4:10 PM Eugene Kirpichov <ekirpic...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I'm still working with Pachama <https://pachama.com/> right now; we
>>>>>>> have a Kubernetes Engine cluster on GCP and want to run Beam Python
>>>>>>> batch pipelines with custom containers against it.
>>>>>>> Flink and Cloud Dataflow are the two options; Cloud Dataflow
>>>>>>> doesn't support custom containers for batch pipelines yet, so we're
>>>>>>> going with Flink.
>>>>>>>
>>>>>>> I'm struggling to find complete documentation on how to do this.
>>>>>>> There seems to be lots of conflicting or incomplete information:
>>>>>>> several ways to deploy Flink, several ways to get Beam working with
>>>>>>> it, bizarre StackOverflow questions, and no documentation
>>>>>>> explaining a complete working example.
>>>>>>>
>>>>>>> == My requests ==
>>>>>>> * Could people briefly share their working setups? It would be good
>>>>>>> to know which directions are promising.
>>>>>>> * It would be particularly helpful if someone could volunteer an
>>>>>>> hour of their time to talk to me about their working Beam/Flink/k8s
>>>>>>> setup. It's for a good cause (fixing the planet :) ), and on my
>>>>>>> side I volunteer to write up the findings to share with the
>>>>>>> community so others suffer less.
>>>>>>>
>>>>>>> == Appendix: My findings so far ==
>>>>>>> There are multiple ways to deploy Flink on k8s:
>>>>>>> - The GCP marketplace Flink operator
>>>>>>> <https://cloud.google.com/blog/products/data-analytics/open-source-processing-engines-for-kubernetes>
>>>>>>> (couldn't get it to work) and the respective CLI version
>>>>>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator>
>>>>>>> (buggy, but I got it working)
>>>>>>> - https://github.com/lyft/flinkk8soperator (haven't tried)
>>>>>>> - Flink's native k8s support
>>>>>>> <https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html>
>>>>>>> (super easy to get working)
>>>>>>>
>>>>>>> I confirmed that my Flink cluster was operational by running a
>>>>>>> simple WordCount job, initiated from my machine. However, I wasn't
>>>>>>> yet able to get Beam working:
>>>>>>>
>>>>>>> - With the Flink operator, I was able to submit a Beam job, but hit
>>>>>>> the issue that I need Docker installed on my Flink nodes.
>>>>>>> I haven't yet tried changing the operator's yaml files to add
>>>>>>> Docker inside them. I also haven't tried this
>>>>>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
>>>>>>> yet, because it implies submitting jobs using "kubectl apply",
>>>>>>> which is weird - why not just submit it through the Flink job server?
>>>>>>>
>>>>>>> - With Flink's native k8s support, I tried two things:
>>>>>>>   - Creating a fat portable jar using --output_executable_path. The
>>>>>>> jar is huge (200+MB) and takes forever to upload to my Flink
>>>>>>> cluster - this is a non-starter. When I did actually upload it, I
>>>>>>> hit the same issue with lacking Docker. Haven't tried fixing that yet.
>>>>>>>   - Simply running my pipeline with --runner=FlinkRunner
>>>>>>> --environment_type=DOCKER --flink_master=$PUBLIC_IP:8081. The Java
>>>>>>> process appears to send 1+GB of data somewhere, but the job never
>>>>>>> even starts.
>>>>>>>
>>>>>>> I looked at a few conference talks:
>>>>>>> - https://www.cncf.io/wp-content/uploads/2020/02/CNCF-Webinar_-Apache-Flink-on-Kubernetes-Operator-1.pdf
>>>>>>> seems to imply that I need to add a Beam worker "sidecar" to the
>>>>>>> Flink workers, and that I need to submit my job using "kubectl apply".
>>>>>>> - https://www.youtube.com/watch?v=8k1iezoc5Sc also mentions the
>>>>>>> sidecar, but mentions the fat jar option as well.
>>>>>>>
>>>>>>> --
>>>>>>> Eugene Kirpichov
>>>>>>> http://www.linkedin.com/in/eugenekirpichov
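The two native-k8s attempts in the last message differ only in their flags: the fat-jar route adds --output_executable_path, which makes the SDK write a self-contained portable jar rather than submitting the job directly. A sketch with illustrative placeholder values (the address and output path are assumptions):

```python
def make_uber_jar_args(flink_master, jar_out="/tmp/pipeline-uber.jar"):
    """Build-then-upload mode: the pipeline run produces a portable 'uber'
    jar at jar_out, which is then submitted to the Flink cluster separately."""
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        "--environment_type=DOCKER",
        # write the fat jar instead of submitting the job over the wire
        f"--output_executable_path={jar_out}",
    ]

uber_args = make_uber_jar_args("203.0.113.5:8081")  # placeholder address
```

Dropping the last flag gives the second, direct-submission variant described above.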