P.S. Ironic how back in 2018 I was TL-ing the portable runners effort for a
few months on Google side, and now I need community help to get it to work
at all.
Still pretty miraculous how far Beam's portability has come since
then, even if it has a steep learning curve.

On Fri, Aug 28, 2020 at 4:54 PM Eugene Kirpichov <ekirpic...@gmail.com>
wrote:

> Hi Sam,
>
> You're a wizard - this got me *way* farther than my previous attempts.
> Here's a PR https://github.com/sambvfx/beam-flink-k8s/pull/1 with a
> couple of changes I had to make.
>
> I also had to make some additional changes that are too specific to my
> setup to include in the PR, but here they are for the record:
> - Because I'm running on k8s engine and not minikube, I had to put the
> docker-flink image on GCR, changing flink.yaml "image: docker-flink:1.10"
> -> "image: us.gcr.io/$PROJECT_ID/docker-flink:1.10". I of course also had
> to build and push the container.
> - Because I'm running with a custom container based on an unreleased
> version of Beam, I had to push my custom container to GCR too, and change
> your instructions to use that image name instead of the default one.
> - To fetch the containers from GCR, I had to log into Docker inside the
> Flink nodes, specifically inside the taskmanager container, using something
> like "kubectl exec pod/flink-taskmanager-blahblah -c taskmanager -- docker
> login -u oauth2accesstoken --password $(gcloud auth print-access-token)"
> - Again because I'm using an unreleased Beam SDK (due to a bug whose fix
> will be released in 2.24), I had to also build a custom Flink job server
> jar and point to it via --flink_job_server_jar.
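> For the record, the image-publishing and registry-login steps above look
> roughly like this (a sketch of my setup; $PROJECT_ID and the pod name are
> placeholders):

```shell
# Build and push the docker-enabled Flink image to GCR
# ($PROJECT_ID is a placeholder for the GCP project).
docker build -t us.gcr.io/$PROJECT_ID/docker-flink:1.10 .
docker push us.gcr.io/$PROJECT_ID/docker-flink:1.10

# Log the Docker daemon used by the taskmanager into GCR so it can pull
# the SDK harness image (the pod name here is illustrative).
kubectl exec pod/flink-taskmanager-blahblah -c taskmanager -- \
  docker login -u oauth2accesstoken --password "$(gcloud auth print-access-token)"
```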
>
> In the end, I got my pipeline to start: it created the uber jar (about
> 240 MB in size) and took a few minutes to transmit it to Flink (which is a
> long time, but it'll do for a prototype); the Flink UI displayed the
> pipeline, and the worker container did *start* - however, it quickly
> failed with the following error:
>
> 2020/08/28 15:49:09 Initializing python harness: /opt/apache/beam/boot
> --id=1-1 --provision_endpoint=localhost:45111
> 2020/08/28 15:49:09 Failed to retrieve staged files: failed to get manifest
> caused by:
> rpc error: code = Unimplemented desc = Method not found:
> org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
> (followed by a bunch of other garbage)
>
> I'm assuming this might be because I got tangled in my custom images
> related to the unreleased Beam SDK, and should be fixed if running on clean
> Beam 2.24.
>
> Thank you again!
>
> On Fri, Aug 28, 2020 at 10:21 AM Eugene Kirpichov <ekirpic...@gmail.com>
> wrote:
>
>> Holy shit, thanks Sam, this is more help than I could have asked for!!
>> I'll give this a shot later today and report back.
>>
>> On Thu, Aug 27, 2020 at 10:27 PM Sam Bourne <samb...@gmail.com> wrote:
>>
>>> Hi Eugene!
>>>
>>> I’m struggling to find complete documentation on how to do this. There
>>> seems to be lots of conflicting or incomplete information: several ways to
>>> deploy Flink, several ways to get Beam working with it, bizarre
>>> StackOverflow questions, and no documentation explaining a complete working
>>> example.
>>>
>>> This *is* possible and I went through all the same frustrations of
>>> sparse and confusing documentation. I’m glossing over a lot of details, but
>>> the key thing was setting up the Flink taskmanager(s) to run Docker. This
>>> requires running Docker-in-Docker, as the taskmanager itself is a Docker
>>> container in k8s.
>>>
>>> First create a custom flink container with docker:
>>>
>>> # docker-flink Dockerfile
>>>
>>> FROM flink:1.10
>>> # install docker
>>> RUN apt-get ...
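>>>
>>> In case it helps, the elided install step might look something like this
>>> (a sketch; I'm assuming the Debian docker.io package is enough, since
>>> only the Docker CLI is needed in this image - the daemon runs in the
>>> dind sidecar):

```dockerfile
# docker-flink Dockerfile (hypothetical expansion of the apt-get step)
FROM flink:1.10

# Install the Docker client so the taskmanager can reach the dind
# sidecar via DOCKER_HOST.
RUN apt-get update && \
    apt-get install -y --no-install-recommends docker.io && \
    rm -rf /var/lib/apt/lists/*
```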
>>>
>>> Then set up the taskmanager deployment to use a sidecar docker-in-docker
>>> service. This dind service is where the python sdk harness container
>>> actually runs.
>>>
>>> kind: Deployment
>>> ...
>>>   containers:
>>>   - name: docker
>>>     image: docker:19.03.5-dind
>>>     ...
>>>   - name: taskmanager
>>>     image: myregistry:5000/docker-flink:1.10
>>>     env:
>>>     - name: DOCKER_HOST
>>>       value: tcp://localhost:2375
>>> ...
>>>
>>> I quickly threw all these pieces together in a repo here:
>>> https://github.com/sambvfx/beam-flink-k8s
>>>
>>> I added a working (via minikube) step-by-step in the README to prove to
>>> myself that I didn’t miss anything, but feel free to submit any PRs if you
>>> want to add anything useful.
>>>
>>> The documents you linked are very informative. It would be great to
>>> aggregate all this into digestible documentation. Let me know if you have
>>> any further questions!
>>>
>>> Cheers,
>>> Sam
>>>
>>> On Thu, Aug 27, 2020 at 10:25 AM Eugene Kirpichov <ekirpic...@gmail.com>
>>> wrote:
>>>
>>>> Hi Kyle,
>>>>
>>>> Thanks for the response!
>>>>
>>>> On Wed, Aug 26, 2020 at 5:28 PM Kyle Weaver <kcwea...@google.com>
>>>> wrote:
>>>>
>>>>> > - With the Flink operator, I was able to submit a Beam job, but hit
>>>>> the issue that I need Docker installed on my Flink nodes. I haven't yet
>>>>> tried changing the operator's yaml files to add Docker inside them.
>>>>>
>>>>> Running Beam workers via Docker on the Flink nodes is not recommended
>>>>> (and probably not even possible), since the Flink nodes are themselves
>>>>> already running inside Docker containers. Running workers as sidecars
>>>>> avoids that problem. For example:
>>>>> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_flink_cluster.yaml#L17-L20
>>>>>
>>>>> The main problem with the sidecar approach is that I can't use the
>>>> Flink cluster as a "service" for anybody to submit their jobs with custom
>>>> containers - the container version is fixed.
>>>> Do I understand it correctly?
>>>> Seems like the Docker-in-Docker approach is viable, and is mentioned in
>>>> the Beam Flink K8s design doc
>>>> <https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.dtj1gnks47dq>
>>>> .
>>>>
>>>>
>>>>> > I also haven't tried this
>>>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
>>>>>  yet
>>>>> because it implies submitting jobs using "kubectl apply"  which is weird -
>>>>> why not just submit it through the Flink job server?
>>>>>
>>>>> I'm guessing it goes through k8s for monitoring purposes. I see no
>>>>> reason it shouldn't be possible to submit to the job server directly
>>>>> through Python, network permitting, though I haven't tried this.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 26, 2020 at 4:10 PM Eugene Kirpichov <ekirpic...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> I'm still working with Pachama <https://pachama.com/> right now; we
>>>>>> have a Kubernetes Engine cluster on GCP and want to run Beam Python batch
>>>>>> pipelines with custom containers against it.
>>>>>> Flink and Cloud Dataflow are the two options; Cloud Dataflow doesn't
>>>>>> support custom containers for batch pipelines yet, so we're going with
>>>>>> Flink.
>>>>>>
>>>>>> I'm struggling to find complete documentation on how to do this.
>>>>>> There seems to be lots of conflicting or incomplete information: several
>>>>>> ways to deploy Flink, several ways to get Beam working with it, bizarre
>>>>>> StackOverflow questions, and no documentation explaining a complete 
>>>>>> working
>>>>>> example.
>>>>>>
>>>>>> == My requests ==
>>>>>> * Could people briefly share their working setup? Would be good to
>>>>>> know which directions are promising.
>>>>>> * It would be particularly helpful if someone could volunteer an hour
>>>>>> of their time to talk to me about their working Beam/Flink/k8s setup. 
>>>>>> It's
>>>>>> for a good cause (fixing the planet :) ) and on my side I volunteer to
>>>>>> write up the findings to share with the community so others suffer less.
>>>>>>
>>>>>> == Appendix: My findings so far ==
>>>>>> There are multiple ways to deploy Flink on k8s:
>>>>>> - The GCP marketplace Flink operator
>>>>>> <https://cloud.google.com/blog/products/data-analytics/open-source-processing-engines-for-kubernetes>
>>>>>>  (couldn't
>>>>>> get it to work) and the respective CLI version
>>>>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator> (buggy,
>>>>>> but I got it working)
>>>>>> - https://github.com/lyft/flinkk8soperator (haven't tried)
>>>>>> - Flink's native k8s support
>>>>>> <https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html>
>>>>>>  (super
>>>>>> easy to get working)
>>>>>>
>>>>>> I confirmed that my Flink cluster was operational by running a simple
>>>>>> Wordcount job, initiated from my machine. However I wasn't yet able to 
>>>>>> get
>>>>>> Beam working:
>>>>>>
>>>>>> - With the Flink operator, I was able to submit a Beam job, but hit
>>>>>> the issue that I need Docker installed on my Flink nodes. I haven't yet
>>>>>> tried changing the operator's yaml files to add Docker inside them. I 
>>>>>> also
>>>>>> haven't tried this
>>>>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
>>>>>> yet because it implies submitting jobs using "kubectl apply"  which is
>>>>>> weird - why not just submit it through the Flink job server?
>>>>>>
>>>>>> - With Flink's native k8s support, I tried two things:
>>>>>>   - Creating a fat portable jar using  --output_executable_path. The
>>>>>> jar is huge (200+MB) and takes forever to upload to my Flink cluster - 
>>>>>> this
>>>>>> is a non-starter. But if I actually upload it, then I hit the same issue
>>>>>> with lacking Docker. Haven't tried fixing it yet.
>>>>>>   - Simply running my pipeline --runner=FlinkRunner
>>>>>> --environment_type=DOCKER --flink_master=$PUBLIC_IP:8081. The Java 
>>>>>> process
>>>>>> appears to send 1+GB of data to somewhere, but the job never even starts.
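>>>>>> For concreteness, that invocation is roughly the following (a sketch;
>>>>>> the pipeline file name is a placeholder):

```shell
# Placeholder pipeline file; flags as described above.
python my_pipeline.py \
  --runner=FlinkRunner \
  --environment_type=DOCKER \
  --flink_master=$PUBLIC_IP:8081
```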
>>>>>>
>>>>>> I looked at a few conference talks:
>>>>>> -
>>>>>> https://www.cncf.io/wp-content/uploads/2020/02/CNCF-Webinar_-Apache-Flink-on-Kubernetes-Operator-1.pdf
>>>>>> - seems to imply that I need to add a Beam worker "sidecar" to the Flink
>>>>>> workers; and that I need to submit my job using "kubectl apply".
>>>>>> - https://www.youtube.com/watch?v=8k1iezoc5Sc which also mentions
>>>>>> the sidecar, but also mentions the fat jar option
>>>>>>
>>>>>> --
>>>>>> Eugene Kirpichov
>>>>>> http://www.linkedin.com/in/eugenekirpichov
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Eugene Kirpichov
>>>> http://www.linkedin.com/in/eugenekirpichov
>>>>
>>>
>>
>> --
>> Eugene Kirpichov
>> http://www.linkedin.com/in/eugenekirpichov
>>
>
>
> --
> Eugene Kirpichov
> http://www.linkedin.com/in/eugenekirpichov
>


-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
