P.S. Ironic how back in 2018 I was TL-ing the portable runners effort for a few months on Google side, and now I need community help to get it to work at all. Still pretty miraculous how far Beam's portability has come since then, even if it has a steep learning curve.
On Fri, Aug 28, 2020 at 4:54 PM Eugene Kirpichov <ekirpic...@gmail.com> wrote:

> Hi Sam,
>
> You're a wizard - this got me *way* farther than my previous attempts.
> Here's a PR https://github.com/sambvfx/beam-flink-k8s/pull/1 with a
> couple of changes I had to make.
>
> I had to make some additional changes that don't make sense to share,
> but here they are for the record:
> - Because I'm running on Kubernetes Engine and not minikube, I had to
> put the docker-flink image on GCR, changing flink.yaml
> "image: docker-flink:1.10" ->
> "image: us.gcr.io/$PROJECT_ID/docker-flink:1.10". Of course, I also had
> to build and push the container.
> - Because I'm running with a custom container based on an unreleased
> version of Beam, I had to push my custom container to GCR too, and
> change your instructions to use that image name instead of the default
> one.
> - To fetch the containers from GCR, I had to log into Docker inside the
> Flink nodes, specifically inside the taskmanager container, using
> something like "kubectl exec pod/flink-taskmanager-blahblah -c
> taskmanager -- docker login -u oauth2accesstoken --password $(gcloud
> auth print-access-token)".
> - Again, because I'm using an unreleased Beam SDK (due to a bug whose
> fix will be released in 2.24), I also had to build a custom Flink job
> server jar and point to it via --flink_job_server_jar.
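The GCR steps above boil down to a handful of commands; here is a minimal sketch that assembles them (the project id and pod name are placeholders, and in practice each string would be executed with subprocess rather than printed):

```python
# Sketch of the GCR steps described above. The project id and pod name are
# placeholders; in practice each command string would be run via
# subprocess.run(cmd, shell=True) rather than printed.
project_id = "my-gcp-project"       # placeholder GCP project id
pod = "flink-taskmanager-abc123"    # real name comes from `kubectl get pods`

gcr_image = f"us.gcr.io/{project_id}/docker-flink:1.10"

commands = [
    # Retag the locally built image and push it to GCR:
    f"docker tag docker-flink:1.10 {gcr_image}",
    f"docker push {gcr_image}",
    # Log the Docker daemon inside the taskmanager container into GCR:
    f"kubectl exec pod/{pod} -c taskmanager -- "
    "docker login -u oauth2accesstoken --password $(gcloud auth print-access-token)",
]

for cmd in commands:
    print(cmd)
```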
> In the end, I got my pipeline to start, create the uber jar (about
> 240 MB in size), and take a few minutes to transmit it to Flink (which
> is a long time, but it'll do for a prototype); the Flink UI displayed
> the pipeline and was able to *start* the worker container - however, it
> quickly failed with the following error:
>
> 2020/08/28 15:49:09 Initializing python harness: /opt/apache/beam/boot
> --id=1-1 --provision_endpoint=localhost:45111
> 2020/08/28 15:49:09 Failed to retrieve staged files: failed to get
> manifest
> caused by:
> rpc error: code = Unimplemented desc = Method not found:
> org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
> (followed by a bunch of other garbage)
>
> I'm assuming this might be because I got tangled up in my custom images
> related to the unreleased Beam SDK, and it should be fixed when running
> on a clean Beam 2.24.
>
> Thank you again!
>
> On Fri, Aug 28, 2020 at 10:21 AM Eugene Kirpichov
> <ekirpic...@gmail.com> wrote:
>
>> Holy shit, thanks Sam, this is more help than I could have asked
>> for!! I'll give this a shot later today and report back.
>>
>> On Thu, Aug 27, 2020 at 10:27 PM Sam Bourne <samb...@gmail.com> wrote:
>>
>>> Hi Eugene!
>>>
>>> I’m struggling to find complete documentation on how to do this.
>>> There seems to be lots of conflicting or incomplete information:
>>> several ways to deploy Flink, several ways to get Beam working with
>>> it, bizarre StackOverflow questions, and no documentation explaining
>>> a complete working example.
>>>
>>> This *is* possible, and I went through all the same frustrations of
>>> sparse and confusing documentation. I’m glossing over a lot of
>>> details, but the key thing was setting up the Flink taskworker(s) to
>>> run Docker. This requires running docker-in-docker, as the
>>> taskworker itself is a Docker container in k8s.
>>> First create a custom flink container with docker:
>>>
>>> # docker-flink Dockerfile
>>> FROM flink:1.10
>>> # install docker
>>> RUN apt-get ...
>>>
>>> Then set up the taskmanager deployment to use a sidecar
>>> docker-in-docker service. This dind service is where the python sdk
>>> harness container actually runs.
>>>
>>> kind: Deployment
>>> ...
>>> containers:
>>>   - name: docker
>>>     image: docker:19.03.5-dind
>>>     ...
>>>   - name: taskmanager
>>>     image: myregistry:5000/docker-flink:1.10
>>>     env:
>>>       - name: DOCKER_HOST
>>>         value: tcp://localhost:2375
>>> ...
>>>
>>> I quickly threw all these pieces together in a repo here:
>>> https://github.com/sambvfx/beam-flink-k8s
>>>
>>> I added a working (via minikube) step-by-step guide in the README to
>>> prove to myself that I didn’t miss anything, but feel free to submit
>>> any PRs if you want to add anything useful.
>>>
>>> The documents you linked are very informative. It would be great to
>>> aggregate all this into digestible documentation. Let me know if you
>>> have any further questions!
>>>
>>> Cheers,
>>> Sam
>>>
>>> On Thu, Aug 27, 2020 at 10:25 AM Eugene Kirpichov
>>> <ekirpic...@gmail.com> wrote:
>>>
>>>> Hi Kyle,
>>>>
>>>> Thanks for the response!
>>>>
>>>> On Wed, Aug 26, 2020 at 5:28 PM Kyle Weaver <kcwea...@google.com>
>>>> wrote:
>>>>
>>>>> > - With the Flink operator, I was able to submit a Beam job, but
>>>>> hit the issue that I need Docker installed on my Flink nodes. I
>>>>> haven't yet tried changing the operator's yaml files to add Docker
>>>>> inside them.
>>>>>
>>>>> Running Beam workers via Docker on the Flink nodes is not
>>>>> recommended (and probably not even possible), since the Flink nodes
>>>>> are themselves already running inside Docker containers. Running
>>>>> workers as sidecars avoids that problem.
>>>>> For example:
>>>>> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_flink_cluster.yaml#L17-L20
>>>>>
>>>> The main problem with the sidecar approach is that I can't use the
>>>> Flink cluster as a "service" for anybody to submit their jobs with
>>>> custom containers - the container version is fixed.
>>>> Do I understand it correctly?
>>>> Seems like the Docker-in-Docker approach is viable, and is mentioned
>>>> in the Beam Flink K8s design doc
>>>> <https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.dtj1gnks47dq>.
>>>>
>>>>> > I also haven't tried this
>>>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
>>>>> yet because it implies submitting jobs using "kubectl apply", which
>>>>> is weird - why not just submit it through the Flink job server?
>>>>>
>>>>> I'm guessing it goes through k8s for monitoring purposes. I see no
>>>>> reason it shouldn't be possible to submit to the job server
>>>>> directly through Python, network permitting, though I haven't
>>>>> tried this.
>>>>>
>>>>> On Wed, Aug 26, 2020 at 4:10 PM Eugene Kirpichov
>>>>> <ekirpic...@gmail.com> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> I'm still working with Pachama <https://pachama.com/> right now;
>>>>>> we have a Kubernetes Engine cluster on GCP and want to run Beam
>>>>>> Python batch pipelines with custom containers against it.
>>>>>> Flink and Cloud Dataflow are the two options; Cloud Dataflow
>>>>>> doesn't support custom containers for batch pipelines yet, so
>>>>>> we're going with Flink.
>>>>>>
>>>>>> I'm struggling to find complete documentation on how to do this.
>>>>>> There seems to be lots of conflicting or incomplete information:
>>>>>> several ways to deploy Flink, several ways to get Beam working
>>>>>> with it, bizarre StackOverflow questions, and no documentation
>>>>>> explaining a complete working example.
>>>>>>
>>>>>> == My requests ==
>>>>>> * Could people briefly share their working setup? It would be
>>>>>> good to know which directions are promising.
>>>>>> * It would be particularly helpful if someone could volunteer an
>>>>>> hour of their time to talk to me about their working
>>>>>> Beam/Flink/k8s setup. It's for a good cause (fixing the planet
>>>>>> :) ), and on my side I volunteer to write up the findings to
>>>>>> share with the community so others suffer less.
>>>>>>
>>>>>> == Appendix: My findings so far ==
>>>>>> There are multiple ways to deploy Flink on k8s:
>>>>>> - The GCP marketplace Flink operator
>>>>>> <https://cloud.google.com/blog/products/data-analytics/open-source-processing-engines-for-kubernetes>
>>>>>> (couldn't get it to work) and the respective CLI version
>>>>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator>
>>>>>> (buggy, but I got it working)
>>>>>> - https://github.com/lyft/flinkk8soperator (haven't tried)
>>>>>> - Flink's native k8s support
>>>>>> <https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html>
>>>>>> (super easy to get working)
>>>>>>
>>>>>> I confirmed that my Flink cluster was operational by running a
>>>>>> simple WordCount job, initiated from my machine. However, I
>>>>>> wasn't yet able to get Beam working:
>>>>>>
>>>>>> - With the Flink operator, I was able to submit a Beam job, but
>>>>>> hit the issue that I need Docker installed on my Flink nodes. I
>>>>>> haven't yet tried changing the operator's yaml files to add Docker
>>>>>> inside them.
>>>>>> I also haven't tried this
>>>>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
>>>>>> yet because it implies submitting jobs using "kubectl apply",
>>>>>> which is weird - why not just submit it through the Flink job
>>>>>> server?
>>>>>>
>>>>>> - With Flink's native k8s support, I tried two things:
>>>>>>   - Creating a fat portable jar using --output_executable_path.
>>>>>> The jar is huge (200+ MB) and takes forever to upload to my Flink
>>>>>> cluster - this is a non-starter. But if I actually upload it, then
>>>>>> I hit the same issue with lacking Docker. Haven't tried fixing it
>>>>>> yet.
>>>>>>   - Simply running my pipeline with --runner=FlinkRunner
>>>>>> --environment_type=DOCKER --flink_master=$PUBLIC_IP:8081. The
>>>>>> Java process appears to send 1+ GB of data to somewhere, but the
>>>>>> job never even starts.
>>>>>>
>>>>>> I looked at a few conference talks:
>>>>>> - https://www.cncf.io/wp-content/uploads/2020/02/CNCF-Webinar_-Apache-Flink-on-Kubernetes-Operator-1.pdf
>>>>>> - seems to imply that I need to add a Beam worker "sidecar" to
>>>>>> the Flink workers, and that I need to submit my job using
>>>>>> "kubectl apply".
>>>>>> - https://www.youtube.com/watch?v=8k1iezoc5Sc - also mentions the
>>>>>> sidecar, but also mentions the fat jar option
>>>>>>
>>>>>> --
>>>>>> Eugene Kirpichov
>>>>>> http://www.linkedin.com/in/eugenekirpichov

--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
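For reference, the two native-k8s submission routes Eugene tried boil down to flag sets along these lines (a sketch: the IP and jar path are placeholders, and with apache_beam installed either list would be handed to apache_beam's PipelineOptions):

```python
# Sketch of the pipeline flags discussed in the thread. The IP and jar path
# are placeholders; with apache_beam installed, either list would be passed
# to apache_beam.options.pipeline_options.PipelineOptions.

public_ip = "203.0.113.10"  # placeholder for the Flink cluster's public REST IP

# Route 1: submit directly against the Flink REST endpoint, workers in Docker.
direct_flags = [
    "--runner=FlinkRunner",
    f"--flink_master={public_ip}:8081",
    "--environment_type=DOCKER",
]

# Route 2: build a self-contained "fat" portable jar instead of submitting
# immediately, then upload and run it on the cluster separately.
fat_jar_flags = direct_flags + [
    "--output_executable_path=/tmp/my_pipeline_fat.jar",  # placeholder path
]

print(" ".join(direct_flags))
print(" ".join(fat_jar_flags))
```

Either way, the SDK harness still has to start on the taskmanagers, which is what the docker-in-docker setup above provides.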