Thank you, Kyle, for clarifying things for me. I've confirmed it works by simply sharing the artifact staging volume between the jobserver and taskmanager pods. This works fine with the docker-in-docker (dind) setup using the Docker environment.
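In case it helps anyone else, the change boils down to mounting one shared volume at /tmp/beam-artifact-staging in both pods. Below is a minimal sketch of what I mean; it's not the exact manifests from my repo, and the claim name, images, and storage details are illustrative:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: beam-artifact-staging
spec:
  # ReadWriteMany so both pods can mount it; this needs a storage class that
  # supports it (e.g. NFS), or swap in a hostPath volume on a single-node cluster.
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-beam-jobserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink-beam-jobserver
  template:
    metadata:
      labels:
        app: flink-beam-jobserver
    spec:
      containers:
        - name: jobserver
          image: apache/beam_flink1.10_job_server:2.24.0  # illustrative tag
          volumeMounts:
            - name: artifact-staging
              # where the job server stages artifacts (the path seen below in the thread)
              mountPath: /tmp/beam-artifact-staging
      volumes:
        - name: artifact-staging
          persistentVolumeClaim:
            claimName: beam-artifact-staging
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink-taskmanager
  template:
    metadata:
      labels:
        app: flink-taskmanager
    spec:
      containers:
        - name: taskmanager
          image: flink:1.10  # illustrative; my real image also runs dockerd for dind
          args: ["taskmanager"]
          volumeMounts:
            - name: artifact-staging
              # same path as the jobserver so staged artifacts resolve
              mountPath: /tmp/beam-artifact-staging
      volumes:
        - name: artifact-staging
          persistentVolumeClaim:
            claimName: beam-artifact-staging

As I understand it, the job server writes the staged artifacts into that directory and the taskmanager reads them back when serving them to the SDK harness, so both need to see the same files at the same path.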
Thanks again,
Sam

On Tue, Sep 22, 2020 at 4:37 PM Kyle Weaver <kcwea...@google.com> wrote:
>
> > It was my understanding that the client first uploads the artifacts to the jobserver and then the SDK harness will pull in these artifacts from the jobserver over a gRPC port.
>
> Not quite. The artifact endpoint is for communicating which artifacts are needed, and where to find them. But the SDK harness pulls the actual artifacts itself.
>
> > Do the jobserver and the taskmanager need to share the artifact staging volume?
>
> More precisely, the job server and the SDK harness need to share the artifact staging volume (which is why we generally recommend using a distributed filesystem for this purpose if possible).
>
> General note: there is never any direct communication between the job server and the SDK harness. Usually it goes Beam job server -> Flink job manager -> Flink task manager -> Beam SDK harness.
>
> On Tue, Sep 22, 2020 at 4:20 PM Sam Bourne <samb...@gmail.com> wrote:
>
>> It was my understanding that the client first uploads the artifacts to the jobserver and then the SDK harness will pull in these artifacts from the jobserver over a gRPC port.
>>
>> I see the artifacts on the jobserver while the job is attempting to run:
>>
>> root@flink-beam-jobserver-9fccb99b8-6mhtq:/tmp/beam-artifact-staging/3024e5d862fef831e830945b2d3e4e9511e0423bfb9c48de75aa2b3b67decce4
>>
>> Do the jobserver and the taskmanager need to share the artifact staging volume?
>>
>> On Tue, Sep 22, 2020 at 4:04 PM Kyle Weaver <kcwea...@google.com> wrote:
>>
>>> > rpc error: code = Unknown desc = ; failed to retrieve chunk for /tmp/staged/pickled_main_session
>>>
>>> Are you sure that's due to a networking issue, and not a problem with the filesystem / volume mounting?
>>>
>>> On Tue, Sep 22, 2020 at 3:55 PM Sam Bourne <samb...@gmail.com> wrote:
>>>
>>>>> I would not be surprised if there was something weird going on with Docker in Docker. The defaults mostly work fine when an external SDK harness is used [1].
>>>>>
>>>>> Can you provide more information on the exception you got? (I'm particularly interested in the line number).
>>>>
>>>> The actual error is a bit tricky to find, but if you monitor the docker logs from within the taskmanager pod you can find it failing when the SDK harness boot.go attempts to pull the artifacts from the artifact endpoint [1]:
>>>>
>>>> [1] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L139
>>>>
>>>> 2020/09/22 22:07:51 Initializing python harness: /opt/apache/beam/boot --id=1-1 --provision_endpoint=localhost:45775
>>>> 2020/09/22 22:07:59 Failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attempts: failed to retrieve chunk for /tmp/staged/pickled_main_session
>>>> caused by:
>>>> rpc error: code = Unknown desc = ; failed to retrieve chunk for /tmp/staged/pickled_main_session
>>>>
>>>> I can hit the jobserver fine from my taskmanager pod, as well as from within an SDK container I spin up manually (with --network host):
>>>>
>>>> root@flink-taskmanager-56c6bdb6dd-49md7:/opt/apache/beam# ping flink-beam-jobserver
>>>> PING flink-beam-jobserver.default.svc.cluster.local (10.103.16.129) 56(84) bytes of data.
>>>>
>>>> I don't see how this would work if the endpoint hostname is localhost. I'll explore how this is working in the flink-on-k8s-operator.
>>>>
>>>> Thanks for taking a look!
>>>> Sam
>>>>
>>>> On Tue, Sep 22, 2020 at 2:48 PM Kyle Weaver <kcwea...@google.com> wrote:
>>>>
>>>>> > The issue is that the jobserver does not provide the proper endpoints to the SDK harness when it submits the job to flink.
>>>>>
>>>>> I would not be surprised if there was something weird going on with Docker in Docker. The defaults mostly work fine when an external SDK harness is used [1].
>>>>>
>>>>> Can you provide more information on the exception you got? (I'm particularly interested in the line number).
>>>>>
>>>>> > The issue is that the jobserver does not provide the proper endpoints to the SDK harness when it submits the job to flink.
>>>>>
>>>>> More information about this failure mode would be helpful as well.
>>>>>
>>>>> [1] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_job_server.yaml
>>>>>
>>>>> On Tue, Sep 22, 2020 at 10:36 AM Sam Bourne <samb...@gmail.com> wrote:
>>>>>
>>>>>> Hello beam community!
>>>>>>
>>>>>> I'm looking for some help solving an issue running a beam job on flink using --environment_type DOCKER.
>>>>>>
>>>>>> I have a flink cluster running in kubernetes configured so the taskworkers can run docker [1]. I'm attempting to deploy a flink jobserver in the cluster. The issue is that the jobserver does not provide the proper endpoints to the SDK harness when it submits the job to flink. It typically provides something like localhost:34567 using the hostname the grpc server was bound to. There is a jobserver flag --job-host that will bind the grpc server to this provided hostname, but I cannot seem to get it to bind to the k8s jobservice Service name [2]. I've tried different flavors of FQDNs but haven't had any luck.
>>>>>>
>>>>>> [main] INFO org.apache.beam.runners.jobsubmission.JobServerDriver - ArtifactStagingService started on flink-beam-jobserver:8098
>>>>>> [main] WARN org.apache.beam.runners.jobsubmission.JobServerDriver - Exception during job server creation
>>>>>> java.io.IOException: Failed to bind
>>>>>> ...
>>>>>>
>>>>>> Does anyone have some experience with this that could help provide some guidance?
>>>>>>
>>>>>> Cheers,
>>>>>> Sam
>>>>>>
>>>>>> [1] https://github.com/sambvfx/beam-flink-k8s
>>>>>> [2] https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/beam-flink-jobserver.yaml#L29