> I would not be surprised if there was something weird going on with Docker
> in Docker. The defaults mostly work fine when an external SDK harness is
> used [1].
>
> Can you provide more information on the exception you got? (I'm
> particularly interested in the line number).
>
The actual error is a bit tricky to find, but if you monitor the docker logs
from within the taskmanager pod you can see it failing when the SDK
harness's boot.go attempts to pull the artifacts from the artifact
endpoint [1]:
[1]
https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L139

2020/09/22 22:07:51 Initializing python harness: /opt/apache/beam/boot
--id=1-1 --provision_endpoint=localhost:45775
2020/09/22 22:07:59 Failed to retrieve staged files: failed to
retrieve /tmp/staged in 3 attempts: failed to retrieve chunk for
/tmp/staged/pickled_main_session
    caused by:
rpc error: code = Unknown desc = ; failed to retrieve chunk for
/tmp/staged/pickled_main_session

I can hit the jobserver fine from my taskmanager pod, as well as from
within an SDK container I spin up manually (with --network host):

root@flink-taskmanager-56c6bdb6dd-49md7:/opt/apache/beam# ping
flink-beam-jobserver
PING flink-beam-jobserver.default.svc.cluster.local (10.103.16.129)
56(84) bytes of data.

I don’t see how this would work if the endpoint hostname is localhost. I’ll
explore how this is working in the flink-on-k8s-operator.

Thanks for taking a look!
Sam

On Tue, Sep 22, 2020 at 2:48 PM Kyle Weaver <kcwea...@google.com> wrote:

> > The issue is that the jobserver does not provide the proper endpoints to
> the SDK harness when it submits the job to flink.
>
> I would not be surprised if there was something weird going on with Docker
> in Docker. The defaults mostly work fine when an external SDK harness is
> used [1].
>
> Can you provide more information on the exception you got? (I'm
> particularly interested in the line number).
>
> > The issue is that the jobserver does not provide the proper endpoints to
> the SDK harness when it submits the job to flink.
>
> More information about this failure mode would be helpful as well.
>
> [1]
> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_job_server.yaml
>
>
> On Tue, Sep 22, 2020 at 10:36 AM Sam Bourne <samb...@gmail.com> wrote:
>
>> Hello beam community!
>>
>> I’m looking for some help solving an issue running a beam job on flink
>> using --environment_type DOCKER.
>>
>> I have a flink cluster running in kubernetes configured so the
>> taskmanagers can run docker [1]. I’m attempting to deploy a flink jobserver
>> in the cluster. The issue is that the jobserver does not provide the proper
>> endpoints to the SDK harness when it submits the job to flink. It typically
>> provides something like localhost:34567 using the hostname the grpc
>> server was bound to. There is a jobserver flag --job-host that will bind
>> the grpc server to this provided hostname, but I cannot seem to get it to
>> bind to the k8s jobservice Service name [2]. I’ve tried different flavors
>> of FQDNs but haven’t had any luck.
>>
>> [main] INFO org.apache.beam.runners.jobsubmission.JobServerDriver - 
>> ArtifactStagingService started on flink-beam-jobserver:8098
>> [main] WARN org.apache.beam.runners.jobsubmission.JobServerDriver - 
>> Exception during job server creation
>> java.io.IOException: Failed to bind
>> ...
>>
>> Does anyone have some experience with this that could help provide some
>> guidance?
>>
>> Cheers,
>> Sam
>>
>> [1] https://github.com/sambvfx/beam-flink-k8s
>> [2]
>> https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/beam-flink-jobserver.yaml#L29
>>
>
