Sam Bourne’s repo has a lot of tricks for using Flink with Docker containers: https://github.com/sambvfx/beam-flink-k8s. Feel free to make a PR if you find anything has changed.
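Also, if everything (the Flink cluster and the submitting process) runs on one machine, one way around the Docker filesystem mismatch Kyle mentions below is the LOOPBACK environment, where the SDK workers run inside the process that submits the job and therefore see the host filesystem. Untested against your setup, but roughly (same --input/--output as in your commands):

python mas_zb_demo_marc_author_count.py \
    --input <your input path> --output <your output path> \
    --runner=FlinkRunner --flink_version=1.10 \
    --flink_master=localhost:8081 \
    --environment_type=LOOPBACK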
-Chad

On Mon, Dec 28, 2020 at 9:38 AM Kyle Weaver <[email protected]> wrote:

> Using Docker workers along with the local filesystem I/O is not
> recommended, because the Docker workers will use their own filesystems
> instead of the host filesystem. See
> https://issues.apache.org/jira/browse/BEAM-5440
>
> On Sun, Dec 27, 2020 at 5:01 AM Günter Hipler <[email protected]> wrote:
>
>> Hi,
>>
>> I just tried to start a Beam pipeline on a Flink cluster using
>>
>> - the latest published Beam version, 2.26.0
>> - the Python SDK
>> - a standalone Flink cluster, version 1.10.1
>> - the simple pipeline in [1]
>>
>> When I start the pipeline in embedded mode it works correctly (even
>> pulling a job server docker image):
>>
>> python mas_zb_demo_marc_author_count.py --input
>> /swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/input/job8r1A069.format.xml
>> --output
>> /swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/output/out.txt
>> --runner=FlinkRunner --flink_version=1.10
>>
>> <logs>
>> WARNING:root:Make sure that locally built Python SDK docker image has
>> Python 3.7 interpreter.
>> INFO:root:Using Python SDK docker image:
>> apache/beam_python3.7_sdk:2.26.0. If the image is not available at
>> local, we will try to pull from hub.docker.com
>> </logs>
>>
>> I'm using Python version
>>
>> python --version
>> Python 3.7.9
>>
>> Trying to use the remote standalone cluster, the job fails with a
>> timeout while the taskmanager fetches the SDK worker docker image:
>>
>> python mas_zb_demo_marc_author_count.py --input
>> /swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/input/job8r1A069.format.xml
>> --output
>> /swissbib_index/solrDocumentProcessing/FrequentInitialPreProcessing/data/beam/output/out.txt
>> --runner=FlinkRunner --flink_version=1.10 --flink_master=localhost:8081
>>
>> <logs>
>> java.lang.Exception: The user defined 'open()' method caused an
>> exception: java.util.concurrent.TimeoutException: Timed out while
>> waiting for command 'docker run -d --network=host
>> --env=DOCKER_MAC_CONTAINER=null apache/beam_python3.8_sdk:2.26.0
>> --id=1-1 --provision_endpoint=localhost:41483'
>>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:499)
>>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>     at java.base/java.lang.Thread.run(Thread.java:834)
>> </logs>
>>
>> Then I pulled the apache/beam_python3.8_sdk:2.26.0 image locally to
>> avoid the timeout, which was successful: the remote job finished and
>> its tasks were shown in the Flink dashboard.
>>
>> But no result was written into the given --output dir, although I
>> couldn't find anything referencing this problem in the Flink logs.
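>>
>> For reference, the write step in [1] boils down to a plain text sink,
>> roughly this (names shortened):
>>
>>   counts | 'write' >> beam.io.WriteToText(known_args.output)
>>
>> so I would have expected at least sharded files like
>> out.txt-00000-of-00001 to appear on the host.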
>>
>> Additionally, the python shell process which sends the script to the
>> cluster prints a huge amount of logs [2], but I can't see any reason
>> for this behaviour.
>>
>> Thanks for any explanations,
>>
>> Günter
>>
>> [1]
>> https://gitlab.com/swissbib/lab/services/jupyter-beam-flink/-/blob/master/mas_zb_demo_marc_author_count.py
>> (grep author data from bibliographic library data)
>>
>> [2]
>> https://gitlab.com/swissbib/lab/services/jupyter-beam-flink/-/blob/master/notes/logging.no.output.txt
