Thanks Kyle,

On Fri, Sep 24, 2021 at 1:48 PM Kyle Weaver <kcwea...@google.com> wrote:

> Hi Mark. Looks like a problem with artifact staging. PortableRunner
> implicitly requires a directory (configurable with --artifacts_dir, under
> /tmp by default) that is accessible by both the job server and Beam worker.
>

Hmmm, I guess I could create an NFS share between the machines and use that
for the artifacts_dir. But if I use environment_type=DOCKER, the docker
image won't have access to it. Is there an easy way to modify the docker
command that the worker runs when it starts the container, so that it maps
this directory (via '-v') into the container?
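
For reference, this is roughly the invocation I'm after (a sketch based on
the docker command from the timeout error below; the host path
/mnt/beam-artifacts is just a placeholder for wherever the NFS share would
be mounted):

```shell
# Hypothetical: the worker's docker command with an added -v flag mapping a
# shared artifacts directory into the container's staging path.
docker run -d --network=host \
  -v /mnt/beam-artifacts:/tmp/beam-artifact-staging \
  apache/beam_python3.8_sdk:2.32.0 \
  --id=4-1 \
  --provision_endpoint=localhost:46757
```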


> You should be able to get around this by using --runner SparkRunner
> instead:
>
> python -m apache_beam.examples.wordcount
>  gs://datapipeline-output/shakespeare-alls-11.txt --output
> gs://datapipeline-output/output/   --project august-ascent-325423
> --environment_type=DOCKER --runner SparkRunner
> --spark_rest_url http://hostname:6066
>
> This requires you to enable REST on your Spark master by putting
> `spark.master.rest.enabled` in your config, and then setting the Beam
> pipeline option --spark_rest_url to use its address (6066 is the default
> port).
>

I'll try that next while waiting for the answer above.
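
For anyone following along, here's roughly what that setup looks like (a
sketch; it assumes a default standalone Spark install with SPARK_HOME set,
and hostname standing in for the actual master host):

```shell
# Enable the Spark master's REST submission server, then restart the master
# so the setting takes effect. The REST endpoint listens on port 6066 by
# default.
echo "spark.master.rest.enabled true" >> "$SPARK_HOME/conf/spark-defaults.conf"
"$SPARK_HOME/sbin/stop-master.sh"
"$SPARK_HOME/sbin/start-master.sh"

# Then point Beam at it with: --runner SparkRunner --spark_rest_url http://hostname:6066
```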

Thanks
     Mark

>
> This starts the job server for you, so you don't need to do that ahead of
> time.
>
> Best,
> Kyle
>
> On Sun, Sep 19, 2021 at 12:22 PM Mark Striebeck <mark.strieb...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am trying to run beam on a small spark cluster. I setup spark (master
>> plus one slave). I am using the portable runner and invoke the beam
>> pipeline with:
>>
>> python -m apache_beam.examples.wordcount
>>  gs://datapipeline-output/shakespeare-alls-11.txt --output
>> gs://datapipeline-output/output/   --project august-ascent-325423 --runner
>> PortableRunner --job_endpoint=localhost:8099 --environment_type=DOCKER
>>
>> I always get an error:
>> Caused by: java.util.concurrent.TimeoutException: Timed out while waiting
>> for command 'docker run -d --network=host --env=DOCKER_MAC_CONTAINER=null
>> apache/beam_python3.8_sdk:2.32.0 --id=4-1
>> --provision_endpoint=localhost:46757'
>>
>> It takes ~2.5 minutes to pull the beam image which should be enough. But
>> I pulled the image manually (docker pull apache/beam_python3.8_sdk:2.32.0)
>> and then tried to run the pipeline again.
>>
>> Now, when I run the pipeline I get an error:
>> java.io.FileNotFoundException:
>> /tmp/beam-artifact-staging/60321f712323c195764ab31b3e205b228a405fbb80b50fafa67b38b21959c63f/1-ref_Environment_default_e-pickled_main_session
>> (No such file or directory)
>>
>> and then further down
>>
>> ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException:
>> java.lang.IllegalStateException: No container running for id
>> 7014a9ea98dc0b3f453a9d3860aff43ba42214195d2240d7cefcefcfabf93879
>>
>> (here is the full stack trace:
>> https://drive.google.com/file/d/1mRzt8G7I9Akkya48KfAbrqPp8wRCzXDe/view)
>>
>> Any pointer or idea is appreciated (sorry, if this is something obvious -
>> I'm still pretty new to beam/spark).
>>
>> Thanks
>>       Mark
>>
>