Thanks Yu for the help and the tips.

I ran the following steps, but my job is stuck and never gets submitted to
Dataproc, and the job server keeps logging this message:

Still waiting for startup of environment from localhost:50000 for worker id
1-1

---------------------------------------------------------------------------------------------------------
*Beam code:*
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions([
            "--runner=PortableRunner",
            "--job_endpoint=localhost:8099",
            "--environment_type=EXTERNAL",
            "--environment_config=localhost:50000"
        ])
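
The options above then feed into the pipeline in the usual way; roughly like
this (the Create/Map transforms here are just placeholders, not my actual job):

import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | "Create" >> beam.Create([1, 2, 3])
     | "Double" >> beam.Map(lambda x: x * 2))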
---------------------------------------------------------------------------------------------------------
*Job Server:*
I couldn't use Docker because host networking doesn't work on macOS, so I
used Gradle instead:

./gradlew :runners:spark:3:job-server:runShadow
---------------------------------------------------------------------------------------------------------
*Beam Worker Pool:*
docker run -p=50000:50000 apache/beam_python3.7_sdk --worker_pool
---------------------------------------------------------------------------------------------------------
*SSH tunnel to the master node:*
gcloud compute ssh <my-master-node-m> \
    --project <my-gcp-project> \
    --zone <my-zone>  \
    -- -NL 7077:localhost:7077
---------------------------------------------------------------------------------------------------------

Thanks,
Mahan

On Tue, Aug 10, 2021 at 3:53 PM Yu Watanabe <yu.w.ten...@gmail.com> wrote:

> Hello.
>
> Would this page help? I hope it does.
>
> https://beam.apache.org/documentation/runners/spark/
>
> > Running on a pre-deployed Spark cluster
>
> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
> 7077 the master url port?
> * Yes.
>
> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
> * The job server needs to be able to reach the Spark master node on port
> 7077, so I believe the answer is yes.
>
> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
> Harness Configuration?
> * This is the configuration for how you want your SDK harness container
> to spin up.
>
> https://beam.apache.org/documentation/runtime/sdk-harness-config/
>
> For DOCKER, you will need Docker deployed on all Spark worker nodes.
> > User code is executed within a container started on each worker node
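>
> If you go the DOCKER route, the pipeline options would look roughly like
> this (the image and tag below are only an example; use whichever SDK
> container matches your Beam version):
>
> pipeline_options = PipelineOptions([
>             "--runner=PortableRunner",
>             "--job_endpoint=localhost:8099",
>             "--environment_type=DOCKER",
>             "--environment_config=apache/beam_python3.7_sdk:<beam-version>"
>         ])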
>
> I used EXTERNAL when I did this with a Flink cluster before.
>
> e.g.
>
> https://github.com/yuwtennis/apache-beam/blob/master/flink-session-cluster/docker/samples/src/sample.py#L14
>
> 4- Should we run the job-server outside of the Dataproc cluster or should
> we run it in the master node?
> * It depends. It could be inside or outside the master node, but if you
> are connecting to a fully managed service, then outside might be better.
>
> https://beam.apache.org/documentation/runners/spark/
>
> > Start JobService that will connect with the Spark master
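>
> On that page the job server command looks something like the line below
> (double-check the exact syntax there; spark://localhost:7077 assumes the
> Dataproc master port is tunneled to your machine):
>
> ./gradlew :runners:spark:3:job-server:runShadow -PsparkMasterUrl=spark://localhost:7077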
>
> Thanks,
> Yu
>
> On Tue, Aug 10, 2021 at 7:53 PM Mahan Hosseinzadeh <mahan.h...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a Python Beam job that works on Dataflow, but we would like to
>> submit it to a Spark cluster on Dataproc with no Flink involvement.
>> I have already spent days trying, without success, to figure out how to
>> use the PortableRunner with the beam_spark_job_server to submit my Python
>> Beam job to Spark on Dataproc. The Beam docs mostly cover Flink, and there
>> is no guideline for Spark with Dataproc.
>> Some relevant questions might be:
>> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
>> 7077 the master url port?
>> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
>> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
>> Harness Configuration?
>> 4- Should we run the job-server outside of the Dataproc cluster or should
>> we run it in the master node?
>>
>> Thanks,
>> Mahan
>>
>
>
> --
> Yu Watanabe
>
> linkedin: www.linkedin.com/in/yuwatanabe1/
> twitter:   twitter.com/yuwtennis
>
>
