Hi,

I have a Python Beam job that works on Dataflow, but we would like to submit
it to a Spark Dataproc cluster, with no Flink involvement.
I have already spent days trying, but have not managed to figure out how to use
the PortableRunner with the beam_spark_job_server to submit my Python Beam job
to Spark on Dataproc. The Beam portability docs mostly cover Flink, and there is
no guidance on Spark with Dataproc. A sketch of what I have been trying is
included after the questions below.
Some relevant questions might be:
1- What should --spark-master-url be for a remote Dataproc cluster? Is 7077
the right port for the master URL?
2- Should we SSH-tunnel to the Spark master port using gcloud compute ssh?
3- What should environment_type be? Can we use DOCKER, and if so, what is the
SDK harness configuration?
4- Should we run the job server outside the Dataproc cluster, or on the
master node?
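
For context, here is roughly what I have been trying. The job endpoint, master
hostname, port, and SDK image below are guesses/placeholders on my side, not a
known-good setup:

# Rough sketch of my attempt; endpoints, ports, and the SDK image are placeholders.
#
# Job server started separately with something like:
#   docker run --net=host apache/beam_spark_job_server:latest \
#       --spark-master-url=spark://<dataproc-master-host>:7077
#
# (Related to question 2: possibly behind an SSH tunnel, e.g.
#   gcloud compute ssh <cluster-name>-m --zone=<zone> -- -L 7077:localhost:7077 )

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",   # default job port of beam_spark_job_server
    "--environment_type=DOCKER",       # question 3: is DOCKER correct here?
    "--environment_config=apache/beam_python3.7_sdk:2.25.0",  # placeholder SDK image
])

# Trivial pipeline just to test submission to the Spark cluster.
with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "c"])
     | "Print" >> beam.Map(print))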

Thanks,
Mahan
