Hello,

  We are using Beam with the Python SDK, and we want to deploy our pipeline to Spark on a YARN cluster.

  But we found that only one Spark worker is started, and the Beam SDK Docker container runs only on that worker.

  - The Beam pipeline options printed at runtime:

{key:"beam:option:spark_master:v1" value:{string_value:"local[4]"}}  

{key:"beam:option:direct_runner_use_stacked_bundle:v1" value:{bool_value:true}}


  Our steps: 

Firstly, we generate the jar file from the Python module:

python3.7 -m beam_demo \
--spark_version=3 \
--runner=SparkRunner \
--spark_job_server_jar=/home/hadoop/beam-runners-spark-3-job-server-2.52.0.jar \
--sdk_container_image=apache/beam_python3.7_sdk_boto3:2.45.0 \
--sdk_harness_container_image=apache/beam_python3.7_sdk_boto3:2.45.0 \
--worker_harness_container_image=apache/beam_python3.7_sdk_boto3:2.45.0 \
--environment_type=DOCKER --environment_config=apache/beam_python3.7_sdk_boto3:2.45.0 \
--sdk_location=container \
--num_workers=2 \
--output_executable_path=jars/beam_demo.jar
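
For context, beam_demo is roughly structured as below (a simplified sketch: the transforms are placeholders, not our real pipeline; the point is that the command-line flags above are forwarded to PipelineOptions unchanged):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    # All flags from the command above (--runner, --environment_type,
    # --num_workers, ...) end up here and are parsed by PipelineOptions.
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["hello", "beam"])
         | "Print" >> beam.Map(print))


if __name__ == "__main__":
    run()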


  Secondly, we submit the jar to Spark:

spark-submit --master yarn --deploy-mode cluster jars/beam_demo.jar


  

Our questions are:

1. We configured SparkRunner. Why does the printed option suggest the direct runner (direct_runner_use_stacked_bundle)?

2. We submitted the job to Spark on YARN. Why is the printed spark_master local[4]?

3. Can Python Beam on a Spark YARN cluster start multiple workers?


Could you please help us check these questions? Maybe we are missing something. Thank you very much for your help.

  



 

