Hello,
We are using Beam with the Python SDK, and we want to deploy to Spark on a YARN cluster.
But we found that only one Spark worker is started, and the Beam SDK Docker container runs only on that worker.
- The Beam pipeline options printed:
{key:"beam:option:spark_master:v1" value:{string_value:"local[4]"}}
{key:"beam:option:direct_runner_use_stacked_bundle:v1" value:{bool_value:true}}
Our steps:
Firstly, we generate the jar file from the Python pipeline:

python3.7 -m beam_demo \
  --spark_version=3 \
  --runner=SparkRunner \
  --spark_job_server_jar=/home/hadoop/beam-runners-spark-3-job-server-2.52.0.jar \
  --sdk_container_image=apache/beam_python3.7_sdk_boto3:2.45.0 \
  --sdk_harness_container_image=apache/beam_python3.7_sdk_boto3:2.45.0 \
  --worker_harness_container_image=apache/beam_python3.7_sdk_boto3:2.45.0 \
  --environment_type=DOCKER \
  --environment_config=apache/beam_python3.7_sdk_boto3:2.45.0 \
  --sdk_location=container \
  --num_workers=2 \
  --output_executable_path=jars/beam_demo.jar

Secondly, we submit it to Spark:

spark-submit --master yarn --deploy-mode cluster jars/beam_demo.jar
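For context, here is a minimal sketch of what our beam_demo module roughly looks like (the transforms are placeholders and the real pipeline logic is omitted); all the runner and environment flags above are consumed through PipelineOptions:

# beam_demo.py - minimal placeholder pipeline, invoked as `python3.7 -m beam_demo ...`
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    # The --runner / --spark_* / --environment_* flags from the command line
    # above are parsed into this PipelineOptions object.
    options = PipelineOptions(argv)

    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create(['hello', 'beam'])
         | 'Print' >> beam.Map(print))


if __name__ == '__main__':
    run()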
Our questions are:
1. We configured SparkRunner. Why does the printed option show direct_runner_use_stacked_bundle?
2. We submitted the job to Spark on YARN. Why is the printed spark_master local[4]?
3. Can Python Beam on a Spark YARN cluster start multiple workers?
Could you please help check these questions? Maybe we are missing something. Thank you very much for your help.