Hi Tao,

For Dataproc, GCP's managed Hadoop service, I implemented a method a while
ago that checks whether the Python file is local; if it is, the file is
uploaded to the temporary bucket that comes with the cluster:
https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py#L1270-L1277
This makes it easy to package the Spark jobs together with the DAGs in a
single Git repository:

import os

dir_path = os.path.dirname(os.path.realpath(__file__))
DataProcPySparkOperator(
  main=os.path.join(dir_path, 'my_pyspark_job.py')
)
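For the dependency question, a common pattern is to ship a zip of the shared
modules along with the script and hand it to the operator via pyfiles. Below
is a minimal sketch of how that could look in a DAG; the DAG id, cluster
name, region, file names and the gs:// path for the dependency zip are
placeholders, and whether pyfiles accepts a local path the same way main
does is something to verify against the operator code.

import os
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

# directory of the DAG file, so the job script can live in the same repo
dir_path = os.path.dirname(os.path.realpath(__file__))

with DAG(
    dag_id='pyspark_on_dataproc',        # placeholder DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
) as dag:
    run_job = DataProcPySparkOperator(
        task_id='run_my_pyspark_job',
        # local path next to the DAG file; uploaded to the cluster's
        # temporary bucket by the operator
        main=os.path.join(dir_path, 'my_pyspark_job.py'),
        # shared modules packaged as a zip; placeholder GCS path
        pyfiles=['gs://my-bucket/deps/my_deps.zip'],
        cluster_name='my-dataproc-cluster',  # placeholder cluster name
        region='europe-west4',               # placeholder region
    )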

Hope this helps.

Cheers, Fokko



On Thu, Feb 7, 2019 at 08:27 Tao Feng <fengta...@gmail.com> wrote:

> Hi,
>
> I wonder whether there are any suggestions on how to use the SparkOperator
> to send a PySpark file to the Spark cluster, and on how to specify the
> PySpark dependencies?
>
> We currently push the user's PySpark file and dependencies to an S3
> location, where they get picked up by our Spark cluster. We would like to
> explore whether there are suggestions on how to improve this workflow.
>
> Thanks,
> -Tao
>
