Thanks Fokko, will take a look.

On Thu, Feb 7, 2019 at 12:08 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
> Hi Tao,
>
> For Dataproc, which is the managed Hadoop of GCP, I implemented a
> method a while ago. It checks whether the Python file is local, and if that
> is the case, it is uploaded to the temporary bucket that is provided
> with the cluster:
> https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py#L1270-L1277
>
> This makes it easy to package the Spark jobs together with the DAGs
> in a single Git repository:
>
> dir_path = os.path.dirname(os.path.realpath(__file__))
> DataProcPySparkOperator(
>     main=dir_path + 'my_pyspark_job.py'
> )
>
> Hope this helps.
>
> Cheers, Fokko
>
> On Thu, Feb 7, 2019 at 08:27, Tao Feng <fengta...@gmail.com> wrote:
>
> > Hi,
> >
> > I wonder if there are any suggestions on how to use the SparkOperator to
> > send a PySpark file to the Spark cluster, and how to specify the PySpark
> > dependencies.
> >
> > We currently push the user's PySpark file and its dependencies to an S3
> > location, where they get picked up by our Spark cluster. We would like to
> > explore whether there are suggestions on how to improve this workflow.
> >
> > Thanks,
> > -Tao
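
For reference, here is a minimal sketch of the pattern Fokko describes, kept next to the DAG file in the same repository. It uses os.path.join so the directory and script name are combined with a proper separator (plain string concatenation as quoted above would drop the '/'). The DAG id, task id, cluster name, and file names are illustrative placeholders, and passing extra dependencies via pyfiles is an assumption based on the contrib operator's signature, not something confirmed in this thread.

import os
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

# Resolve the directory holding this DAG file, so the PySpark script can
# live alongside the DAG in the same Git repository.
dir_path = os.path.dirname(os.path.realpath(__file__))

with DAG(
    dag_id='pyspark_on_dataproc',        # illustrative DAG id
    start_date=datetime(2019, 2, 1),
    schedule_interval=None,
) as dag:
    submit_pyspark = DataProcPySparkOperator(
        task_id='submit_my_pyspark_job',  # illustrative task id
        # os.path.join avoids losing the '/' between directory and filename.
        main=os.path.join(dir_path, 'my_pyspark_job.py'),
        cluster_name='my-dataproc-cluster',  # assumed existing Dataproc cluster
        # Additional local .py/.zip/.egg dependencies can be listed in pyfiles,
        # which the operator uploads in the same way (assumption; 'deps.zip'
        # is a placeholder).
        pyfiles=[os.path.join(dir_path, 'deps.zip')],
    )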