Thanks Fokko, will take a look.

On Thu, Feb 7, 2019 at 12:08 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
> Hi Tao,
>
> For Dataproc, which is the managed Hadoop of GCP, I implemented a
> method a while ago. It checks whether the Python file is local, and if that
> is the case, it is uploaded to the temporary bucket that is provided
> with the cluster:
> https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py#L1270-L1277
>
> This makes it easy to package the Spark jobs together with the DAGs
> in a single Git repository:
>
> dir_path = os.path.dirname(os.path.realpath(__file__))
> DataProcPySparkOperator(
>     main=dir_path + 'my_pyspark_job.py'
> )
>
> Hope this helps.
>
> Cheers, Fokko
>
> On Thu, Feb 7, 2019 at 08:27, Tao Feng <fengta...@gmail.com> wrote:
>
> > Hi,
> >
> > I wonder if there are any suggestions on how to use the SparkOperator to
> > send a PySpark file to the Spark cluster, and how to specify the PySpark
> > dependencies.
> >
> > We currently push the user's PySpark file and its dependencies to an S3
> > location, where they get picked up by our Spark cluster. We would like to
> > explore whether there are suggestions on how to improve this workflow.
> >
> > Thanks,
> > -Tao
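
For reference, here is a minimal sketch of the pattern Fokko describes, kept next to the DAG file in the same repository. It uses os.path.join so the directory and script name are combined with a proper separator (plain string concatenation as quoted above would drop the '/'). The DAG id, task id, cluster name, and file names are illustrative placeholders, and passing extra dependencies via pyfiles is an assumption based on the contrib operator's signature, not something confirmed in this thread.

import os
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

# Resolve the directory holding this DAG file, so the PySpark script can
# live alongside the DAG in the same Git repository.
dir_path = os.path.dirname(os.path.realpath(__file__))

with DAG(
    dag_id='pyspark_on_dataproc',        # illustrative DAG id
    start_date=datetime(2019, 2, 1),
    schedule_interval=None,
) as dag:
    submit_pyspark = DataProcPySparkOperator(
        task_id='submit_my_pyspark_job',  # illustrative task id
        # os.path.join avoids losing the '/' between directory and filename.
        main=os.path.join(dir_path, 'my_pyspark_job.py'),
        cluster_name='my-dataproc-cluster',  # assumed existing Dataproc cluster
        # Additional local .py/.zip/.egg dependencies can be listed in pyfiles,
        # which the operator uploads in the same way (assumption; 'deps.zip'
        # is a placeholder).
        pyfiles=[os.path.join(dir_path, 'deps.zip')],
    )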