Hi Matt, during the time we were using Spark with Beam, the solution was always to package the pipeline into a jar and use the spark-submit command, pointing it at your main class, which then calls `pipeline.run`.
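Roughly, that main class is just an ordinary Beam Java main that builds the pipeline and runs it. A minimal sketch, assuming the Beam Java SDK and the Spark runner dependency are in your bundled jar (the package and class names are placeholders, not anything from your project):

```java
package com.example; // hypothetical package/class names

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MyBeamMain {
  public static void main(String[] args) {
    // Parse --runner=SparkRunner and friends from the command-line args.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... build your transforms here ...

    // When the jar is launched through spark-submit, this runs inside the Spark driver.
    PipelineResult result = pipeline.run();
    result.waitUntilFinish();
    // result.metrics() can be queried here if you want metrics after completion.
  }
}
```

So Pipeline.run() is still used; spark-submit is only the delivery mechanism that gets the jar and the main class onto a Spark driver.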
The spark-submit command has a flag (--deploy-mode) that decides how the job is launched: whether the driver runs on the machine you submit from (client) or on one of the machines in the cluster (cluster). There's a SparkLauncher sketch at the bottom of this mail, below the quote, if you want the programmatic equivalent.

JC

On Thu, Jan 17, 2019 at 10:00 AM Matt Casters <mattcast...@gmail.com> wrote:

> Dear Beam friends,
>
> Now that I've got cool data integration (Kettle-beam) scenarios running on
> DataFlow with sample data sets in Google (Files, Pub/Sub, BigQuery,
> Streaming, Windowing, ...) I thought it was time to also give Apache Spark
> some attention.
>
> The thing I have some trouble with is figuring out what the relationship
> is between the runner (SparkRunner), Pipeline.run() and spark-submit (or
> SparkLauncher).
>
> The samples I'm seeing mostly involve packaging up a jar file and then
> doing a spark-submit. That obviously makes it unclear if Pipeline.run()
> should be used at all and how Metrics should be obtained from a Spark job
> during execution or after completion.
>
> I really like the way the GCP DataFlow implementation automatically
> deploys jar file binaries, and from what I can determine
> org.apache.spark.launcher.SparkLauncher offers this functionality, so
> perhaps I'm either doing something wrong or I'm reading the docs wrong or
> the wrong docs.
> The thing is, if you try running your pipelines against a Spark master,
> feedback is really minimal, putting you in a trial & error situation
> pretty quickly.
>
> So thanks again in advance for any help!
>
> Cheers,
>
> Matt
> ---
> Matt Casters <mattcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>

--
JC
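PS: if you'd rather not shell out to spark-submit, the org.apache.spark.launcher.SparkLauncher you mention does essentially the same job from Java. A rough sketch (the jar path, main class and master URL are placeholders, and it assumes SPARK_HOME is set on the machine that runs the launcher):

```java
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchBeamOnSpark {
  public static void main(String[] args) throws Exception {
    // Programmatic equivalent of something like:
    //   spark-submit --master spark://... --deploy-mode cluster \
    //     --class com.example.MyBeamMain my-beam-pipeline-bundled.jar --runner=SparkRunner
    SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/my-beam-pipeline-bundled.jar") // placeholder fat jar
        .setMainClass("com.example.MyBeamMain")                  // the class that calls pipeline.run()
        .setMaster("spark://your-spark-master:7077")             // placeholder master URL
        .setDeployMode("cluster") // "client" keeps the driver on the launching machine
        .addAppArgs("--runner=SparkRunner")
        .startApplication();

    // Wait for the Spark application to reach a terminal state.
    while (!handle.getState().isFinal()) {
      Thread.sleep(1000);
    }
    System.out.println("Spark application finished in state: " + handle.getState());
  }
}
```

Note that the launcher handle only gives you the application's state and id; Beam metrics still come out of the PipelineResult inside your main class (or out of Spark's own UI), not out of SparkLauncher.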