Hi Matt, during the time we were using Spark with Beam, the solution was
always to package the jar and use the spark-submit command pointing to your
main class, which is what calls `pipeline.run()`.
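
To be concrete, a minimal sketch of such a main class might look like the
following (the class name is just a placeholder, and you would plug in your
own transforms):

  import org.apache.beam.runners.spark.SparkPipelineOptions;
  import org.apache.beam.runners.spark.SparkRunner;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class MyBeamMain {  // placeholder name, use your own entry point
    public static void main(String[] args) {
      SparkPipelineOptions options = PipelineOptionsFactory
          .fromArgs(args).withValidation().as(SparkPipelineOptions.class);
      options.setRunner(SparkRunner.class);

      Pipeline pipeline = Pipeline.create(options);
      // ... build your PCollections / transforms here ...

      // run() hands the pipeline to the SparkRunner; when the jar is launched
      // through spark-submit, the runner picks up the Spark configuration
      // (master, deploy mode, ...) that spark-submit provides.
      pipeline.run().waitUntilFinish();
    }
  }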

The spark-submit command has a flag (--deploy-mode) that decides where the
driver runs: either on the machine you submit from (client) or on one of the
machines in the cluster (cluster).
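
For example (jar and class names are placeholders):

  spark-submit \
    --class com.example.MyBeamMain \
    --master spark://your-spark-master:7077 \
    --deploy-mode cluster \
    your-pipeline-bundled.jar

With --deploy-mode client the driver (and thus `pipeline.run`) runs on the
machine you submit from; with cluster it runs on one of the cluster nodes.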


JC


On Thu, Jan 17, 2019 at 10:00 AM Matt Casters <mattcast...@gmail.com> wrote:

> Dear Beam friends,
>
> Now that I've got cool data integration (Kettle-beam) scenarios running on
> DataFlow with sample data sets in Google (Files, Pub/Sub, BigQuery,
> Streaming, Windowing, ...) I thought it was time to also give Apache Spark
> some attention.
>
> The thing I have some trouble with is figuring out what the relationship
> is between the runner (SparkRunner), Pipeline.run() and spark-submit (or
> SparkLauncher).
>
> The samples I'm seeing mostly involve packaging up a jar file and then
> doing a spark-submit.  That obviously makes it unclear if Pipeline.run()
> should be used at all and how Metrics should be obtained from a Spark job
> during execution or after completion.
>
> I really like the way the GCP DataFlow implementation automatically
> deploys jar file binaries and from what I can
> determine org.apache.spark.launcher.SparkLauncher offers this functionality
> so perhaps I'm either doing something wrong or I'm reading the docs wrong
> or the wrong docs.
> The thing is, if you try running your pipelines against a Spark master
> feedback is really minimal putting you in a trial & error situation pretty
> quickly.
>
> So thanks again in advance for any help!
>
> Cheers,
>
> Matt
> ---
> Matt Casters <mattcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>

-- 

JC
