Dear Beam friends,

Now that I've got cool data integration (Kettle-beam) scenarios running on Dataflow with sample data sets in Google (Files, Pub/Sub, BigQuery, Streaming, Windowing, ...) I thought it was time to also give Apache Spark some attention.
The thing I'm having some trouble with is figuring out the relationship between the runner (SparkRunner), Pipeline.run(), and spark-submit (or SparkLauncher). The samples I'm seeing mostly involve packaging up a jar file and then doing a spark-submit. That makes it unclear whether Pipeline.run() should be used at all, and how Metrics should be obtained from a Spark job during execution or after completion.

I really like the way the GCP Dataflow implementation automatically deploys jar file binaries, and from what I can determine org.apache.spark.launcher.SparkLauncher offers similar functionality, so perhaps I'm doing something wrong, or I'm reading the docs wrong, or the wrong docs. The thing is, if you try running your pipelines against a Spark master, feedback is really minimal, which puts you in a trial & error situation pretty quickly.

So thanks again in advance for any help!

Cheers,
Matt

---
Matt Casters <mattcast...@gmail.com> <mcast...@pentaho.org>
Senior Solution Architect, Kettle Project Founder
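For reference, this is roughly what I'm trying on the Beam side — a minimal sketch, assuming the SparkRunner can be driven entirely from Pipeline.run() without a separate spark-submit step; the master URL is a placeholder and the pipeline body is elided:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkRunnerSketch {
  public static void main(String[] args) {
    // Configure the Spark runner programmatically rather than via spark-submit.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setSparkMaster("spark://spark-master:7077"); // placeholder master URL

    Pipeline pipeline = Pipeline.create(options);
    // ... build the actual pipeline here ...

    // Pipeline.run() submits the job; the returned PipelineResult is what
    // I'd expect to query for metrics during execution and after completion.
    PipelineResult result = pipeline.run();
    result.waitUntilFinish();
    System.out.println(result.metrics().queryMetrics(MetricsFilter.builder().build()));
  }
}
```

On the launcher side, org.apache.spark.launcher.SparkLauncher (setAppResource / setMainClass / startApplication) looks like it offers the same kind of programmatic submission, which is what made me hope the jar-packaging plus spark-submit step could be avoided entirely.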