Dear Beam friends,

Now that I've got cool data integration (Kettle-beam) scenarios running on Dataflow with sample data sets in Google (Files, Pub/Sub, BigQuery, Streaming, Windowing, ...) I thought it was time to also give Apache Spark some attention.
The thing I'm having some trouble with is figuring out the relationship between the runner (SparkRunner), Pipeline.run(), and spark-submit (or SparkLauncher). The samples I'm seeing mostly involve packaging up a jar file and then doing a spark-submit. That makes it unclear whether Pipeline.run() should be used at all, and how Metrics should be obtained from a Spark job during execution or after completion.

I really like the way the GCP Dataflow implementation automatically deploys jar file binaries, and from what I can determine org.apache.spark.launcher.SparkLauncher offers similar functionality, so perhaps I'm doing something wrong, or I'm reading the docs wrong, or the wrong docs. The thing is, if you try running your pipelines against a Spark master, feedback is really minimal, which puts you in a trial & error situation pretty quickly.

So thanks again in advance for any help!

Cheers,
Matt

---
Matt Casters <mattcast...@gmail.com> <mcast...@pentaho.org>
Senior Solution Architect, Kettle Project Founder
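For reference, this is roughly what I'm trying on the Beam side — a minimal sketch, assuming the SparkRunner can be driven entirely from Pipeline.run() without a separate spark-submit step; the master URL is a placeholder and the pipeline body is elided:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkRunnerSketch {
  public static void main(String[] args) {
    // Configure the Spark runner programmatically rather than via spark-submit.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setSparkMaster("spark://spark-master:7077"); // placeholder master URL

    Pipeline pipeline = Pipeline.create(options);
    // ... build the actual pipeline here ...

    // Pipeline.run() submits the job; the returned PipelineResult is what
    // I'd expect to query for metrics during execution and after completion.
    PipelineResult result = pipeline.run();
    result.waitUntilFinish();
    System.out.println(result.metrics().queryMetrics(MetricsFilter.builder().build()));
  }
}
```

On the launcher side, org.apache.spark.launcher.SparkLauncher (setAppResource / setMainClass / startApplication) looks like it offers the same kind of programmatic submission, which is what made me hope the jar-packaging plus spark-submit step could be avoided entirely.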