Hi Matt, I just wanted to remind you that you can also use Apache Livy [1] to launch Spark jobs (or Beam pipelines built with the SparkRunner) on Spark using just its REST API [2]. Of course, you still need to manually create a "fat" jar and put it somewhere Spark can find it.
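To give you an idea, here is a minimal, untested sketch of submitting such a jar through Livy's POST /batches endpoint using Java 11's HttpClient (the Livy host, jar location, main class and pipeline arguments below are just placeholders for your own setup):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LivyBatchSubmit {
  public static void main(String[] args) throws Exception {
    // Request body for Livy's POST /batches endpoint. The fat jar has to be
    // readable by Spark already (e.g. on HDFS); jar path, class name and
    // pipeline args are placeholders.
    String body = "{"
        + "\"file\": \"hdfs:///jobs/my-beam-pipeline-fat.jar\","
        + "\"className\": \"com.example.MyBeamPipeline\","
        + "\"args\": [\"--runner=SparkRunner\"]"
        + "}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://livy-host:8998/batches"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());

    // Livy answers with the batch id and its initial state.
    System.out.println(response.statusCode() + " " + response.body());
  }
}

Afterwards you can poll GET /batches/{batchId}/state to follow the job, which already gives you a bit more feedback than a fire-and-forget spark-submit.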
[1] https://livy.incubator.apache.org/
[2] https://livy.incubator.apache.org/docs/latest/rest-api.html

> On 18 Jan 2019, at 13:03, Juan Carlos Garcia <jcgarc...@gmail.com> wrote:
>
> Hi Matt,
>
> With Flink you will be able to launch your pipeline just by invoking the main method of your main class; however, it will run as a standalone process and you will not have the advantage of distributed computation.
>
> On Fri, 18 Jan 2019 at 09:37, Matt Casters <mattcast...@gmail.com> wrote:
> Thanks for the reply JC, I really appreciate it.
>
> I really can't force our users to use antiquated stuff like scripts, let alone command-line things, but I'll simply use SparkLauncher, and your comment about the main class doing Pipeline.run() on the Master is something I can work with... somewhat.
> The execution results, metrics and all that are handled by the Master, I guess. Over time I'll figure out a way to report the metrics and results from the master back to the client. I've done similar things with Map/Reduce in the past.
>
> Looking around I see that the same conditions apply for Flink. Is this because Spark and Flink lack the APIs to talk to a client about the state of workloads, unlike DataFlow and the Direct Runner?
>
> Thanks!
>
> Matt
> ---
> Matt Casters <mattcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
> On Thu, 17 Jan 2019 at 15:30, Juan Carlos Garcia <jcgarc...@gmail.com> wrote:
> Hi Matt, during the time we were using Spark with Beam, the solution was always to pack the jar and use the spark-submit command pointing to your main class, which will do `pipeline.run`.
>
> The spark-submit command has a flag (--deploy-mode) to decide how to run it: whether to launch the job on the driver machine or on one of the machines in the cluster.
>
> JC
>
> On Thu, Jan 17, 2019 at 10:00 AM Matt Casters <mattcast...@gmail.com> wrote:
> Dear Beam friends,
>
> Now that I've got cool data integration (Kettle-Beam) scenarios running on DataFlow with sample data sets in Google (Files, Pub/Sub, BigQuery, Streaming, Windowing, ...) I thought it was time to also give Apache Spark some attention.
>
> The thing I have some trouble with is figuring out what the relationship is between the runner (SparkRunner), Pipeline.run() and spark-submit (or SparkLauncher).
>
> The samples I'm seeing mostly involve packaging up a jar file and then doing a spark-submit. That makes it unclear whether Pipeline.run() should be used at all, and how Metrics should be obtained from a Spark job during execution or after completion.
>
> I really like the way the GCP DataFlow implementation automatically deploys jar file binaries, and from what I can determine org.apache.spark.launcher.SparkLauncher offers this functionality, so perhaps I'm either doing something wrong, or I'm reading the docs wrong, or the wrong docs.
> The thing is, if you try running your pipelines against a Spark master, feedback is really minimal, putting you in a trial & error situation pretty quickly.
>
> So thanks again in advance for any help!
>
> Cheers,
>
> Matt
> ---
> Matt Casters <mattcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
> --
>
> JC
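P.S. For the SparkLauncher route Matt asked about, a rough, untested sketch of handing a Beam fat jar to org.apache.spark.launcher.SparkLauncher could look like this (Spark home, master URL, jar path and main class are placeholders):

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchBeamOnSpark {
  public static void main(String[] args) throws Exception {
    SparkAppHandle handle = new SparkLauncher()
        .setSparkHome("/opt/spark")
        .setMaster("spark://spark-master:7077")
        .setDeployMode("cluster")                        // or "client"
        .setAppResource("/jobs/my-beam-pipeline-fat.jar")
        .setMainClass("com.example.MyBeamPipeline")
        .addAppArgs("--runner=SparkRunner")
        .startApplication(new SparkAppHandle.Listener() {
          @Override
          public void stateChanged(SparkAppHandle h) {
            // Driver state transitions are reported back to the launching JVM.
            System.out.println("State: " + h.getState());
          }
          @Override
          public void infoChanged(SparkAppHandle h) {
          }
        });

    // Wait until the application reaches a terminal state.
    while (!handle.getState().isFinal()) {
      Thread.sleep(1000);
    }
  }
}

The SparkAppHandle at least reports the driver's state back to the launching process; metrics, as discussed above, still have to be collected on the Spark side.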