Hi Matt, I just wanted to remind you that you can also use Apache Livy [1] to launch Spark jobs (or Beam pipelines built with the SparkRunner) on Spark using just its REST API [2]. Of course, you still need to manually create a "fat" jar and put it somewhere Spark can find it.
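To give you an idea, here is a minimal, untested sketch of submitting such a jar through Livy's POST /batches endpoint using Java 11's HttpClient (the Livy host, jar location, main class and pipeline arguments below are just placeholders for your own setup):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LivyBatchSubmit {
  public static void main(String[] args) throws Exception {
    // Request body for Livy's POST /batches endpoint. The fat jar has to be
    // readable by Spark already (e.g. on HDFS); jar path, class name and
    // pipeline args are placeholders.
    String body = "{"
        + "\"file\": \"hdfs:///jobs/my-beam-pipeline-fat.jar\","
        + "\"className\": \"com.example.MyBeamPipeline\","
        + "\"args\": [\"--runner=SparkRunner\"]"
        + "}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://livy-host:8998/batches"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());

    // Livy answers with the batch id and its initial state.
    System.out.println(response.statusCode() + " " + response.body());
  }
}

Afterwards you can poll GET /batches/{batchId}/state to follow the job, which already gives you a bit more feedback than a fire-and-forget spark-submit.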
[1] https://livy.incubator.apache.org/
[2] https://livy.incubator.apache.org/docs/latest/rest-api.html

> On 18 Jan 2019, at 13:03, Juan Carlos Garcia <jcgarc...@gmail.com> wrote:
>
> Hi Matt,
>
> With Flink you will be able to launch your pipeline just by invoking the main method of your main class; however, it will run as a standalone process and you will not have the advantage of distributed computation.
>
> On Fri, 18 Jan 2019 at 09:37, Matt Casters <mattcast...@gmail.com> wrote:
> Thanks for the reply JC, I really appreciate it.
>
> I really can't force our users to use antiquated stuff like scripts, let alone command-line things, but I'll simply use SparkLauncher, and your comment about the main class doing Pipeline.run() on the Master is something I can work with... somewhat.
> The execution results, metrics and all that are handled by the Master, I guess. Over time I'll figure out a way to report the metrics and results from the master back to the client. I've done similar things with Map/Reduce in the past.
>
> Looking around I see that the same conditions apply for Flink. Is this because Spark and Flink lack the APIs to talk to a client about the state of workloads, unlike DataFlow and the Direct Runner?
>
> Thanks!
>
> Matt
> ---
> Matt Casters <mattcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
> On Thu, 17 Jan 2019 at 15:30, Juan Carlos Garcia <jcgarc...@gmail.com> wrote:
> Hi Matt, during the time we were using Spark with Beam, the solution was always to pack the jar and use the spark-submit command pointing to your main class, which will do `pipeline.run`.
>
> The spark-submit command has a flag (--deploy-mode) to decide how to run it: whether to launch the job on the driver machine or on one of the machines in the cluster.
>
> JC
>
> On Thu, Jan 17, 2019 at 10:00 AM Matt Casters <mattcast...@gmail.com> wrote:
> Dear Beam friends,
>
> Now that I've got cool data integration (Kettle-Beam) scenarios running on DataFlow with sample data sets in Google (Files, Pub/Sub, BigQuery, Streaming, Windowing, ...) I thought it was time to also give Apache Spark some attention.
>
> The thing I have some trouble with is figuring out what the relationship is between the runner (SparkRunner), Pipeline.run() and spark-submit (or SparkLauncher).
>
> The samples I'm seeing mostly involve packaging up a jar file and then doing a spark-submit. That makes it unclear whether Pipeline.run() should be used at all, and how Metrics should be obtained from a Spark job during execution or after completion.
>
> I really like the way the GCP DataFlow implementation automatically deploys jar file binaries, and from what I can determine org.apache.spark.launcher.SparkLauncher offers this functionality, so perhaps I'm either doing something wrong, or I'm reading the docs wrong, or the wrong docs.
> The thing is, if you try running your pipelines against a Spark master, feedback is really minimal, putting you in a trial & error situation pretty quickly.
>
> So thanks again in advance for any help!
>
> Cheers,
>
> Matt
> ---
> Matt Casters <mattcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
> --
>
> JC
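P.S. For the SparkLauncher route Matt asked about, a rough, untested sketch of handing a Beam fat jar to org.apache.spark.launcher.SparkLauncher could look like this (Spark home, master URL, jar path and main class are placeholders):

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchBeamOnSpark {
  public static void main(String[] args) throws Exception {
    SparkAppHandle handle = new SparkLauncher()
        .setSparkHome("/opt/spark")
        .setMaster("spark://spark-master:7077")
        .setDeployMode("cluster")                        // or "client"
        .setAppResource("/jobs/my-beam-pipeline-fat.jar")
        .setMainClass("com.example.MyBeamPipeline")
        .addAppArgs("--runner=SparkRunner")
        .startApplication(new SparkAppHandle.Listener() {
          @Override
          public void stateChanged(SparkAppHandle h) {
            // Driver state transitions are reported back to the launching JVM.
            System.out.println("State: " + h.getState());
          }
          @Override
          public void infoChanged(SparkAppHandle h) {
          }
        });

    // Wait until the application reaches a terminal state.
    while (!handle.getState().isFinal()) {
      Thread.sleep(1000);
    }
  }
}

The SparkAppHandle at least reports the driver's state back to the launching process; metrics, as discussed above, still have to be collected on the Spark side.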