How to spark-submit using python subprocess module?
I have a Python script that submits Spark jobs using the spark-submit tool. I want to execute the command and write its output both to STDOUT and to a logfile in real time. I'm using Python 2.7 on an Ubuntu server. This is what I have so far in my SubmitJob.py script:

    #!/usr/bin/python
    import subprocess

    # Submit the command
    def submitJob(cmd, log_file):
        with open(log_file, 'w') as fh:
            process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                       stderr=subprocess.STDOUT)
            while True:
                output = process.stdout.readline()
                if output == '' and process.poll() is not None:
                    break
                if output:
                    print output.strip()
                    fh.write(output)
        rc = process.poll()
        return rc

    if __name__ == "__main__":
        cmdList = ["dse", "spark-submit", "--spark-master",
                   "spark://127.0.0.1:7077", "--class", "com.spark.myapp",
                   "./myapp.jar"]
        log_file = "/tmp/out.log"
        exit_status = submitJob(cmdList, log_file)
        print "job finished with status ", exit_status

The strange thing is that when I execute the same command directly in the shell, it works fine and produces output on screen as the program proceeds. So it looks like something is wrong in the way I'm using subprocess.PIPE for stdout and writing the file. What's the currently recommended way to use the subprocess module to write to stdout and a log file in real time, line by line? I see a lot of different options on the internet but I'm not sure which is correct or current. Is there anything specific about the way spark-submit buffers its stdout that I need to take care of? Thanks
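For what it's worth, here is a minimal sketch of the tee pattern with explicit flushes (Python 2/3 compatible; `echo` is a stand-in for the real dse spark-submit command, and `tee_run` is a name I made up). The usual culprits for "no output until the end" are the parent not flushing its own stdout when it isn't a tty, and the child block-buffering once its stdout becomes a pipe; Spark's log4j console output typically goes to stderr, which the stderr=STDOUT merge picks up.

```python
import subprocess
import sys

def tee_run(cmd, log_path):
    """Run cmd, copying each output line to our stdout and a log file as it arrives."""
    with open(log_path, 'w') as log:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        for raw in iter(proc.stdout.readline, b''):
            line = raw.decode('utf-8', 'replace')
            sys.stdout.write(line)
            sys.stdout.flush()   # don't let our own buffering hide the line
            log.write(line)
            log.flush()          # keep the log file current too
        proc.stdout.close()
    return proc.wait()

if __name__ == '__main__':
    # 'echo' stands in for: ["dse", "spark-submit", ...]
    rc = tee_run(['echo', 'hello from child'], '/tmp/tee_demo.log')
    sys.stdout.write('exit status: %d\n' % rc)
```

If the child itself still buffers aggressively, wrapping the command with stdbuf (`["stdbuf", "-oL", ...]`) is a common workaround on Linux.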
Re: Spark REST API shows Error 503 Service Unavailable
No, we are using standard Spark with DataStax Cassandra. I'm able to see some JSON when I hit http://10.1.40.16:7080/json/v1/applications, but I get the following error when I hit http://10.1.40.16:7080/api/v1/applications:

HTTP ERROR 503
Problem accessing /api/v1/applications. Reason: Service Unavailable

Caused by: org.spark-project.jetty.servlet.ServletHolder$1: java.lang.reflect.InvocationTargetException
    at org.spark-project.jetty.servlet.ServletHolder.makeUnavailable(ServletHolder.java:496)
    at org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:543)
    at org.spark-project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:415)
    at org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:657)
    at org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
    at org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
    at org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
    at org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
    at org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.spark-project.jetty.server.Server.handle(Server.java:370)
    at org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
    at org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
    at org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
    at org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
    at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:728)
    at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:678)
    at com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203)
    at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:373)
    at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:556)
    at javax.servlet.GenericServlet.init(GenericServlet.java:244)
    at org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:532)
    ... 21 more
Caused by: java.lang.IncompatibleClassChangeError: Implementing class
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at com.sun.jersey.api.core.ScanningResourceConfig.init(ScanningResourceConfig.java:79)
    at com.sun.jersey.api.core.PackagesResourceConfig.init(PackagesResourceConfig.java:104)
    at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:78)
    at com.sun.jersey.api.core.PackagesResourceConfig.<init>(Pac
Re: Spark REST API shows Error 503 Service Unavailable
Hi Prateek, Were you able to figure out why this is happening? I'm seeing the same error on my Spark standalone cluster. Any pointers, anyone?

On Fri, Dec 11, 2015 at 2:05 PM, prateek arora wrote:
> Hi
>
> I am trying to access Spark using the REST API but got the below error:
>
> Command:
>
> curl http://:18088/api/v1/applications
>
> Response:
>
> Error 503 Service Unavailable
>
> HTTP ERROR 503
> Problem accessing /api/v1/applications. Reason: Service Unavailable
> Caused by: org.spark-project.jetty.servlet.ServletHolder$1: java.lang.reflect.InvocationTargetException
>     at org.spark-project.jetty.servlet.ServletHolder.makeUnavailable(ServletHolder.java:496)
>     at org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:543)
>     at org.spark-project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:415)
>     at org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:657)
>     at org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
>     at org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
>     at org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
>     at org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
>     at org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>     at org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:301)
>     at org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>     at org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>     at org.spark-project.jetty.server.Server.handle(Server.java:370)
>     at org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>     at org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>     at org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>     at org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>     at org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>     at org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>     at org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>     at org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>     at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>     at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:728)
>     at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:678)
>     at com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:373)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:556)
>     at javax.servlet.GenericServlet.init(GenericServlet.java:244)
>     at org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:532)
>     ... 22 more
> Caused by: java.lang.NoSuchMethodError: com.sun.jersey.core.reflection.ReflectionHelper.getOsgiRegistryInstance()Lcom/sun/jersey/core/osgi/OsgiRegistry;
>     at com.sun.jersey.spi.scanning.AnnotationScannerListener$AnnotatedClassVisitor.getClassForName(AnnotationScannerListener.java:217)
>     at com.sun.jersey.spi.scanning.AnnotationScannerListener$AnnotatedClassVisitor.visitEnd(AnnotationScannerListener.java:186)
>     at org.objectweb.asm.ClassReader.accept(Unknown Source)
>     at org.objectweb.asm.ClassReader.accept(Unknown Source)
>     at com.sun.jersey.spi.scanning.AnnotationScannerListener.onProcess(AnnotationScannerListener.java:136)
>     at com.sun.jersey.core.spi.scanning.JarFileScanner.scan(JarFileScanner.java:97)
>     at com.sun
Re: How to kill spark applications submitted using spark-submit reliably?
I tried adding a shutdown hook to my code but it didn't help. Still the same issue.

On Fri, Nov 20, 2015 at 7:08 PM, Ted Yu wrote:
> Which Spark release are you using?
>
> Can you pastebin the stack trace of the process running on your machine?
>
> Thanks
>
> On Nov 20, 2015, at 6:46 PM, Vikram Kone wrote:
>
> Hi,
> I'm seeing a strange problem. I have a Spark cluster in standalone mode. I submit Spark jobs from a remote node as follows from the terminal:
>
>     spark-submit --master spark://10.1.40.18:7077 --class com.test.Ping spark-jobs.jar
>
> When the app is running and I press Ctrl-C in the console terminal, the process is killed and so is the app in the Spark master UI. When I go to the Spark master UI, I see that this app is in state Killed under Completed applications, which is what I expected to see.
>
> Now, I created a shell script as follows to do the same:
>
>     #!/bin/bash
>     spark-submit --master spark://10.1.40.18:7077 --class com.test.Ping spark-jobs.jar
>     echo $! > my.pid
>
> When I execute the shell script from the terminal as follows:
>
>     $> bash myscript.sh
>
> the application is submitted correctly to the Spark master and I can see it as one of the running apps in the Spark master UI. But when I kill the process in my terminal as follows:
>
>     $> kill $(cat my.pid)
>
> I see that the process is killed on my machine but the Spark application is still running in the Spark master! It doesn't get killed.
>
> I noticed one more thing: when I launch the Spark job via the shell script and kill the application from the Spark master UI by clicking "kill" next to the running application, it gets killed in the Spark UI but I still see the process running on my machine.
>
> In both cases, I would expect the remote Spark app to be killed and my local process to be killed.
>
> Why is this happening? And how can I kill a Spark app from the terminal, launched via a shell script, without going to the Spark master UI?
>
> I want to launch the Spark app via script and log the PID so I can monitor it remotely.
>
> Thanks for the help
Re: How to kill spark applications submitted using spark-submit reliably?
Thanks for the info, Stéphane. Why does Ctrl-C in the terminal running spark-submit kill the app in the Spark master correctly without any explicit shutdown hooks in the code? Can you explain why we need to add the shutdown hook to kill it when it is launched via a shell script? For the second issue, I'm not using any thread pool, so I'm not sure why killing the app in the Spark UI doesn't kill the process launched via the script.

On Friday, November 20, 2015, Stéphane Verlet wrote:
> I solved the first issue by adding a shutdown hook in my code. The shutdown hook gets called when you exit your script (Ctrl-C, kill ... but not kill -9):
>
>     val shutdownHook = scala.sys.addShutdownHook {
>       try {
>         sparkContext.stop()
>         // Make sure to kill any other threads or thread pools you may be running
>       } catch {
>         case e: Exception => {
>           ...
>         }
>       }
>     }
>
> For the other issue, kill from the UI: I also had that issue. It was caused by a thread pool that I use. So I surrounded my code with a try/finally block to guarantee that the thread pool was shut down when Spark stopped.
>
> I hope this helps,
>
> Stephane
>
> On Fri, Nov 20, 2015 at 7:46 PM, Vikram Kone wrote:
>> [original message quoted in full trimmed]
Re: How to kill spark applications submitted using spark-submit reliably?
Spark 1.4.1

On Friday, November 20, 2015, Ted Yu wrote:
> Which Spark release are you using?
>
> Can you pastebin the stack trace of the process running on your machine?
>
> Thanks
>
> On Nov 20, 2015, at 6:46 PM, Vikram Kone wrote:
>
> [original message quoted in full trimmed]
How to kill spark applications submitted using spark-submit reliably?
Hi,
I'm seeing a strange problem. I have a Spark cluster in standalone mode. I submit Spark jobs from a remote node as follows from the terminal:

    spark-submit --master spark://10.1.40.18:7077 --class com.test.Ping spark-jobs.jar

When the app is running and I press Ctrl-C in the console terminal, the process is killed and so is the app in the Spark master UI. When I go to the Spark master UI, I see that this app is in state Killed under Completed applications, which is what I expected to see.

Now, I created a shell script as follows to do the same:

    #!/bin/bash
    spark-submit --master spark://10.1.40.18:7077 --class com.test.Ping spark-jobs.jar
    echo $! > my.pid

When I execute the shell script from the terminal as follows:

    $> bash myscript.sh

the application is submitted correctly to the Spark master and I can see it as one of the running apps in the Spark master UI. But when I kill the process in my terminal as follows:

    $> kill $(cat my.pid)

I see that the process is killed on my machine but the Spark application is still running in the Spark master! It doesn't get killed.

I noticed one more thing: when I launch the Spark job via the shell script and kill the application from the Spark master UI by clicking "kill" next to the running application, it gets killed in the Spark UI but I still see the process running on my machine.

In both cases, I would expect the remote Spark app to be killed and my local process to be killed.

Why is this happening? And how can I kill a Spark app from the terminal, launched via a shell script, without going to the Spark master UI? I want to launch the Spark app via script and log the PID so I can monitor it remotely.

Thanks for the help
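One thing worth noting about the script above: `$!` expands to the PID of the most recently *backgrounded* job, and spark-submit here runs in the foreground, so `my.pid` never actually contains the spark-submit PID. A minimal sketch of the fix (using `sleep` purely as a stand-in for the real spark-submit invocation):

```shell
#!/bin/bash
# Sketch only: 'sleep 30' stands in for
#   spark-submit --master spark://10.1.40.18:7077 --class com.test.Ping spark-jobs.jar
# The trailing '&' is the important part: without it, $! does not refer to this command.
sleep 30 &
echo $! > my.pid             # my.pid now really holds the child's PID
kill "$(cat my.pid)"         # terminates the local process
wait "$(cat my.pid)" 2>/dev/null || true
echo "killed pid $(cat my.pid)"
```

Even with the right PID, killing the local spark-submit only takes the cluster application down in client deploy mode, where the driver runs inside that local process. In standalone cluster mode the driver lives on the cluster, and something like `spark-class org.apache.spark.deploy.Client kill <master-url> <driver-id>` (or the master UI) is needed; treat that command as an assumption to verify against your Spark version.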
Re: Spark job workflow engine recommendations
Hi Feng,
Does Airflow allow remote submissions of Spark jobs via spark-submit?

On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu wrote:
> Hi,
>
> we use 'Airflow' as our job workflow scheduler.
>
> On Nov 19, 2015, at 9:47 AM, Vikram Kone wrote:
>
> Hi Nick,
> Quick question about the spark-submit command executed from Azkaban with the command job type. I see that when I press kill in the Azkaban portal on a spark-submit job, it doesn't actually kill the application on the Spark master, and it continues to run even though Azkaban thinks it's killed. How do you get around this? Is there a way to kill spark-submit jobs from the Azkaban portal?
>
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath wrote:
>> Hi Vikram,
>>
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use local mode deployment and it is fairly easy to set up. It is pretty easy to use and has a nice scheduling and logging interface, as well as SLAs (like kill job and notify if it doesn't complete in 3 hours or whatever).
>>
>> However, Spark support is not present directly - we run everything with shell scripts and spark-submit. There is a plugin interface where one could create a Spark plugin, but I found it very cumbersome when I did investigate and didn't have the time to work through it to develop that.
>>
>> It has some quirks, and while there is actually a REST API for adding jobs and dynamically scheduling jobs, it is not documented anywhere so you kinda have to figure it out for yourself. But in terms of ease of use I found it way better than Oozie. I haven't tried Chronos, and it seemed quite involved to set up. Haven't tried Luigi either.
>>
>> Spark job server is good but, as you say, lacks some stuff like scheduling and DAG-type workflows (independent of Spark-defined job flows).
>>
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke wrote:
>>> Check also Falcon in combination with Oozie
>>>
>>> Le ven. 7 août 2015 à 17:51, Hien Luu a écrit :
>>>> Looks like Oozie can satisfy most of your requirements.
>>>>
>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone wrote:
>>>>> Hi,
>>>>> I'm looking for open source workflow tools/engines that allow us to schedule Spark jobs on a DataStax Cassandra cluster. Since there are tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I wanted to check with people here to see what they are using today.
>>>>>
>>>>> Some of the requirements of the workflow engine that I'm looking for are:
>>>>>
>>>>> 1. First-class support for submitting Spark jobs on Cassandra. Not some wrapper Java code to submit tasks.
>>>>> 2. Active open source community support and well tested at production scale.
>>>>> 3. Should be dead easy to write job dependencies using XML or a web interface. E.g. job A depends on job B and job C, so run job A after B and C are finished. We don't want to write full-blown Java applications to specify job parameters and dependencies. Should be very simple to use.
>>>>> 4. Time-based recurrent scheduling. Run the Spark jobs at a given time every hour, day, week or month.
>>>>> 5. Job monitoring, alerting on failures and email notifications on a daily basis.
>>>>>
>>>>> I have looked at Ooyala's Spark job server, which seems to be geared towards making Spark jobs run faster by sharing contexts between the jobs, but isn't a full-blown workflow engine per se. A combination of Spark job server and a workflow engine would be ideal.
>>>>>
>>>>> Thanks for the inputs
Re: Spark job workflow engine recommendations
Hi Nick,
Quick question about the spark-submit command executed from Azkaban with the command job type. I see that when I press kill in the Azkaban portal on a spark-submit job, it doesn't actually kill the application on the Spark master, and it continues to run even though Azkaban thinks it's killed. How do you get around this? Is there a way to kill spark-submit jobs from the Azkaban portal?

On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath wrote:
> [earlier messages quoted in full trimmed]
Re: Spark job workflow engine recommendations
Hien,
I saw this pull request, and from what I understand it is geared towards running Spark jobs over Hadoop. We are using Spark over Cassandra, and I'm not sure if this new job type supports that. I haven't seen any documentation on how to use this Spark job plugin, so that I can test it out on our cluster. We are currently submitting our Spark jobs with the command job type, using a command like "dse spark-submit --class com.org.classname ./test.jar". What would be the advantage of using the native Spark job type over the command job type?

I didn't understand from your reply whether Azkaban already supports long-running jobs like Spark streaming - does it? Streaming jobs generally need to run indefinitely, and need to be restarted if for some reason they fail (lack of resources, maybe). I can probably use the auto-retry feature for this, but I'm not sure. I'm looking forward to the multiple-executor support, which should greatly help with the scalability issue.

On Wed, Oct 7, 2015 at 9:56 AM, Hien Luu wrote:
> The Spark job type was added recently - see this pull request
> https://github.com/azkaban/azkaban-plugins/pull/195. You can leverage
> the SLA feature to kill a job if it ran longer than expected.
>
> BTW, we just solved the scalability issue by supporting multiple
> executors. Within a week or two, the code for that should be merged into
> the main trunk.
>
> Hien
>
> On Tue, Oct 6, 2015 at 9:40 PM, Vikram Kone wrote:
>> Does Azkaban support scheduling long-running jobs like Spark streaming
>> jobs? Will Azkaban kill a job if it's running for a long time?
>>
>> On Friday, August 7, 2015, Vikram Kone wrote:
>>> [earlier messages quoted in full trimmed]
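For readers following along: wrapping dse spark-submit in an Azkaban command-type job is just a small `.job` properties file. A hypothetical sketch (the class and jar names are the ones from this thread; `retries` and `retry.backoff` are the auto-retry knobs mentioned above - verify them against your Azkaban version):

```properties
# ping-job.job - hypothetical Azkaban "command" job type definition
type=command
command=dse spark-submit --class com.org.classname ./test.jar

# auto-retry settings for jobs that should be restarted on failure
retries=3
retry.backoff=60000
```

Note that, as discussed in this thread, killing such a job from the Azkaban portal only kills the local spark-submit process, not necessarily the application on the Spark master.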
Re: Spark job workflow engine recommendations
Does Azkaban support scheduling long-running jobs like Spark streaming jobs? Will Azkaban kill a job if it has been running for a long time?

On Friday, August 7, 2015, Vikram Kone wrote:
> Hien,
> Is Azkaban being phased out at LinkedIn as rumored? If so, what's LinkedIn
> going to use for workflow scheduling? Is there something else that's going
> to replace Azkaban?
Re: Notification on Spark Streaming job failure
We are using Monit to kick off Spark streaming jobs and it seems to work fine.

On Monday, September 28, 2015, Chen Song wrote:
> I am also interested specifically in monitoring and alerting on Spark
> streaming jobs. It will be helpful to get some general guidelines or advice
> on this from people who have implemented anything on this.
>
> On Fri, Sep 18, 2015 at 2:35 AM, Krzysztof Zarzycki wrote:
>
>> Hi there Spark Community,
>> I would like to ask you for advice: I'm running Spark Streaming jobs in
>> production. Sometimes these jobs fail and I would like to get an email
>> notification about it. Do you know how I can set up Spark to notify me by
>> email if my job fails? Or do I have to use an external monitoring tool?
>> I'm thinking of the following options:
>> 1. As I'm running those jobs on YARN, monitor the YARN jobs somehow.
>> Looked for it as well but couldn't find any YARN feature to do it.
>> 2. Run the Spark Streaming job under some scheduler, like Oozie, Azkaban,
>> Luigi. Those are built rather for batch jobs, not streaming, but could
>> work. Has anyone tried that?
>> 3. Run the job driver under the "monit" tool and catch the failure and
>> send an email about it. Currently I'm deploying with yarn-cluster mode and
>> I would need to give that up to run under monit.
>> 4. Set up a monitoring tool (like Graphite, Ganglia, Prometheus) and use
>> Spark metrics, then implement alerting in those. Can I get information
>> about failed jobs from Spark metrics?
>> 5. As 4, but implement my own custom job metrics and monitor them.
>>
>> What's your opinion about my options? How do you people solve this
>> problem? Anything Spark specific?
>> I'll be grateful for any advice on this subject.
>> Thanks!
>> Krzysiek
>
> --
> Chen Song
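For anyone going down the Monit route, here is a minimal sketch of what such a check might look like (the process name, script paths, pidfile location and email address are all placeholders; it assumes the driver runs outside yarn-cluster mode, so Monit can see the process, and that the start script writes a pidfile):

```
set mailserver localhost
set alert ops@example.com

check process spark-streaming-app with pidfile /var/run/spark-streaming-app.pid
  start program = "/opt/jobs/start-streaming-app.sh"
  stop program  = "/opt/jobs/stop-streaming-app.sh"
  # Monit restarts the process when it disappears; this line additionally
  # emails an alert if it keeps dying in a short window
  if 3 restarts within 5 cycles then alert
```

Monit polls in cycles (30 seconds by default), restarts the driver when the pidfile's process is gone, and the `set alert` address receives the failure notifications.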
How to run spark in standalone mode on cassandra with high availability?
Hi,
We are planning to install Spark in standalone mode on a Cassandra cluster. The problem is that Cassandra has a no-SPOF, masterless architecture, i.e. any node can coordinate for the cluster, while Spark standalone is not peer-to-peer: it relies on a dedicated master, which becomes a single point of failure. What are our options here? Are there any frameworks or tools out there that would allow any application to run on a cluster of machines with high availability?
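One option within Spark standalone itself: it supports running multiple masters with standby failover coordinated through ZooKeeper, so losing the active master node doesn't take the cluster down. A sketch of the relevant `spark-env.sh` settings on each master candidate (the ZooKeeper hostnames here are placeholders):

```shell
# spark-env.sh on every node that may run a master (hostnames are examples)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
 -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
 -Dspark.deploy.zookeeper.dir=/spark"
```

Applications and workers then list all master candidates in the master URL, e.g. `--master spark://host1:7077,host2:7077`, and reconnect to whichever master ZooKeeper elects as leader after a failover.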
Re: Spark job workflow engine recommendations
r job does not start?
* Do you need high availability for job scheduling? That will require additional components.

This became a bit of a brain dump on the topic. I hope that it is useful. Don't hesitate to get back if I can help.

Regards,
Lars Albertsson
Re: Spark job workflow engine recommendations
Hien,
Is Azkaban being phased out at LinkedIn as rumored? If so, what's LinkedIn going to use for workflow scheduling? Is there something else that's going to replace Azkaban?

On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu wrote:
> In my opinion, choosing some particular project among its peers should
> leave enough room for future growth (which may come faster than you
> initially think).
>
> Cheers
>
> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu wrote:
>
>> Scalability is a known issue due to the current architecture. However
>> this will only be applicable if you run more than 20K jobs per day.
>>
>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu wrote:
>>
>>> From what I heard (an ex-coworker who is an Oozie committer), Azkaban is
>>> being phased out at LinkedIn because of scalability issues (though UI-wise,
>>> Azkaban seems better).
>>>
>>> Vikram:
>>> I suggest you do more research in related projects (maybe using their
>>> mailing lists).
>>>
>>> Disclaimer: I don't work for LinkedIn.
>>>
>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
>>>> Hi Vikram,
>>>>
>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We just
>>>> use local-mode deployment and it is fairly easy to set up. It is pretty
>>>> easy to use and has a nice scheduling and logging interface, as well as
>>>> SLAs (like kill job and notify if it doesn't complete in 3 hours or
>>>> whatever).
>>>>
>>>> However, Spark support is not present directly - we run everything with
>>>> shell scripts and spark-submit. There is a plugin interface where one could
>>>> create a Spark plugin, but I found it very cumbersome when I did
>>>> investigate and didn't have the time to work through it to develop that.
>>>>
>>>> It has some quirks, and while there is actually a REST API for adding
>>>> jobs and dynamically scheduling jobs, it is not documented anywhere so you
>>>> kinda have to figure it out for yourself. But in terms of ease of use I
>>>> found it way better than Oozie. I haven't tried Chronos, and it seemed
>>>> quite involved to set up. Haven't tried Luigi either.
>>>>
>>>> Spark job server is good but as you say lacks some stuff like
>>>> scheduling and DAG-type workflows (independent of Spark-defined job flows).
>>>>
>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke wrote:
>>>>
>>>>> Check also Falcon in combination with Oozie
>>>>>
>>>>> Le ven. 7 août 2015 à 17:51, Hien Luu a écrit :
>>>>>
>>>>>> Looks like Oozie can satisfy most of your requirements.
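Since the "shell scripts and spark-submit" approach keeps coming up in this thread: the main thing to get right is propagating spark-submit's exit status to the scheduler while still capturing the log. A bash sketch (the dse/spark-submit invocation, master URL, class and jar in the example comment are made up for illustration):

```shell
#!/usr/bin/env bash
# Run an arbitrary submit command, tee its output to a timestamped log,
# and return the command's exit status so the workflow engine can alert
# on failure. Requires bash for `set -o pipefail`.
set -u -o pipefail

submit_job() {
  local app_name="$1"; shift
  local log_file="${LOG_DIR:-/tmp}/${app_name}.$(date +%Y%m%d%H%M%S).log"
  # pipefail makes the pipeline's status that of the submit command, not tee's
  "$@" 2>&1 | tee "$log_file"
}

# Example invocation (hypothetical master URL, class, and jar):
# submit_job myapp dse spark-submit --spark-master spark://127.0.0.1:7077 \
#   --class com.spark.myapp ./myapp.jar
```

Azkaban's `command` job type (like Oozie's shell action) treats a nonzero exit status as failure, so this is enough to trigger its failure emails.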
Re: Spark job workflow engine recommendations
Oh ok. That's a good enough reason against Azkaban then. So it looks like Oozie is the best choice here.

On Friday, August 7, 2015, Ted Yu wrote:
> From what I heard (an ex-coworker who is an Oozie committer), Azkaban is
> being phased out at LinkedIn because of scalability issues (though UI-wise,
> Azkaban seems better).
>
> Vikram:
> I suggest you do more research in related projects (maybe using their
> mailing lists).
>
> Disclaimer: I don't work for LinkedIn.
Re: Spark job workflow engine recommendations
Thanks for the suggestion Hien. I'm curious why not Azkaban from LinkedIn. From what I read online, Oozie was very cumbersome to set up and use compared to Azkaban. Since you are from LinkedIn, I wanted to get some perspective on what it lacks compared to Oozie. Ease of use is more important to us than a full feature set.

On Friday, August 7, 2015, Hien Luu wrote:
> Looks like Oozie can satisfy most of your requirements.
Spark job workflow engine recommendations
Hi,
I'm looking for open source workflow tools/engines that allow us to schedule Spark jobs on a DataStax Cassandra cluster. Since there are tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I wanted to check with people here to see what they are using today.

Some of the requirements of the workflow engine that I'm looking for are:

1. First-class support for submitting Spark jobs on Cassandra. Not some wrapper Java code to submit tasks.
2. Active open source community support, and well tested at production scale.
3. Should be dead easy to write job dependencies using XML or a web interface. E.g. job A depends on job B and job C, so run job A after B and C are finished. We don't want to write full-blown Java applications to specify job parameters and dependencies. Should be very simple to use.
4. Time-based recurrent scheduling. Run the Spark jobs at a given time every hour, day, week or month.
5. Job monitoring, alerting on failures, and email notifications on a daily basis.

I have looked at Ooyala's spark job server, which seems to be geared towards making Spark jobs run faster by sharing contexts between the jobs, but it isn't a full-blown workflow engine per se. A combination of spark job server and a workflow engine would be ideal.

Thanks for the inputs
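For requirement 3, as a concrete point of comparison: in Azkaban, the A/B/C dependency example above would be three small `.job` property files (the dse invocation, class names, and jar paths here are made up for illustration), zipped up and uploaded through the web UI, where time-based schedules and failure emails are then configured:

```
# jobB.job
type=command
command=dse spark-submit --class com.example.JobB ./jobB.jar

# jobC.job
type=command
command=dse spark-submit --class com.example.JobC ./jobC.jar

# jobA.job - runs only after jobB and jobC succeed
type=command
command=dse spark-submit --class com.example.JobA ./jobA.jar
dependencies=jobB,jobC
```

Each job lives in its own file; the `dependencies` key (comma-separated job names) is what builds the DAG.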