How to spark-submit using python subprocess module?

2016-10-13 Thread Vikram Kone
I have a Python script that is used to submit Spark jobs using the
spark-submit tool. I want to execute the command and write the output both
to STDOUT and a log file in real time. I'm using Python 2.7 on an Ubuntu
server.

This is what I have so far in my SubmitJob.py script

#!/usr/bin/python
import subprocess

# Submit the command
def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print output.strip()
                fh.write(output)
        rc = process.poll()
        return rc

if __name__ == "__main__":
    cmdList = ["dse", "spark-submit", "--spark-master",
               "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]
    log_file = "/tmp/out.log"
    exit_status = submitJob(cmdList, log_file)
    print "job finished with status ", exit_status

The strange thing is, when I execute the same command directly in the shell
it works fine and produces output on screen as the program proceeds.

So it looks like something is wrong in the way I'm using the
subprocess.PIPE for stdout and writing the file.

What's the currently recommended way to use the subprocess module to write to
stdout and a log file in real time, line by line? I see a lot of different
options on the internet but I'm not sure which is correct or current.

Is there anything specific about the way spark-submit buffers its stdout that
I need to take care of?
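
For reference, here is a minimal sketch of one commonly suggested pattern
(illustrative only, assuming Python 2.7: bufsize=1 asks for a line-buffered
pipe, stderr is merged into stdout since spark-submit's log4j output typically
goes to stderr, and both sinks are flushed after every line):

import subprocess
import sys

def submit_and_tee(cmd, log_path):
    # Merge stderr into stdout and ask for a line-buffered pipe.
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, bufsize=1)
    with open(log_path, 'w') as fh:
        # iter() with an empty-string sentinel reads until the child closes stdout.
        for line in iter(process.stdout.readline, ''):
            sys.stdout.write(line)
            sys.stdout.flush()
            fh.write(line)
            fh.flush()
    return process.wait()

If the output still arrives in bursts with this, the buffering is most likely
happening inside the JVM/log4j rather than in subprocess.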

thanks


Re: Spark REST API shows Error 503 Service Unavailable

2015-12-17 Thread Vikram Kone
No, we are using standard Spark with DataStax Cassandra. I'm able to see some
JSON when I hit http://10.1.40.16:7080/json/v1/applications
but I get the following error when I hit
http://10.1.40.16:7080/api/v1/applications

HTTP ERROR 503

Problem accessing /api/v1/applications. Reason:

Service Unavailable

Caused by:

org.spark-project.jetty.servlet.ServletHolder$1:
java.lang.reflect.InvocationTargetException
at 
org.spark-project.jetty.servlet.ServletHolder.makeUnavailable(ServletHolder.java:496)
at 
org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:543)
at 
org.spark-project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:415)
at 
org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:657)
at 
org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at 
org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at 
org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at 
org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at 
org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.spark-project.jetty.server.Server.handle(Server.java:370)
at 
org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at 
org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at 
org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at 
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at 
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at 
org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at 
org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:728)
at 
com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:678)
at 
com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:373)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:556)
at javax.servlet.GenericServlet.init(GenericServlet.java:244)
at 
org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:532)
... 21 more
Caused by: java.lang.IncompatibleClassChangeError: Implementing class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
com.sun.jersey.api.core.ScanningResourceConfig.init(ScanningResourceConfig.java:79)
at 
com.sun.jersey.api.core.PackagesResourceConfig.init(PackagesResourceConfig.java:104)
at 
com.sun.jersey.api.core.PackagesResourceConfig.(PackagesResourceConfig.java:78)
at 
com.sun.jersey.api.core.PackagesResourceConfig.(Pac

Re: Spark REST API shows Error 503 Service Unavailable

2015-12-17 Thread Vikram Kone
Hi Prateek,
Were you able to figure out why this is happening? I'm seeing the same error on
my Spark standalone cluster.

Any pointers anyone?

On Fri, Dec 11, 2015 at 2:05 PM, prateek arora 
wrote:

>
>
> Hi
>
> I am trying to access Spark Using REST API but got below error :
>
> Command :
>
> curl http://:18088/api/v1/applications
>
> Response:
>
>
> 
> 
> 
> Error 503 Service Unavailable
> 
> 
> HTTP ERROR 503
>
> Problem accessing /api/v1/applications. Reason:
> Service Unavailable
> Caused by:
> org.spark-project.jetty.servlet.ServletHolder$1:
> java.lang.reflect.InvocationTargetException
> at
>
> org.spark-project.jetty.servlet.ServletHolder.makeUnavailable(ServletHolder.java:496)
> at
>
> org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:543)
> at
>
> org.spark-project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:415)
> at
>
> org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:657)
> at
>
> org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
> at
>
> org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
> at
>
> org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
> at
>
> org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
> at
>
> org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at
>
> org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:301)
> at
>
> org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at
>
> org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.spark-project.jetty.server.Server.handle(Server.java:370)
> at
>
> org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
> at
>
> org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
> at
>
> org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
> at
> org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
> at
> org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at
>
> org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
> at
>
> org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
> at
>
> org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
> at
>
> org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at
>
> org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
>
> com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:728)
> at
>
> com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:678)
> at
>
> com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203)
> at
>
> com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:373)
> at
>
> com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:556)
> at javax.servlet.GenericServlet.init(GenericServlet.java:244)
> at
>
> org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:532)
> ... 22 more
> Caused by: java.lang.NoSuchMethodError:
>
> com.sun.jersey.core.reflection.ReflectionHelper.getOsgiRegistryInstance()Lcom/sun/jersey/core/osgi/OsgiRegistry;
> at
>
> com.sun.jersey.spi.scanning.AnnotationScannerListener$AnnotatedClassVisitor.getClassForName(AnnotationScannerListener.java:217)
> at
>
> com.sun.jersey.spi.scanning.AnnotationScannerListener$AnnotatedClassVisitor.visitEnd(AnnotationScannerListener.java:186)
> at org.objectweb.asm.ClassReader.accept(Unknown Source)
> at org.objectweb.asm.ClassReader.accept(Unknown Source)
> at
>
> com.sun.jersey.spi.scanning.AnnotationScannerListener.onProcess(AnnotationScannerListener.java:136)
> at
>
> com.sun.jersey.core.spi.scanning.JarFileScanner.scan(JarFileScanner.java:97)
> at
>
> com.sun

Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
I tried adding a shutdown hook to my code, but it didn't help. Still the same issue.


On Fri, Nov 20, 2015 at 7:08 PM, Ted Yu  wrote:

> Which Spark release are you using ?
>
> Can you pastebin the stack trace of the process running on your machine ?
>
> Thanks
>
> On Nov 20, 2015, at 6:46 PM, Vikram Kone  wrote:
>
> Hi,
> I'm seeing a strange problem. I have a spark cluster in standalone mode. I
> submit spark jobs from a remote node as follows from the terminal
>
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
>
> when the app is running , when I press ctrl-C on the console terminal,
> then the process is killed and so is the app in the spark master UI. When I
> go to spark master ui, i see that this app is in state Killed under
> Completed applications, which is what I expected to see.
>
> Now, I created a shell script as follows to do the same
>
> #!/bin/bash
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
> echo $! > my.pid
>
> When I execute the shell script from terminal, as follows
>
> $> bash myscript.sh
>
> The application is submitted correctly to spark master and I can see it as
> one of the running apps in teh spark master ui. But when I kill the process
> in my terminal as follows
>
> $> ps kill $(cat my.pid)
>
> I see that the process is killed on my machine but the spark appliation is
> still running in spark master! It doesn't get killed.
>
> I noticed one more thing that, when I launch the spark job via shell
> script and kill the application from spark master UI by clicking on "kill"
> next to the running application, it gets killed in spark ui but I still see
> the process running in my machine.
>
> In both cases, I would expect the remote spark app to be killed and my
> local process to be killed.
>
> Why is this happening? and how can I kill a spark app from the terminal
> launced via shell script w.o going to the spark master UI?
>
> I want to launch the spark app via script and log the pid so i can monitor
> it remotely
>
> thanks for the help
>
>


Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Thanks for the info, Stephane.
Why does Ctrl-C in the terminal running spark-submit kill the app in the Spark
master correctly without any explicit shutdown hooks in the code? Can you
explain why we need to add the shutdown hook to kill it when it is launched
via a shell script?
For the second issue, I'm not using any thread pool, so I'm not sure why
killing the app in the Spark UI doesn't kill the process launched via the script.

On Friday, November 20, 2015, Stéphane Verlet 
wrote:

> I solved the first issue by adding a shutdown hook in my code. The
> shutdown hook gets called when you exit your script (Ctrl-C, kill … but not
> kill -9).
>
> val shutdownHook = scala.sys.addShutdownHook {
>   try {
>     sparkContext.stop()
>     // Make sure to kill any other threads or thread pools you may be running
>   } catch {
>     case e: Exception =>
>       // ...
>   }
> }
>
> For the other issue , kill from the UI. I also had the issue. This was
> caused by a thread pool that I use.
>
> So I surrounded my code with try/finally block to guarantee that the
> thread pool was shutdown when spark stopped
>
> I hopes this help
>
> Stephane
> ​
>
> On Fri, Nov 20, 2015 at 7:46 PM, Vikram Kone  > wrote:
>
>> Hi,
>> I'm seeing a strange problem. I have a spark cluster in standalone mode.
>> I submit spark jobs from a remote node as follows from the terminal
>>
>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>> spark-jobs.jar
>>
>> when the app is running , when I press ctrl-C on the console terminal,
>> then the process is killed and so is the app in the spark master UI. When I
>> go to spark master ui, i see that this app is in state Killed under
>> Completed applications, which is what I expected to see.
>>
>> Now, I created a shell script as follows to do the same
>>
>> #!/bin/bash
>> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
>> spark-jobs.jar
>> echo $! > my.pid
>>
>> When I execute the shell script from terminal, as follows
>>
>> $> bash myscript.sh
>>
>> The application is submitted correctly to spark master and I can see it
>> as one of the running apps in teh spark master ui. But when I kill the
>> process in my terminal as follows
>>
>> $> ps kill $(cat my.pid)
>>
>> I see that the process is killed on my machine but the spark appliation
>> is still running in spark master! It doesn't get killed.
>>
>> I noticed one more thing that, when I launch the spark job via shell
>> script and kill the application from spark master UI by clicking on "kill"
>> next to the running application, it gets killed in spark ui but I still see
>> the process running in my machine.
>>
>> In both cases, I would expect the remote spark app to be killed and my
>> local process to be killed.
>>
>> Why is this happening? and how can I kill a spark app from the terminal
>> launced via shell script w.o going to the spark master UI?
>>
>> I want to launch the spark app via script and log the pid so i can
>> monitor it remotely
>>
>> thanks for the help
>>
>>
>


Re: How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Spark 1.4.1

On Friday, November 20, 2015, Ted Yu  wrote:

> Which Spark release are you using ?
>
> Can you pastebin the stack trace of the process running on your machine ?
>
> Thanks
>
> On Nov 20, 2015, at 6:46 PM, Vikram Kone  > wrote:
>
> Hi,
> I'm seeing a strange problem. I have a spark cluster in standalone mode. I
> submit spark jobs from a remote node as follows from the terminal
>
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
>
> when the app is running , when I press ctrl-C on the console terminal,
> then the process is killed and so is the app in the spark master UI. When I
> go to spark master ui, i see that this app is in state Killed under
> Completed applications, which is what I expected to see.
>
> Now, I created a shell script as follows to do the same
>
> #!/bin/bash
> spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
> spark-jobs.jar
> echo $! > my.pid
>
> When I execute the shell script from terminal, as follows
>
> $> bash myscript.sh
>
> The application is submitted correctly to spark master and I can see it as
> one of the running apps in teh spark master ui. But when I kill the process
> in my terminal as follows
>
> $> ps kill $(cat my.pid)
>
> I see that the process is killed on my machine but the spark appliation is
> still running in spark master! It doesn't get killed.
>
> I noticed one more thing that, when I launch the spark job via shell
> script and kill the application from spark master UI by clicking on "kill"
> next to the running application, it gets killed in spark ui but I still see
> the process running in my machine.
>
> In both cases, I would expect the remote spark app to be killed and my
> local process to be killed.
>
> Why is this happening? and how can I kill a spark app from the terminal
> launced via shell script w.o going to the spark master UI?
>
> I want to launch the spark app via script and log the pid so i can monitor
> it remotely
>
> thanks for the help
>
>


How to kill spark applications submitted using spark-submit reliably?

2015-11-20 Thread Vikram Kone
Hi,
I'm seeing a strange problem. I have a Spark cluster in standalone mode, and I
submit Spark jobs from a remote node via the terminal as follows:

spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
spark-jobs.jar

While the app is running, if I press Ctrl-C in the console terminal, the
process is killed and so is the app in the Spark master UI. When I go to the
Spark master UI, I see the app in state Killed under Completed applications,
which is what I expected to see.

Now, I created a shell script as follows to do the same

#!/bin/bash
spark-submit --master spark://10.1.40.18:7077  --class com.test.Ping
spark-jobs.jar
echo $! > my.pid

When I execute the shell script from terminal, as follows

$> bash myscript.sh

The application is submitted correctly to the Spark master and I can see it as
one of the running apps in the Spark master UI. But when I kill the process
in my terminal as follows

$> ps kill $(cat my.pid)

I see that the process is killed on my machine, but the Spark application is
still running on the Spark master! It doesn't get killed.

I noticed one more thing: when I launch the Spark job via the shell script
and kill the application from the Spark master UI by clicking "kill" next to
the running application, it gets killed in the Spark UI, but I still see the
process running on my machine.

In both cases, I would expect the remote spark app to be killed and my
local process to be killed.

Why is this happening? And how can I kill a Spark app launched via a shell
script from the terminal, without going to the Spark master UI?

I want to launch the Spark app via a script and log the PID so I can monitor
it remotely.

thanks for the help
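
For comparison, a minimal sketch of the launcher script with one likely fix
(an assumption based on the behaviour described: in the script above
spark-submit runs in the foreground, so $! never holds its PID):

#!/bin/bash
# Run spark-submit in the background so that $! really is its PID.
spark-submit --master spark://10.1.40.18:7077 --class com.test.Ping spark-jobs.jar &
echo $! > my.pid
# Optionally block until it finishes, so the script behaves as before.
wait "$(cat my.pid)"

Killing that PID only signals the local spark-submit/driver process; whether
the application also disappears from the master depends on the driver shutting
its SparkContext down cleanly. If cluster deploy mode is an option,
spark-submit also accepts --kill <submissionId> against a standalone master,
which asks the master itself to stop the driver.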


Re: Spark job workflow engine recommendations

2015-11-18 Thread Vikram Kone
Hi Feng,
Does Airflow allow remote submission of Spark jobs via spark-submit?
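
For context, a rough sketch of how a spark-submit call is usually wrapped in
an Airflow task (illustrative only: the DAG and task names are made up, the
import path differs between Airflow versions, and the submission is only as
"remote" as the spark-submit command you point it at):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('spark_ping', start_date=datetime(2015, 11, 1),
          schedule_interval='@daily')

# Airflow just shells out here; spark-submit itself talks to the remote master.
submit_ping = BashOperator(
    task_id='submit_ping',
    bash_command='spark-submit --master spark://10.1.40.18:7077 '
                 '--class com.test.Ping /path/to/spark-jobs.jar',
    dag=dag)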

On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu 
wrote:

> Hi,
>
> we use ‘Airflow'  as our job workflow scheduler.
>
>
>
>
> On Nov 19, 2015, at 9:47 AM, Vikram Kone  wrote:
>
> Hi Nick,
> Quick question about spark-submit command executed from azkaban with
> command job type.
> I see that when I press kill in azkaban portal on a spark-submit job, it
> doesn't actually kill the application on spark master and it continues to
> run even though azkaban thinks that it's killed.
> How do you get around this? Is there a way to kill the spark-submit jobs
> from azkaban portal?
>
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath 
> wrote:
>
>> Hi Vikram,
>>
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
>> local mode deployment and it is fairly easy to set up. It is pretty easy to
>> use and has a nice scheduling and logging interface, as well as SLAs (like
>> kill job and notify if it doesn't complete in 3 hours or whatever).
>>
>> However Spark support is not present directly - we run everything with
>> shell scripts and spark-submit. There is a plugin interface where one could
>> create a Spark plugin, but I found it very cumbersome when I did
>> investigate and didn't have the time to work through it to develop that.
>>
>> It has some quirks and while there is actually a REST API for adding jos
>> and dynamically scheduling jobs, it is not documented anywhere so you kinda
>> have to figure it out for yourself. But in terms of ease of use I found it
>> way better than Oozie. I haven't tried Chronos, and it seemed quite
>> involved to set up. Haven't tried Luigi either.
>>
>> Spark job server is good but as you say lacks some stuff like scheduling
>> and DAG type workflows (independent of spark-defined job flows).
>>
>>
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke  wrote:
>>
>>> Check also falcon in combination with oozie
>>>
>>> Le ven. 7 août 2015 à 17:51, Hien Luu  a
>>> écrit :
>>>
>>>> Looks like Oozie can satisfy most of your requirements.
>>>>
>>>>
>>>>
>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I'm looking for open source workflow tools/engines that allow us to
>>>>> schedule spark jobs on a datastax cassandra cluster. Since there are 
>>>>> tonnes
>>>>> of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I
>>>>> wanted to check with people here to see what they are using today.
>>>>>
>>>>> Some of the requirements of the workflow engine that I'm looking for
>>>>> are
>>>>>
>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not
>>>>> some wrapper Java code to submit tasks.
>>>>> 2. Active open source community support and well tested at production
>>>>> scale.
>>>>> 3. Should be dead easy to write job dependencices using XML or web
>>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>>>>> C are finished. Don't need to write full blown java applications to 
>>>>> specify
>>>>> job parameters and dependencies. Should be very simple to use.
>>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given
>>>>> time every hour or day or week or month.
>>>>> 5. Job monitoring, alerting on failures and email notifications on
>>>>> daily basis.
>>>>>
>>>>> I have looked at Ooyala's spark job server which seems to be hated
>>>>> towards making spark jobs run faster by sharing contexts between the jobs
>>>>> but isn't a full blown workflow engine per se. A combination of spark job
>>>>> server and workflow engine would be ideal
>>>>>
>>>>> Thanks for the inputs
>>>>>
>>>>
>>>>
>>
>
>


Re: Spark job workflow engine recommendations

2015-11-18 Thread Vikram Kone
Hi Nick,
A quick question about the spark-submit command executed from Azkaban with
the command job type.
I see that when I press kill in the Azkaban portal on a spark-submit job, it
doesn't actually kill the application on the Spark master; it continues to
run even though Azkaban thinks it has been killed.
How do you get around this? Is there a way to kill spark-submit jobs
from the Azkaban portal?

On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath 
wrote:

> Hi Vikram,
>
> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
> local mode deployment and it is fairly easy to set up. It is pretty easy to
> use and has a nice scheduling and logging interface, as well as SLAs (like
> kill job and notify if it doesn't complete in 3 hours or whatever).
>
> However Spark support is not present directly - we run everything with
> shell scripts and spark-submit. There is a plugin interface where one could
> create a Spark plugin, but I found it very cumbersome when I did
> investigate and didn't have the time to work through it to develop that.
>
> It has some quirks and while there is actually a REST API for adding jos
> and dynamically scheduling jobs, it is not documented anywhere so you kinda
> have to figure it out for yourself. But in terms of ease of use I found it
> way better than Oozie. I haven't tried Chronos, and it seemed quite
> involved to set up. Haven't tried Luigi either.
>
> Spark job server is good but as you say lacks some stuff like scheduling
> and DAG type workflows (independent of spark-defined job flows).
>
>
> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke  wrote:
>
>> Check also falcon in combination with oozie
>>
>> Le ven. 7 août 2015 à 17:51, Hien Luu  a
>> écrit :
>>
>>> Looks like Oozie can satisfy most of your requirements.
>>>
>>>
>>>
>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone 
>>> wrote:
>>>
>>>> Hi,
>>>> I'm looking for open source workflow tools/engines that allow us to
>>>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>>>> of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I
>>>> wanted to check with people here to see what they are using today.
>>>>
>>>> Some of the requirements of the workflow engine that I'm looking for are
>>>>
>>>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>>>> wrapper Java code to submit tasks.
>>>> 2. Active open source community support and well tested at production
>>>> scale.
>>>> 3. Should be dead easy to write job dependencices using XML or web
>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>>>> C are finished. Don't need to write full blown java applications to specify
>>>> job parameters and dependencies. Should be very simple to use.
>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
>>>> every hour or day or week or month.
>>>> 5. Job monitoring, alerting on failures and email notifications on
>>>> daily basis.
>>>>
>>>> I have looked at Ooyala's spark job server which seems to be hated
>>>> towards making spark jobs run faster by sharing contexts between the jobs
>>>> but isn't a full blown workflow engine per se. A combination of spark job
>>>> server and workflow engine would be ideal
>>>>
>>>> Thanks for the inputs
>>>>
>>>
>>>
>


Re: Spark job workflow engine recommendations

2015-10-07 Thread Vikram Kone
Hien,
I saw this pull request, and from what I understand it is geared towards
running Spark jobs over Hadoop. We are using Spark over Cassandra, and I'm not
sure whether this new job type supports that. I haven't seen any documentation
on how to use this Spark job plugin, so I can't test it out on our cluster.
We are currently submitting our Spark jobs using the command job type, with a
command like "dse spark-submit --class com.org.classname ./test.jar". What
would be the advantage of using the native Spark job type over the command
job type?

I didn't understand from your reply whether Azkaban already supports
long-running jobs like Spark Streaming. Does it? Streaming jobs generally need
to run indefinitely, and need to be restarted if they fail for some reason
(lack of resources, maybe). I can probably use the auto-retry feature for
this, but I'm not sure.

I'm looking forward to the multiple-executor support, which will greatly help
with the scalability issue.
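
For reference, a command-type job like the one above is just a small
properties file (a sketch with made-up job names; dependencies= is how
run-after relationships are expressed, and retries/retry.backoff are the
auto-retry knobs mentioned above):

# test-spark.job -- packaged into the flow zip that gets uploaded to Azkaban
type=command
command=dse spark-submit --class com.org.classname ./test.jar
# Only run after these jobs in the same flow have succeeded (names are hypothetical).
dependencies=prepare-data,load-dimensions
# Auto-retry on failure: number of attempts and backoff in milliseconds.
retries=3
retry.backoff=60000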

On Wed, Oct 7, 2015 at 9:56 AM, Hien Luu  wrote:

> The spark job type was added recently - see this pull request
> https://github.com/azkaban/azkaban-plugins/pull/195.  You can leverage
> the SLA feature to kill a job if it ran longer than expected.
>
> BTW, we just solved the scalability issue by supporting multiple
> executors.  Within a week or two, the code for that should be merged in the
> main trunk.
>
> Hien
>
> On Tue, Oct 6, 2015 at 9:40 PM, Vikram Kone  wrote:
>
>> Does Azkaban support scheduling long running jobs like spark steaming
>> jobs? Will Azkaban kill a job if it's running for a long time.
>>
>>
>> On Friday, August 7, 2015, Vikram Kone  wrote:
>>
>>> Hien,
>>> Is Azkaban being phased out at linkedin as rumored? If so, what's
>>> linkedin going to use for workflow scheduling? Is there something else
>>> that's going to replace Azkaban?
>>>
>>> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu  wrote:
>>>
>>>> In my opinion, choosing some particular project among its peers should
>>>> leave enough room for future growth (which may come faster than you
>>>> initially think).
>>>>
>>>> Cheers
>>>>
>>>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu  wrote:
>>>>
>>>>> Scalability is a known issue due the the current architecture.
>>>>> However this will be applicable if you run more 20K jobs per day.
>>>>>
>>>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu  wrote:
>>>>>
>>>>>> From what I heard (an ex-coworker who is Oozie committer), Azkaban
>>>>>> is being phased out at LinkedIn because of scalability issues (though
>>>>>> UI-wise, Azkaban seems better).
>>>>>>
>>>>>> Vikram:
>>>>>> I suggest you do more research in related projects (maybe using their
>>>>>> mailing lists).
>>>>>>
>>>>>> Disclaimer: I don't work for LinkedIn.
>>>>>>
>>>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <
>>>>>> nick.pentre...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vikram,
>>>>>>>
>>>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We
>>>>>>> just use local mode deployment and it is fairly easy to set up. It is
>>>>>>> pretty easy to use and has a nice scheduling and logging interface, as 
>>>>>>> well
>>>>>>> as SLAs (like kill job and notify if it doesn't complete in 3 hours or
>>>>>>> whatever).
>>>>>>>
>>>>>>> However Spark support is not present directly - we run everything
>>>>>>> with shell scripts and spark-submit. There is a plugin interface where 
>>>>>>> one
>>>>>>> could create a Spark plugin, but I found it very cumbersome when I did
>>>>>>> investigate and didn't have the time to work through it to develop that.
>>>>>>>
>>>>>>> It has some quirks and while there is actually a REST API for adding
>>>>>>> jos and dynamically scheduling jobs, it is not documented anywhere so 
>>>>>>> you
>>>>>>> kinda have to figure it out for yourself. But in terms of ease of use I
>>>>>>> found it way better than Oozie. I haven't tried Chronos, and it seemed
>>>>>>> quite involved to set up. Haven'

Re: Spark job workflow engine recommendations

2015-10-06 Thread Vikram Kone
Does Azkaban support scheduling long-running jobs like Spark Streaming jobs?
Will Azkaban kill a job if it's running for a long time?

On Friday, August 7, 2015, Vikram Kone  wrote:

> Hien,
> Is Azkaban being phased out at linkedin as rumored? If so, what's linkedin
> going to use for workflow scheduling? Is there something else that's going
> to replace Azkaban?
>
> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu  > wrote:
>
>> In my opinion, choosing some particular project among its peers should
>> leave enough room for future growth (which may come faster than you
>> initially think).
>>
>> Cheers
>>
>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu > > wrote:
>>
>>> Scalability is a known issue due the the current architecture.  However
>>> this will be applicable if you run more 20K jobs per day.
>>>
>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu >> > wrote:
>>>
>>>> From what I heard (an ex-coworker who is Oozie committer), Azkaban is
>>>> being phased out at LinkedIn because of scalability issues (though UI-wise,
>>>> Azkaban seems better).
>>>>
>>>> Vikram:
>>>> I suggest you do more research in related projects (maybe using their
>>>> mailing lists).
>>>>
>>>> Disclaimer: I don't work for LinkedIn.
>>>>
>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Vikram,
>>>>>
>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We just
>>>>> use local mode deployment and it is fairly easy to set up. It is pretty
>>>>> easy to use and has a nice scheduling and logging interface, as well as
>>>>> SLAs (like kill job and notify if it doesn't complete in 3 hours or
>>>>> whatever).
>>>>>
>>>>> However Spark support is not present directly - we run everything with
>>>>> shell scripts and spark-submit. There is a plugin interface where one 
>>>>> could
>>>>> create a Spark plugin, but I found it very cumbersome when I did
>>>>> investigate and didn't have the time to work through it to develop that.
>>>>>
>>>>> It has some quirks and while there is actually a REST API for adding
>>>>> jos and dynamically scheduling jobs, it is not documented anywhere so you
>>>>> kinda have to figure it out for yourself. But in terms of ease of use I
>>>>> found it way better than Oozie. I haven't tried Chronos, and it seemed
>>>>> quite involved to set up. Haven't tried Luigi either.
>>>>>
>>>>> Spark job server is good but as you say lacks some stuff like
>>>>> scheduling and DAG type workflows (independent of spark-defined job 
>>>>> flows).
>>>>>
>>>>>
>>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke >>>> > wrote:
>>>>>
>>>>>> Check also falcon in combination with oozie
>>>>>>
>>>>>> Le ven. 7 août 2015 à 17:51, Hien Luu  a
>>>>>> écrit :
>>>>>>
>>>>>>> Looks like Oozie can satisfy most of your requirements.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone >>>>>> > wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I'm looking for open source workflow tools/engines that allow us to
>>>>>>>> schedule spark jobs on a datastax cassandra cluster. Since there are 
>>>>>>>> tonnes
>>>>>>>> of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I
>>>>>>>> wanted to check with people here to see what they are using today.
>>>>>>>>
>>>>>>>> Some of the requirements of the workflow engine that I'm looking
>>>>>>>> for are
>>>>>>>>
>>>>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not
>>>>>>>> some wrapper Java code to submit tasks.
>>>>>>>> 2. Active open source community support and well tested at
>>>>>>>> production scale.
>>>>>>>> 3. Should be dead easy to write job dependencices using XML or web
>>>>>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B 
>>>>>>>> and
>>>>>>>> C are finished. Don't need to write full blown java applications to 
>>>>>>>> specify
>>>>>>>> job parameters and dependencies. Should be very simple to use.
>>>>>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given
>>>>>>>> time every hour or day or week or month.
>>>>>>>> 5. Job monitoring, alerting on failures and email notifications on
>>>>>>>> daily basis.
>>>>>>>>
>>>>>>>> I have looked at Ooyala's spark job server which seems to be hated
>>>>>>>> towards making spark jobs run faster by sharing contexts between the 
>>>>>>>> jobs
>>>>>>>> but isn't a full blown workflow engine per se. A combination of spark 
>>>>>>>> job
>>>>>>>> server and workflow engine would be ideal
>>>>>>>>
>>>>>>>> Thanks for the inputs
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Notification on Spark Streaming job failure

2015-10-06 Thread Vikram Kone
We are using Monit to kick off Spark Streaming jobs and it seems to work fine.
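
For anyone curious, the Monit side is a short stanza per job (a sketch with
made-up paths; the start/stop scripts are assumed to wrap spark-submit and
manage the pid file):

check process spark-streaming-app with pidfile /var/run/spark-streaming-app.pid
  start program = "/opt/jobs/start-streaming-app.sh"
  stop program  = "/opt/jobs/stop-streaming-app.sh"
  if 3 restarts within 5 cycles then unmonitor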

On Monday, September 28, 2015, Chen Song  wrote:

> I am also interested specifically in monitoring and alerting on Spark
> streaming jobs. It will be helpful to get some general guidelines or advice
> on this, from people who implemented anything on this.
>
> On Fri, Sep 18, 2015 at 2:35 AM, Krzysztof Zarzycki  > wrote:
>
>> Hi there Spark Community,
>> I would like to ask you for an advice: I'm running Spark Streaming jobs
>> in production. Sometimes these jobs fail and I would like to get email
>> notification about it. Do you know how I can set up Spark to notify me by
>> email if my job fails? Or do I have to use external monitoring tool?
>> I'm thinking of the following options:
>> 1. As I'm running those jobs on YARN, monitor somehow YARN jobs. Looked
>> for it as well but couldn't find any YARN feature to do it.
>> 2. Run Spark Streaming job in some scheduler, like Oozie, Azkaban, Luigi.
>> Those are created rather for batch jobs, not streaming, but could work. Has
>> anyone tried that?
>> 3. Run job driver under "monit" tool and catch the failure and send an
>> email about it. Currently I'm deploying with yarn-cluster mode and I would
>> need to resign from it to run under monit
>> 4. Implement monitoring tool (like Graphite, Ganglia, Prometheus) and use
>> Spark metrics. And then implement alerting in those. Can I get information
>> of failed jobs in Spark metrics?
>> 5. As 4. but implement my own custom job metrics and monitor them.
>>
>> What's your opinion about my options? How do you people solve this
>> problem? Anything Spark specific?
>> I'll be grateful for any advice in this subject.
>> Thanks!
>> Krzysiek
>>
>>
>
>
> --
> Chen Song
>
>


How to run spark in standalone mode on cassandra with high availability?

2015-08-15 Thread Vikram Kone
Hi,
We are planning to install Spark in standalone mode on a Cassandra cluster.
The problem is that Cassandra has a no-SPOF architecture, i.e. any node can
act as the master for the cluster, whereas this creates a problem for the
Spark master, since Spark standalone is not a peer-to-peer architecture where
any node can become the master.

What are our options here? Are there any frameworks or tools out there that
would allow any application to run on a cluster of machines with high
availability?
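
One option worth noting for the Spark master itself (independent of
Cassandra): the standalone master supports ZooKeeper-based leader election, so
several masters can run and a standby takes over if the active one dies. A
sketch of the relevant settings in spark-env.sh on every master node (the
ZooKeeper hosts are placeholders):

# spark-env.sh on every master node (active and standby)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

Workers and applications are then pointed at the full list of masters, e.g.
spark://master1:7077,master2:7077.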


Re: Spark job workflow engine recommendations

2015-08-11 Thread Vikram Kone
r job does not
start?
* Do you need high availability for job scheduling? That will require
additional components.


This became a bit of a brain dump on the topic. I hope that it is
useful. Don't hesitate to get back if I can help.

Regards,

Lars Albertsson



On Fri, Aug 7, 2015 at 5:43 PM, Vikram Kone  wrote:
> Hi,
> I'm looking for open source workflow tools/engines that allow us to schedule
> spark jobs on a datastax cassandra cluster. Since there are tonnes of
> alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I wanted to
> check with people here to see what they are using today.
>
> Some of the requirements of the workflow engine that I'm looking for are
>
> 1. First class support for submitting Spark jobs on Cassandra. Not some
> wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production scale.
> 3. Should be dead easy to write job dependencices using XML or web interface
> . Ex; job A depends on Job B and Job C, so run Job A after B and C are
> finished. Don't need to write full blown java applications to specify job
> parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
> every hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on daily
> basis.
>
> I have looked at Ooyala's spark job server which seems to be hated towards
> making spark jobs run faster by sharing contexts between the jobs but isn't
> a full blown workflow engine per se. A combination of spark job server and
> workflow engine would be ideal
>
> Thanks for the inputs

Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Hien,
Is Azkaban being phased out at LinkedIn, as rumored? If so, what is LinkedIn
going to use for workflow scheduling? Is there something else that's going
to replace Azkaban?

On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu  wrote:

> In my opinion, choosing some particular project among its peers should
> leave enough room for future growth (which may come faster than you
> initially think).
>
> Cheers
>
> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu  wrote:
>
>> Scalability is a known issue due the the current architecture.  However
>> this will be applicable if you run more 20K jobs per day.
>>
>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu  wrote:
>>
>>> From what I heard (an ex-coworker who is Oozie committer), Azkaban is
>>> being phased out at LinkedIn because of scalability issues (though UI-wise,
>>> Azkaban seems better).
>>>
>>> Vikram:
>>> I suggest you do more research in related projects (maybe using their
>>> mailing lists).
>>>
>>> Disclaimer: I don't work for LinkedIn.
>>>
>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
>>>> Hi Vikram,
>>>>
>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We just
>>>> use local mode deployment and it is fairly easy to set up. It is pretty
>>>> easy to use and has a nice scheduling and logging interface, as well as
>>>> SLAs (like kill job and notify if it doesn't complete in 3 hours or
>>>> whatever).
>>>>
>>>> However Spark support is not present directly - we run everything with
>>>> shell scripts and spark-submit. There is a plugin interface where one could
>>>> create a Spark plugin, but I found it very cumbersome when I did
>>>> investigate and didn't have the time to work through it to develop that.
>>>>
>>>> It has some quirks and while there is actually a REST API for adding
>>>> jos and dynamically scheduling jobs, it is not documented anywhere so you
>>>> kinda have to figure it out for yourself. But in terms of ease of use I
>>>> found it way better than Oozie. I haven't tried Chronos, and it seemed
>>>> quite involved to set up. Haven't tried Luigi either.
>>>>
>>>> Spark job server is good but as you say lacks some stuff like
>>>> scheduling and DAG type workflows (independent of spark-defined job flows).
>>>>
>>>>
>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke 
>>>> wrote:
>>>>
>>>>> Check also falcon in combination with oozie
>>>>>
>>>>> Le ven. 7 août 2015 à 17:51, Hien Luu  a
>>>>> écrit :
>>>>>
>>>>>> Looks like Oozie can satisfy most of your requirements.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I'm looking for open source workflow tools/engines that allow us to
>>>>>>> schedule spark jobs on a datastax cassandra cluster. Since there are 
>>>>>>> tonnes
>>>>>>> of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I
>>>>>>> wanted to check with people here to see what they are using today.
>>>>>>>
>>>>>>> Some of the requirements of the workflow engine that I'm looking for
>>>>>>> are
>>>>>>>
>>>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not
>>>>>>> some wrapper Java code to submit tasks.
>>>>>>> 2. Active open source community support and well tested at
>>>>>>> production scale.
>>>>>>> 3. Should be dead easy to write job dependencices using XML or web
>>>>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B 
>>>>>>> and
>>>>>>> C are finished. Don't need to write full blown java applications to 
>>>>>>> specify
>>>>>>> job parameters and dependencies. Should be very simple to use.
>>>>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given
>>>>>>> time every hour or day or week or month.
>>>>>>> 5. Job monitoring, alerting on failures and email notifications on
>>>>>>> daily basis.
>>>>>>>
>>>>>>> I have looked at Ooyala's spark job server which seems to be hated
>>>>>>> towards making spark jobs run faster by sharing contexts between the 
>>>>>>> jobs
>>>>>>> but isn't a full blown workflow engine per se. A combination of spark 
>>>>>>> job
>>>>>>> server and workflow engine would be ideal
>>>>>>>
>>>>>>> Thanks for the inputs
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>


Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Oh, OK. That's a good enough reason against Azkaban then. So it looks like
Oozie is the best choice here.

On Friday, August 7, 2015, Ted Yu  wrote:

> From what I heard (an ex-coworker who is Oozie committer), Azkaban is
> being phased out at LinkedIn because of scalability issues (though UI-wise,
> Azkaban seems better).
>
> Vikram:
> I suggest you do more research in related projects (maybe using their
> mailing lists).
>
> Disclaimer: I don't work for LinkedIn.
>
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath  > wrote:
>
>> Hi Vikram,
>>
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
>> local mode deployment and it is fairly easy to set up. It is pretty easy to
>> use and has a nice scheduling and logging interface, as well as SLAs (like
>> kill job and notify if it doesn't complete in 3 hours or whatever).
>>
>> However Spark support is not present directly - we run everything with
>> shell scripts and spark-submit. There is a plugin interface where one could
>> create a Spark plugin, but I found it very cumbersome when I did
>> investigate and didn't have the time to work through it to develop that.
>>
>> It has some quirks and while there is actually a REST API for adding jos
>> and dynamically scheduling jobs, it is not documented anywhere so you kinda
>> have to figure it out for yourself. But in terms of ease of use I found it
>> way better than Oozie. I haven't tried Chronos, and it seemed quite
>> involved to set up. Haven't tried Luigi either.
>>
>> Spark job server is good but as you say lacks some stuff like scheduling
>> and DAG type workflows (independent of spark-defined job flows).
>>
>>
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke > > wrote:
>>
>>> Check also falcon in combination with oozie
>>>
>>> Le ven. 7 août 2015 à 17:51, Hien Luu  a
>>> écrit :
>>>
>>>> Looks like Oozie can satisfy most of your requirements.
>>>>
>>>>
>>>>
>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone >>> > wrote:
>>>>
>>>>> Hi,
>>>>> I'm looking for open source workflow tools/engines that allow us to
>>>>> schedule spark jobs on a datastax cassandra cluster. Since there are 
>>>>> tonnes
>>>>> of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I
>>>>> wanted to check with people here to see what they are using today.
>>>>>
>>>>> Some of the requirements of the workflow engine that I'm looking for
>>>>> are
>>>>>
>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not
>>>>> some wrapper Java code to submit tasks.
>>>>> 2. Active open source community support and well tested at production
>>>>> scale.
>>>>> 3. Should be dead easy to write job dependencices using XML or web
>>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>>>>> C are finished. Don't need to write full blown java applications to 
>>>>> specify
>>>>> job parameters and dependencies. Should be very simple to use.
>>>>> 4. Time based  recurrent scheduling. Run the spark jobs at a given
>>>>> time every hour or day or week or month.
>>>>> 5. Job monitoring, alerting on failures and email notifications on
>>>>> daily basis.
>>>>>
>>>>> I have looked at Ooyala's spark job server which seems to be hated
>>>>> towards making spark jobs run faster by sharing contexts between the jobs
>>>>> but isn't a full blown workflow engine per se. A combination of spark job
>>>>> server and workflow engine would be ideal
>>>>>
>>>>> Thanks for the inputs
>>>>>
>>>>
>>>>
>>
>


Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Thanks for the suggestion, Hien. I'm curious why not Azkaban from LinkedIn.
From what I read online, Oozie is very cumbersome to set up and use compared
to Azkaban. Since you are from LinkedIn, I wanted to get some perspective on
what Azkaban lacks compared to Oozie. Ease of use is more important to me
than a full feature set.

On Friday, August 7, 2015, Hien Luu  wrote:

> Looks like Oozie can satisfy most of your requirements.
>
>
>
> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone  > wrote:
>
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to
>> schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
>> of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I
>> wanted to check with people here to see what they are using today.
>>
>> Some of the requirements of the workflow engine that I'm looking for are
>>
>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>> wrapper Java code to submit tasks.
>> 2. Active open source community support and well tested at production
>> scale.
>> 3. Should be dead easy to write job dependencices using XML or web
>> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
>> C are finished. Don't need to write full blown java applications to specify
>> job parameters and dependencies. Should be very simple to use.
>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time
>> every hour or day or week or month.
>> 5. Job monitoring, alerting on failures and email notifications on daily
>> basis.
>>
>> I have looked at Ooyala's spark job server which seems to be hated
>> towards making spark jobs run faster by sharing contexts between the jobs
>> but isn't a full blown workflow engine per se. A combination of spark job
>> server and workflow engine would be ideal
>>
>> Thanks for the inputs
>>
>
>


Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Hi,
I'm looking for open source workflow tools/engines that allow us to
schedule Spark jobs on a DataStax Cassandra cluster. Since there are tonnes
of alternatives out there like Oozie, Azkaban, Luigi, Chronos, etc., I
wanted to check with people here to see what they are using today.

Some of the requirements of the workflow engine that I'm looking for are

1. First-class support for submitting Spark jobs on Cassandra, not some
wrapper Java code to submit tasks.
2. Active open source community support and well tested at production scale.
3. Should be dead easy to write job dependencies using XML or a web
interface. E.g. job A depends on Job B and Job C, so run Job A after B and
C are finished. We shouldn't need to write full-blown Java applications to
specify job parameters and dependencies. Should be very simple to use.
4. Time-based recurring scheduling. Run the Spark jobs at a given time
every hour, day, week, or month.
5. Job monitoring, alerting on failures, and email notifications on a daily
basis.

I have looked at Ooyala's spark job server, which seems to be geared towards
making Spark jobs run faster by sharing contexts between the jobs, but it
isn't a full-blown workflow engine per se. A combination of spark job server
and a workflow engine would be ideal.

Thanks for the inputs