Re: Can we use spark inside a web service?

2016-03-15 Thread Andrés Ivaldi
Thanks, Evan, for the points. I had assumed as much, but since I don't have
enough experience I thought I might be missing something. Thanks for the
answer!

On Mon, Mar 14, 2016 at 7:22 PM, Evan Chan  wrote:

> Andres,
>
> A couple points:
>
> 1) If you look at my post, you can see that you could use Spark for
> low latency - many queries can be executed in under a second, with the
> right technology.  It really depends on the definition of "real time",
> but I believe low latency is definitely possible.
> 2) Akka-http over a SparkContext - this is essentially what Spark Job
> Server does (it uses Spray, which is the predecessor to akka-http;
> we will upgrade once Spark 2.0 is incorporated).
> 3) Someone else can probably talk about Ignite, but it is based on a
> distributed object cache. So you define your objects in Java, POJOs,
> annotate which ones you want indexed, upload your jars, then you can
> execute queries.   It's a different use case than typical OLAP.
> There is some Spark integration, but then you would have the same
> bottlenecks going through Spark.
>
>
> On Fri, Mar 11, 2016 at 5:02 AM, Andrés Ivaldi  wrote:
> > Nice discussion! I've a question about a Web Service with Spark.
> >
> > What could be the problem with using Akka-http as the web service (like
> > Play does), with one SparkContext created, and the queries over akka-http
> > using only that SparkContext instance?
> >
> > Also, about Analytics: we are working on real-time Analytics and, as
> > Hemant said, Spark is not a solution for low-latency queries. What about
> > using Ignite for that?
> >
> >
> > On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat 
> > wrote:
> >>
> >> Spark-jobserver is an elegant product that builds concurrency on top of
> >> Spark. But the current design of the DAGScheduler prevents Spark from
> >> becoming a truly concurrent solution for low-latency queries; the
> >> DAGScheduler will turn out to be a bottleneck for such queries. The
> >> Sparrow project was an effort to make Spark more suitable for these
> >> scenarios, but it never made it into the Spark codebase. If Spark has to
> >> become a highly concurrent solution, scheduling has to be distributed.
> >>
> >> Hemant Bhanawat
> >> www.snappydata.io
> >>
> >> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly  wrote:
> >>>
> >>> great discussion, indeed.
> >>>
> >>> Mark Hamstra and i spoke offline just now.
> >>>
> >>> Below is a quick recap of our discussion on how they've achieved
> >>> acceptable performance from Spark on the user request/response path
> >>> (@mark- feel free to correct/comment).
> >>>
> >>> 1) there is a big difference in request/response latency between
> >>> submitting a full Spark Application (heavy weight) versus having a
> >>> long-running Spark Application (like Spark Job Server) that submits
> >>> lighter-weight Jobs using a shared SparkContext.  mark is obviously
> >>> using the latter - a long-running Spark App.
> >>>
> >>> 2) there are some enhancements to Spark that are required to achieve
> >>> acceptable user request/response times.  some links that Mark provided
> >>> are as follows:
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-11838
> >>> https://github.com/apache/spark/pull/11036
> >>> https://github.com/apache/spark/pull/11403
> >>> https://issues.apache.org/jira/browse/SPARK-13523
> >>> https://issues.apache.org/jira/browse/SPARK-13756
> >>>
> >>> Essentially, a deeper level of caching at the shuffle file layer to
> >>> reduce compute and memory between queries.
> >>>
> >>> Note that Mark is running a slightly-modified version of stock Spark.
> >>> (He's mentioned this in prior posts, as well.)
> >>>
> >>> And I have to say that I'm, personally, seeing more and more
> >>> slightly-modified versions of Spark being deployed to production to
> >>> work around outstanding PRs and JIRAs.
> >>>
> >>> this may not be what people want to hear, but it's a trend that i'm
> >>> seeing lately as more and more teams customize Spark to their specific
> >>> use cases.
> >>>
> >>> Anyway, thanks for the good discussion, everyone!  This is why we have
> >>> these lists, right!  :)
> >>>
> >>>
> >>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan 
> >>> wrote:
> 
>  One of the premises here is that if you can restrict your workload to
>  fewer cores - which is easier with FiloDB and careful data modeling -
>  you can make this work for much higher concurrency and lower latency
>  than most typical Spark use cases.
> 
>  The reason why it typically does not work in production is that most
>  people are using HDFS and files.  These data sources are designed for
>  running queries and workloads on all your cores across many workers,
>  and not for filtering your workload down to only one or two cores.
> 
>  There is actually nothing inherent in Spark that prevents people 

Re: Can we use spark inside a web service?

2016-03-14 Thread Evan Chan
Andres,

A couple points:

1) If you look at my post, you can see that you could use Spark for
low latency - many queries can be executed in under a second, with the
right technology.  It really depends on the definition of "real time",
but I believe low latency is definitely possible.
2) Akka-http over a SparkContext - this is essentially what Spark Job
Server does (it uses Spray, which is the predecessor to akka-http;
we will upgrade once Spark 2.0 is incorporated).  See the sketch after
this list for the basic shape of that setup.
3) Someone else can probably talk about Ignite, but it is based on a
distributed object cache. So you define your objects in Java, POJOs,
annotate which ones you want indexed, upload your jars, then you can
execute queries.   It's a different use case than typical OLAP.
There is some Spark integration, but then you would have the same
bottlenecks going through Spark.
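
For concreteness, a minimal sketch of that setup - one long-running
SparkContext shared by every HTTP request behind an akka-http route. This is
not the Spark Job Server code; it assumes Spark 1.x and akka-http-era APIs,
and the /query endpoint, bind address and result formatting are placeholders:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import scala.concurrent.Future

object SparkHttpService extends App {
  implicit val system = ActorSystem("spark-http")
  implicit val materializer = ActorMaterializer()
  implicit val ec = system.dispatcher

  // One SparkContext (and SQLContext) for the lifetime of the service.
  // local[*] is only for the sketch; a real deployment would set the master
  // through spark-submit or the cluster manager.
  val sc = new SparkContext(
    new SparkConf().setAppName("spark-http-service").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)

  val route =
    path("query") {
      get {
        parameter("sql") { sql =>
          // Run the query off the request thread; results are collected on the driver.
          complete(Future(sqlContext.sql(sql).collect().mkString("\n")))
        }
      }
    }

  Http().bindAndHandle(route, "0.0.0.0", 8080)
}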


On Fri, Mar 11, 2016 at 5:02 AM, Andrés Ivaldi  wrote:
> Nice discussion! I've a question about a Web Service with Spark.
>
> What could be the problem with using Akka-http as the web service (like Play
> does), with one SparkContext created, and the queries over akka-http using
> only that SparkContext instance?
>
> Also, about Analytics: we are working on real-time Analytics and, as Hemant
> said, Spark is not a solution for low-latency queries. What about using
> Ignite for that?
>
>
> On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat 
> wrote:
>>
>> Spark-jobserver is an elegant product that builds concurrency on top of
>> Spark. But the current design of the DAGScheduler prevents Spark from
>> becoming a truly concurrent solution for low-latency queries; the
>> DAGScheduler will turn out to be a bottleneck for such queries. The Sparrow
>> project was an effort to make Spark more suitable for these scenarios, but
>> it never made it into the Spark codebase. If Spark has to become a highly
>> concurrent solution, scheduling has to be distributed.
>>
>> Hemant Bhanawat
>> www.snappydata.io
>>
>> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly  wrote:
>>>
>>> great discussion, indeed.
>>>
>>> Mark Hamstra and i spoke offline just now.
>>>
>>> Below is a quick recap of our discussion on how they've achieved
>>> acceptable performance from Spark on the user request/response path (@mark-
>>> feel free to correct/comment).
>>>
>>> 1) there is a big difference in request/response latency between
>>> submitting a full Spark Application (heavy weight) versus having a
>>> long-running Spark Application (like Spark Job Server) that submits
>>> lighter-weight Jobs using a shared SparkContext.  mark is obviously using
>>> the latter - a long-running Spark App.
>>>
>>> 2) there are some enhancements to Spark that are required to achieve
>>> acceptable user request/response times.  some links that Mark provided are
>>> as follows:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-11838
>>> https://github.com/apache/spark/pull/11036
>>> https://github.com/apache/spark/pull/11403
>>> https://issues.apache.org/jira/browse/SPARK-13523
>>> https://issues.apache.org/jira/browse/SPARK-13756
>>>
>>> Essentially, a deeper level of caching at the shuffle file layer to
>>> reduce compute and memory between queries.
>>>
>>> Note that Mark is running a slightly-modified version of stock Spark.
>>> (He's mentioned this in prior posts, as well.)
>>>
>>> And I have to say that I'm, personally, seeing more and more
>>> slightly-modified versions of Spark being deployed to production to
>>> workaround outstanding PR's and Jiras.
>>>
>>> this may not be what people want to hear, but it's a trend that i'm
>>> seeing lately as more and more customize Spark to their specific use cases.
>>>
>>> Anyway, thanks for the good discussion, everyone!  This is why we have
>>> these lists, right!  :)
>>>
>>>
>>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan 
>>> wrote:

 One of the premises here is that if you can restrict your workload to
 fewer cores - which is easier with FiloDB and careful data modeling -
 you can make this work for much higher concurrency and lower latency
 than most typical Spark use cases.

 The reason why it typically does not work in production is that most
 people are using HDFS and files.  These data sources are designed for
 running queries and workloads on all your cores across many workers,
 and not for filtering your workload down to only one or two cores.

 There is actually nothing inherent in Spark that prevents people from
 using it as an app server.   However, the insistence on using it with
 HDFS is what kills concurrency.   This is why FiloDB is important.

 I agree there are more optimized stacks for running app servers, but
 the choices that you mentioned:  ES is targeted at text search;  Cass
 and HBase by themselves are not fast enough for analytical queries
 that the OP wants;  and MySQL is great but not scalable.   Probably

Re: Can we use spark inside a web service?

2016-03-14 Thread Evan Chan
At least for simple queries, the DAGScheduler does not appear to be
the bottleneck - since we are able to schedule 700 queries per second, and
all of the scheduling is probably done from the main application thread.
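
For anyone who wants to reproduce a rough version of that measurement, a small
harness along these lines works (this is not the FiloDB benchmark itself; the
thread-pool size, table name and query are placeholders):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SQLContext

// Fires many small SQL queries at one shared SQLContext and reports queries/sec.
def measureQps(sqlContext: SQLContext, numQueries: Int = 700, threads: Int = 8): Double = {
  implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(threads))
  val start = System.nanoTime()
  val queries = (1 to numQueries).map { _ =>
    Future { sqlContext.sql("SELECT count(*) FROM events").collect() }  // placeholder query
  }
  Await.result(Future.sequence(queries), 10.minutes)
  numQueries / ((System.nanoTime() - start) / 1e9)
}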

However, I did have high hopes for Sparrow.  What was the reason they
decided not to include that?

On Fri, Mar 11, 2016 at 1:52 AM, Hemant Bhanawat  wrote:
> Spark-jobserver is an elegant product that builds concurrency on top of
> Spark. But the current design of the DAGScheduler prevents Spark from
> becoming a truly concurrent solution for low-latency queries; the
> DAGScheduler will turn out to be a bottleneck for such queries. The Sparrow
> project was an effort to make Spark more suitable for these scenarios, but
> it never made it into the Spark codebase. If Spark has to become a highly
> concurrent solution, scheduling has to be distributed.
>
> Hemant Bhanawat
> www.snappydata.io
>
> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly  wrote:
>>
>> great discussion, indeed.
>>
>> Mark Hamstra and i spoke offline just now.
>>
>> Below is a quick recap of our discussion on how they've achieved
>> acceptable performance from Spark on the user request/response path (@mark-
>> feel free to correct/comment).
>>
>> 1) there is a big difference in request/response latency between
>> submitting a full Spark Application (heavy weight) versus having a
>> long-running Spark Application (like Spark Job Server) that submits
>> lighter-weight Jobs using a shared SparkContext.  mark is obviously using
>> the latter - a long-running Spark App.
>>
>> 2) there are some enhancements to Spark that are required to achieve
>> acceptable user request/response times.  some links that Mark provided are
>> as follows:
>>
>> https://issues.apache.org/jira/browse/SPARK-11838
>> https://github.com/apache/spark/pull/11036
>> https://github.com/apache/spark/pull/11403
>> https://issues.apache.org/jira/browse/SPARK-13523
>> https://issues.apache.org/jira/browse/SPARK-13756
>>
>> Essentially, a deeper level of caching at the shuffle file layer to reduce
>> compute and memory between queries.
>>
>> Note that Mark is running a slightly-modified version of stock Spark.
>> (He's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm, personally, seeing more and more
>> slightly-modified versions of Spark being deployed to production to
>> workaround outstanding PR's and Jiras.
>>
>> this may not be what people want to hear, but it's a trend that i'm seeing
>> lately as more and more customize Spark to their specific use cases.
>>
>> Anyway, thanks for the good discussion, everyone!  This is why we have
>> these lists, right!  :)
>>
>>
>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan 
>> wrote:
>>>
>>> One of the premises here is that if you can restrict your workload to
>>> fewer cores - which is easier with FiloDB and careful data modeling -
>>> you can make this work for much higher concurrency and lower latency
>>> than most typical Spark use cases.
>>>
>>> The reason why it typically does not work in production is that most
>>> people are using HDFS and files.  These data sources are designed for
>>> running queries and workloads on all your cores across many workers,
>>> and not for filtering your workload down to only one or two cores.
>>>
>>> There is actually nothing inherent in Spark that prevents people from
>>> using it as an app server.   However, the insistence on using it with
>>> HDFS is what kills concurrency.   This is why FiloDB is important.
>>>
>>> I agree there are more optimized stacks for running app servers, but
>>> the choices that you mentioned:  ES is targeted at text search;  Cass
>>> and HBase by themselves are not fast enough for analytical queries
>>> that the OP wants;  and MySQL is great but not scalable.   Probably
>>> something like VectorWise, HANA, Vertica would work well, but those
>>> are mostly not free solutions.   Druid could work too if the use case
>>> is right.
>>>
>>> Anyways, great discussion!
>>>
>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly  wrote:
>>> > you are correct, mark.  i misspoke.  apologies for the confusion.
>>> >
>>> > so the problem is even worse given that a typical job requires multiple
>>> > tasks/cores.
>>> >
>>> > i have yet to see this particular architecture work in production.  i
>>> > would
>>> > love for someone to prove otherwise.
>>> >
>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra 
>>> > wrote:
>>> >>>
>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>> >>> requests,
>>> >>> this is 1000 concurrent Spark jobs.  This would require a cluster
>>> >>> with 1000
>>> >>> cores.
>>> >>
>>> >>
>>> >> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
>>> >> without any 1:1 correspondence between Worker cores and Jobs.  Cores
>>> >> are
>>> >> used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at

Re: Can we use spark inside a web service?

2016-03-11 Thread Andrés Ivaldi
Nice discussion! I've a question about a Web Service with Spark.

What could be the problem with using Akka-http as the web service (like Play
does), with one SparkContext created, and the queries over akka-http using
only that SparkContext instance?

Also, about Analytics: we are working on real-time Analytics and, as Hemant
said, Spark is not a solution for low-latency queries. What about using
Ignite for that?


On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat 
wrote:

> Spark-jobserver is an elegant product that builds concurrency on top of
> Spark. But the current design of the DAGScheduler prevents Spark from
> becoming a truly concurrent solution for low-latency queries; the
> DAGScheduler will turn out to be a bottleneck for such queries. The Sparrow
> project was an effort to make Spark more suitable for these scenarios, but
> it never made it into the Spark codebase. If Spark has to become a highly
> concurrent solution, scheduling has to be distributed.
>
> Hemant Bhanawat 
> www.snappydata.io
>
> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly  wrote:
>
>> great discussion, indeed.
>>
>> Mark Hamstra and i spoke offline just now.
>>
>> Below is a quick recap of our discussion on how they've achieved
>> acceptable performance from Spark on the user request/response path (@mark-
>> feel free to correct/comment).
>>
>> 1) there is a big difference in request/response latency between
>> submitting a full Spark Application (heavy weight) versus having a
>> long-running Spark Application (like Spark Job Server) that submits
>> lighter-weight Jobs using a shared SparkContext.  mark is obviously using
>> the latter - a long-running Spark App.
>>
>> 2) there are some enhancements to Spark that are required to achieve
>> acceptable user request/response times.  some links that Mark provided are
>> as follows:
>>
>>- https://issues.apache.org/jira/browse/SPARK-11838
>>- https://github.com/apache/spark/pull/11036
>>- https://github.com/apache/spark/pull/11403
>>- https://issues.apache.org/jira/browse/SPARK-13523
>>- https://issues.apache.org/jira/browse/SPARK-13756
>>
>> Essentially, a deeper level of caching at the shuffle file layer to
>> reduce compute and memory between queries.
>>
>> Note that Mark is running a slightly-modified version of stock Spark.
>>  (He's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm, personally, seeing more and more
>> slightly-modified versions of Spark being deployed to production to
>> workaround outstanding PR's and Jiras.
>>
>> this may not be what people want to hear, but it's a trend that i'm
>> seeing lately as more and more customize Spark to their specific use cases.
>>
>> Anyway, thanks for the good discussion, everyone!  This is why we have
>> these lists, right!  :)
>>
>>
>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan 
>> wrote:
>>
>>> One of the premises here is that if you can restrict your workload to
>>> fewer cores - which is easier with FiloDB and careful data modeling -
>>> you can make this work for much higher concurrency and lower latency
>>> than most typical Spark use cases.
>>>
>>> The reason why it typically does not work in production is that most
>>> people are using HDFS and files.  These data sources are designed for
>>> running queries and workloads on all your cores across many workers,
>>> and not for filtering your workload down to only one or two cores.
>>>
>>> There is actually nothing inherent in Spark that prevents people from
>>> using it as an app server.   However, the insistence on using it with
>>> HDFS is what kills concurrency.   This is why FiloDB is important.
>>>
>>> I agree there are more optimized stacks for running app servers, but
>>> the choices that you mentioned:  ES is targeted at text search;  Cass
>>> and HBase by themselves are not fast enough for analytical queries
>>> that the OP wants;  and MySQL is great but not scalable.   Probably
>>> something like VectorWise, HANA, Vertica would work well, but those
>>> are mostly not free solutions.   Druid could work too if the use case
>>> is right.
>>>
>>> Anyways, great discussion!
>>>
>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly  wrote:
>>> > you are correct, mark.  i misspoke.  apologies for the confusion.
>>> >
>>> > so the problem is even worse given that a typical job requires multiple
>>> > tasks/cores.
>>> >
>>> > i have yet to see this particular architecture work in production.  i
>>> would
>>> > love for someone to prove otherwise.
>>> >
>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra >> >
>>> > wrote:
>>> >>>
>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>> requests,
>>> >>> this is 1000 concurrent Spark jobs.  This would require a cluster
>>> with 1000
>>> >>> cores.
>>> >>
>>> >>
>>> >> This doesn't make sense.  A Spark Job is a 

Re: Can we use spark inside a web service?

2016-03-11 Thread Hemant Bhanawat
Spark-jobserver is an elegant product that builds concurrency on top of
Spark. But the current design of the DAGScheduler prevents Spark from
becoming a truly concurrent solution for low-latency queries; the
DAGScheduler will turn out to be a bottleneck for such queries. The Sparrow
project was an effort to make Spark more suitable for these scenarios, but
it never made it into the Spark codebase. If Spark has to become a highly
concurrent solution, scheduling has to be distributed.
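
For what it's worth, the knob that exists inside a single driver today is the
FAIR scheduler with per-thread pools - it interleaves concurrent jobs, but it
does not remove the single DAGScheduler, so it is not the distributed
scheduling described above. A sketch (the pool name and query are
placeholders; the fairscheduler.xml pool configuration is omitted):

import org.apache.spark.sql.SQLContext

// Call from the request-handling thread so the jobs this query spawns land in
// a dedicated FAIR pool instead of the default one.
def runInPool(sqlContext: SQLContext, pool: String, query: String) = {
  sqlContext.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
  try sqlContext.sql(query).collect()
  finally sqlContext.sparkContext.setLocalProperty("spark.scheduler.pool", null)
}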

Hemant Bhanawat 
www.snappydata.io

On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly  wrote:

> great discussion, indeed.
>
> Mark Hamstra and i spoke offline just now.
>
> Below is a quick recap of our discussion on how they've achieved
> acceptable performance from Spark on the user request/response path (@mark-
> feel free to correct/comment).
>
> 1) there is a big difference in request/response latency between
> submitting a full Spark Application (heavy weight) versus having a
> long-running Spark Application (like Spark Job Server) that submits
> lighter-weight Jobs using a shared SparkContext.  mark is obviously using
> the latter - a long-running Spark App.
>
> 2) there are some enhancements to Spark that are required to achieve
> acceptable user request/response times.  some links that Mark provided are
> as follows:
>
>- https://issues.apache.org/jira/browse/SPARK-11838
>- https://github.com/apache/spark/pull/11036
>- https://github.com/apache/spark/pull/11403
>- https://issues.apache.org/jira/browse/SPARK-13523
>- https://issues.apache.org/jira/browse/SPARK-13756
>
> Essentially, a deeper level of caching at the shuffle file layer to reduce
> compute and memory between queries.
>
> Note that Mark is running a slightly-modified version of stock Spark.
>  (He's mentioned this in prior posts, as well.)
>
> And I have to say that I'm, personally, seeing more and more
> slightly-modified versions of Spark being deployed to production to
> workaround outstanding PR's and Jiras.
>
> this may not be what people want to hear, but it's a trend that i'm seeing
> lately as more and more customize Spark to their specific use cases.
>
> Anyway, thanks for the good discussion, everyone!  This is why we have
> these lists, right!  :)
>
>
> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan 
> wrote:
>
>> One of the premises here is that if you can restrict your workload to
>> fewer cores - which is easier with FiloDB and careful data modeling -
>> you can make this work for much higher concurrency and lower latency
>> than most typical Spark use cases.
>>
>> The reason why it typically does not work in production is that most
>> people are using HDFS and files.  These data sources are designed for
>> running queries and workloads on all your cores across many workers,
>> and not for filtering your workload down to only one or two cores.
>>
>> There is actually nothing inherent in Spark that prevents people from
>> using it as an app server.   However, the insistence on using it with
>> HDFS is what kills concurrency.   This is why FiloDB is important.
>>
>> I agree there are more optimized stacks for running app servers, but
>> the choices that you mentioned:  ES is targeted at text search;  Cass
>> and HBase by themselves are not fast enough for analytical queries
>> that the OP wants;  and MySQL is great but not scalable.   Probably
>> something like VectorWise, HANA, Vertica would work well, but those
>> are mostly not free solutions.   Druid could work too if the use case
>> is right.
>>
>> Anyways, great discussion!
>>
>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly  wrote:
>> > you are correct, mark.  i misspoke.  apologies for the confusion.
>> >
>> > so the problem is even worse given that a typical job requires multiple
>> > tasks/cores.
>> >
>> > i have yet to see this particular architecture work in production.  i
>> would
>> > love for someone to prove otherwise.
>> >
>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra 
>> > wrote:
>> >>>
>> >>> For example, if you're looking to scale out to 1000 concurrent
>> requests,
>> >>> this is 1000 concurrent Spark jobs.  This would require a cluster
>> with 1000
>> >>> cores.
>> >>
>> >>
>> >> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
>> >> without any 1:1 correspondence between Worker cores and Jobs.  Cores
>> are
>> >> used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at
>> most
>> >> 1000 simultaneous Tasks, but that doesn't really tell you anything
>> about how
>> >> many Jobs are or can be concurrently tracked by the DAGScheduler,
>> which will
>> >> be apportioning the Tasks from those concurrent Jobs across the
>> available
>> >> Executor cores.
>> >>
>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly 
>> wrote:
>> >>>
>> >>> Good stuff, Evan.  Looks like this is utilizing 

Re: Can we use spark inside a web service?

2016-03-10 Thread Nick Pentreath
Yes, really interesting discussion.

It would be really interesting to compare the performance of alternative
architectures. Specifically, I've found that Elasticsearch is a great
option for analytic workloads - it doesn't support SQL (joins in
particular), but its aggregation and arbitrary filtering capabilities make
it very powerful, plus it does play really nicely with Spark, so Spark SQL
could be layered on top (though I've only done this for offline batch jobs,
not real-time user facing queries).

It also lends itself potentially very nicely to a "lambda-style"
architecture, i.e. querying across historical aggregated data and the
"real-time" component (current day, or hour, or whatever) at the same time,
with careful data modelling.
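
As a rough illustration of that layering - assuming the elasticsearch-spark
connector (elasticsearch-hadoop) is on the classpath; the node address, index
name and query below are placeholders:

import org.apache.spark.sql.SQLContext

def aggregateFromEs(sqlContext: SQLContext): Unit = {
  // Read an ES index as a DataFrame through the connector's data source.
  val events = sqlContext.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")
    .load("events/log")               // "index/type" in the ES 1.x/2.x layout

  events.registerTempTable("events")
  // Spark SQL supplies what ES itself does not express as SQL (joins, ad-hoc SQL).
  sqlContext.sql("SELECT status, count(*) AS c FROM events GROUP BY status").show()
}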

On Fri, 11 Mar 2016 at 06:25 Tristan Nixon  wrote:

> Hear, hear. That’s why I’m here :)
>
> On Mar 10, 2016, at 7:32 PM, Chris Fregly  wrote:
>
> Anyway, thanks for the good discussion, everyone!  This is why we have
> these lists, right!  :)
>
>
>


Re: Can we use spark inside a web service?

2016-03-10 Thread Tristan Nixon
Hear, hear. That’s why I’m here :)

> On Mar 10, 2016, at 7:32 PM, Chris Fregly  wrote:
> 
> Anyway, thanks for the good discussion, everyone!  This is why we have these 
> lists, right!  :)



Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
great discussion, indeed.

Mark Hamstra and i spoke offline just now.

Below is a quick recap of our discussion on how they've achieved acceptable
performance from Spark on the user request/response path (@mark- feel free
to correct/comment).

1) there is a big difference in request/response latency between submitting
a full Spark Application (heavy weight) versus having a long-running Spark
Application (like Spark Job Server) that submits lighter-weight Jobs using
a shared SparkContext.  mark is obviously using the latter - a long-running
Spark App.

2) there are some enhancements to Spark that are required to achieve
acceptable user request/response times.  some links that Mark provided are
as follows:

   - https://issues.apache.org/jira/browse/SPARK-11838
   - https://github.com/apache/spark/pull/11036
   - https://github.com/apache/spark/pull/11403
   - https://issues.apache.org/jira/browse/SPARK-13523
   - https://issues.apache.org/jira/browse/SPARK-13756

Essentially, these add a deeper level of caching at the shuffle file layer to
reduce compute and memory use between queries.

Note that Mark is running a slightly-modified version of stock Spark.
 (He's mentioned this in prior posts, as well.)

And I have to say that I'm, personally, seeing more and more
slightly-modified versions of Spark being deployed to production to
work around outstanding PRs and JIRAs.

this may not be what people want to hear, but it's a trend that i'm seeing
lately as more and more teams customize Spark to their specific use cases.

Anyway, thanks for the good discussion, everyone!  This is why we have
these lists, right!  :)


On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan  wrote:

> One of the premises here is that if you can restrict your workload to
> fewer cores - which is easier with FiloDB and careful data modeling -
> you can make this work for much higher concurrency and lower latency
> than most typical Spark use cases.
>
> The reason why it typically does not work in production is that most
> people are using HDFS and files.  These data sources are designed for
> running queries and workloads on all your cores across many workers,
> and not for filtering your workload down to only one or two cores.
>
> There is actually nothing inherent in Spark that prevents people from
> using it as an app server.   However, the insistence on using it with
> HDFS is what kills concurrency.   This is why FiloDB is important.
>
> I agree there are more optimized stacks for running app servers, but
> the choices that you mentioned:  ES is targeted at text search;  Cass
> and HBase by themselves are not fast enough for analytical queries
> that the OP wants;  and MySQL is great but not scalable.   Probably
> something like VectorWise, HANA, Vertica would work well, but those
> are mostly not free solutions.   Druid could work too if the use case
> is right.
>
> Anyways, great discussion!
>
> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly  wrote:
> > you are correct, mark.  i misspoke.  apologies for the confusion.
> >
> > so the problem is even worse given that a typical job requires multiple
> > tasks/cores.
> >
> > i have yet to see this particular architecture work in production.  i
> would
> > love for someone to prove otherwise.
> >
> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra 
> > wrote:
> >>>
> >>> For example, if you're looking to scale out to 1000 concurrent
> requests,
> >>> this is 1000 concurrent Spark jobs.  This would require a cluster with
> 1000
> >>> cores.
> >>
> >>
> >> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
> >> without any 1:1 correspondence between Worker cores and Jobs.  Cores are
> >> used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at
> most
> >> 1000 simultaneous Tasks, but that doesn't really tell you anything
> about how
> >> many Jobs are or can be concurrently tracked by the DAGScheduler, which
> will
> >> be apportioning the Tasks from those concurrent Jobs across the
> available
> >> Executor cores.
> >>
> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly  wrote:
> >>>
> >>> Good stuff, Evan.  Looks like this is utilizing the in-memory
> >>> capabilities of FiloDB which is pretty cool.  looking forward to the
> webcast
> >>> as I don't know much about FiloDB.
> >>>
> >>> My personal thoughts here are to removed Spark from the user
> >>> request/response hot path.
> >>>
> >>> I can't tell you how many times i've had to unroll that architecture at
> >>> clients - and replace with a real database like Cassandra,
> ElasticSearch,
> >>> HBase, MySql.
> >>>
> >>> Unfortunately, Spark - and Spark Streaming, especially - lead you to
> >>> believe that Spark could be used as an application server.  This is
> not a
> >>> good use case for Spark.
> >>>
> >>> Remember that every job that is launched by Spark requires 1 CPU core,
> >>> some memory, and an available Executor JVM to provide the CPU 

Re: Can we use spark inside a web service?

2016-03-10 Thread Evan Chan
One of the premises here is that if you can restrict your workload to
fewer cores - which is easier with FiloDB and careful data modeling -
you can make this work for much higher concurrency and lower latency
than most typical Spark use cases.

The reason why it typically does not work in production is that most
people are using HDFS and files.  These data sources are designed for
running queries and workloads on all your cores across many workers,
and not for filtering your workload down to only one or two cores.

There is actually nothing inherent in Spark that prevents people from
using it as an app server.   However, the insistence on using it with
HDFS is what kills concurrency.   This is why FiloDB is important.
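
A minimal configuration sketch of the "few cores per query" idea - the numbers
are arbitrary and would need tuning, and spark.cores.max applies to
standalone/Mesos deployments:

import org.apache.spark.{SparkConf, SparkContext}

object LowLatencyContext {
  val conf = new SparkConf()
    .setAppName("low-latency-queries")
    .set("spark.cores.max", "8")              // cap the total executor cores the app grabs
    .set("spark.scheduler.mode", "FAIR")      // let many small concurrent jobs interleave
    .set("spark.sql.shuffle.partitions", "4") // small queries do not need 200 shuffle tasks

  // The master is expected to come from spark-submit / the cluster manager.
  lazy val sc = new SparkContext(conf)
}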

I agree there are more optimized stacks for running app servers, but
the choices that you mentioned:  ES is targeted at text search;  Cass
and HBase by themselves are not fast enough for analytical queries
that the OP wants;  and MySQL is great but not scalable.   Probably
something like VectorWise, HANA, Vertica would work well, but those
are mostly not free solutions.   Druid could work too if the use case
is right.

Anyways, great discussion!

On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly  wrote:
> you are correct, mark.  i misspoke.  apologies for the confusion.
>
> so the problem is even worse given that a typical job requires multiple
> tasks/cores.
>
> i have yet to see this particular architecture work in production.  i would
> love for someone to prove otherwise.
>
> On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra 
> wrote:
>>>
>>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>>> cores.
>>
>>
>> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
>> without any 1:1 correspondence between Worker cores and Jobs.  Cores are
>> used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at most
>> 1000 simultaneous Tasks, but that doesn't really tell you anything about how
>> many Jobs are or can be concurrently tracked by the DAGScheduler, which will
>> be apportioning the Tasks from those concurrent Jobs across the available
>> Executor cores.
>>
>> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly  wrote:
>>>
>>> Good stuff, Evan.  Looks like this is utilizing the in-memory
>>> capabilities of FiloDB which is pretty cool.  looking forward to the webcast
>>> as I don't know much about FiloDB.
>>>
>>> My personal thoughts here are to removed Spark from the user
>>> request/response hot path.
>>>
>>> I can't tell you how many times i've had to unroll that architecture at
>>> clients - and replace with a real database like Cassandra, ElasticSearch,
>>> HBase, MySql.
>>>
>>> Unfortunately, Spark - and Spark Streaming, especially - lead you to
>>> believe that Spark could be used as an application server.  This is not a
>>> good use case for Spark.
>>>
>>> Remember that every job that is launched by Spark requires 1 CPU core,
>>> some memory, and an available Executor JVM to provide the CPU and memory.
>>>
>>> Yes, you can horizontally scale this because of the distributed nature of
>>> Spark, however it is not an efficient scaling strategy.
>>>
>>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>>> cores.  this is just not cost effective.
>>>
>>> Use Spark for what it's good for - ad-hoc, interactive, and iterative
>>> (machine learning, graph) analytics.  Use an application server for what
>>> it's good - managing a large amount of concurrent requests.  And use a
>>> database for what it's good for - storing/retrieving data.
>>>
>>> And any serious production deployment will need failover, throttling,
>>> back pressure, auto-scaling, and service discovery.
>>>
>>> While Spark supports these to varying levels of production-readiness,
>>> Spark is a batch-oriented system and not meant to be put on the user
>>> request/response hot path.
>>>
>>> For the failover, throttling, back pressure, autoscaling that i mentioned
>>> above, it's worth checking out the suite of Netflix OSS - particularly
>>> Hystrix, Eureka, Zuul, Karyon, etc:  http://netflix.github.io/
>>>
>>> Here's my github project that incorporates a lot of these:
>>> https://github.com/cfregly/fluxcapacitor
>>>
>>> Here's a netflix Skunkworks github project that packages these up in
>>> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>>>
>>>
>>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github 
>>> wrote:

 Hi,

 I just wrote a blog post which might be really useful to you -- I have
 just
 benchmarked being able to achieve 700 queries per second in Spark.  So,
 yes,
 web speed SQL queries are definitely possible.   Read my new blog post:

 

Re: Can we use spark inside a web service?

2016-03-10 Thread Teng Qiu
This really depends on how you define "hot" :) and on the use cases; Spark is
definitely not one-size-fits-all, at least not yet - especially for heavy
joins and full scans.

Maybe Spark alone fits your production workload and analytical requirements,
but in general I agree with Chris: for high-concurrency, multi-tenant
scenarios there are many existing better solutions.

On Thursday, March 10, 2016, Mark Hamstra wrote:
> The fact that a typical Job requires multiple Tasks is not a problem, but
rather an opportunity for the Scheduler to interleave the workloads of
multiple concurrent Jobs across the available cores.
> I work every day with such a production architecture with Spark on the
user request/response hot path.
> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly  wrote:
>>
>> you are correct, mark.  i misspoke.  apologies for the confusion.
>> so the problem is even worse given that a typical job requires multiple
tasks/cores.
>> i have yet to see this particular architecture work in production.  i
would love for someone to prove otherwise.
>> On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra 
wrote:

 For example, if you're looking to scale out to 1000 concurrent
requests, this is 1000 concurrent Spark jobs.  This would require a cluster
with 1000 cores.
>>>
>>> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
without any 1:1 correspondence between Worker cores and Jobs.  Cores are
used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at most
1000 simultaneous Tasks, but that doesn't really tell you anything about
how many Jobs are or can be concurrently tracked by the DAGScheduler, which
will be apportioning the Tasks from those concurrent Jobs across the
available Executor cores.
>>> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly  wrote:

 Good stuff, Evan.  Looks like this is utilizing the in-memory
capabilities of FiloDB which is pretty cool.  looking forward to the
webcast as I don't know much about FiloDB.
 My personal thoughts here are to removed Spark from the user
request/response hot path.
 I can't tell you how many times i've had to unroll that architecture
at clients - and replace with a real database like Cassandra,
ElasticSearch, HBase, MySql.
 Unfortunately, Spark - and Spark Streaming, especially - lead you to
believe that Spark could be used as an application server.  This is not a
good use case for Spark.
 Remember that every job that is launched by Spark requires 1 CPU core,
some memory, and an available Executor JVM to provide the CPU and memory.
 Yes, you can horizontally scale this because of the distributed nature
of Spark, however it is not an efficient scaling strategy.
 For example, if you're looking to scale out to 1000 concurrent
requests, this is 1000 concurrent Spark jobs.  This would require a cluster
with 1000 cores.  this is just not cost effective.
 Use Spark for what it's good for - ad-hoc, interactive, and iterative
(machine learning, graph) analytics.  Use an application server for what
it's good - managing a large amount of concurrent requests.  And use a
database for what it's good for - storing/retrieving data.
 And any serious production deployment will need failover, throttling,
back pressure, auto-scaling, and service discovery.
 While Spark supports these to varying levels of production-readiness,
Spark is a batch-oriented system and not meant to be put on the user
request/response hot path.
 For the failover, throttling, back pressure, autoscaling that i
mentioned above, it's worth checking out the suite of Netflix OSS -
particularly Hystrix, Eureka, Zuul, Karyon, etc:  http://netflix.github.io/
 Here's my github project that incorporates a lot of these:
https://github.com/cfregly/fluxcapacitor
 Here's a netflix Skunkworks github project that packages these up in
Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker

 On Thu, Mar 10, 2016 at 1:40 PM, velvia.github 
wrote:
>
> Hi,
>
> I just wrote a blog post which might be really useful to you -- I
have just
> benchmarked being able to achieve 700 queries per second in Spark.
So, yes,
> web speed SQL queries are definitely possible.   Read my new blog
post:
>
> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>
> and feel free to email me (at vel...@gmail.com) if you would like to
follow
> up.
>
> -Evan
>
>
>
>
> --
> View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
> Sent from the Apache Spark User List mailing list archive at
Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: 

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
The fact that a typical Job requires multiple Tasks is not a problem, but
rather an opportunity for the Scheduler to interleave the workloads of
multiple concurrent Jobs across the available cores.

I work every day with such a production architecture with Spark on the user
request/response hot path.

On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly  wrote:

> you are correct, mark.  i misspoke.  apologies for the confusion.
>
> so the problem is even worse given that a typical job requires multiple
> tasks/cores.
>
> i have yet to see this particular architecture work in production.  i
> would love for someone to prove otherwise.
>
> On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra 
> wrote:
>
>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>>> cores.
>>
>>
>> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
>> without any 1:1 correspondence between Worker cores and Jobs.  Cores are
>> used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at most
>> 1000 simultaneous Tasks, but that doesn't really tell you anything about
>> how many Jobs are or can be concurrently tracked by the DAGScheduler, which
>> will be apportioning the Tasks from those concurrent Jobs across the
>> available Executor cores.
>>
>> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly  wrote:
>>
>>> Good stuff, Evan.  Looks like this is utilizing the in-memory
>>> capabilities of FiloDB which is pretty cool.  looking forward to the
>>> webcast as I don't know much about FiloDB.
>>>
>>> My personal thoughts here are to removed Spark from the user
>>> request/response hot path.
>>>
>>> I can't tell you how many times i've had to unroll that architecture at
>>> clients - and replace with a real database like Cassandra, ElasticSearch,
>>> HBase, MySql.
>>>
>>> Unfortunately, Spark - and Spark Streaming, especially - lead you to
>>> believe that Spark could be used as an application server.  This is not a
>>> good use case for Spark.
>>>
>>> Remember that every job that is launched by Spark requires 1 CPU core,
>>> some memory, and an available Executor JVM to provide the CPU and memory.
>>>
>>> Yes, you can horizontally scale this because of the distributed nature
>>> of Spark, however it is not an efficient scaling strategy.
>>>
>>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>>> cores.  this is just not cost effective.
>>>
>>> Use Spark for what it's good for - ad-hoc, interactive, and iterative
>>> (machine learning, graph) analytics.  Use an application server for what
>>> it's good - managing a large amount of concurrent requests.  And use a
>>> database for what it's good for - storing/retrieving data.
>>>
>>> And any serious production deployment will need failover, throttling,
>>> back pressure, auto-scaling, and service discovery.
>>>
>>> While Spark supports these to varying levels of production-readiness,
>>> Spark is a batch-oriented system and not meant to be put on the user
>>> request/response hot path.
>>>
>>> For the failover, throttling, back pressure, autoscaling that i
>>> mentioned above, it's worth checking out the suite of Netflix OSS -
>>> particularly Hystrix, Eureka, Zuul, Karyon, etc:
>>> http://netflix.github.io/
>>>
>>> Here's my github project that incorporates a lot of these:
>>> https://github.com/cfregly/fluxcapacitor
>>>
>>> Here's a netflix Skunkworks github project that packages these up in
>>> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>>>
>>>
>>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github 
>>> wrote:
>>>
 Hi,

 I just wrote a blog post which might be really useful to you -- I have
 just
 benchmarked being able to achieve 700 queries per second in Spark.  So,
 yes,
 web speed SQL queries are definitely possible.   Read my new blog post:

 http://velvia.github.io/Spark-Concurrent-Fast-Queries/

 and feel free to email me (at vel...@gmail.com) if you would like to
 follow
 up.

 -Evan




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>>
>>> --
>>>
>>> *Chris Fregly*
>>> Principal Data Solutions Engineer
>>> IBM Spark Technology Center, San Francisco, CA
>>> http://spark.tc | http://advancedspark.com
>>>
>>
>>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM 

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
you are correct, mark.  i misspoke.  apologies for the confusion.

so the problem is even worse given that a typical job requires multiple
tasks/cores.

i have yet to see this particular architecture work in production.  i would
love for someone to prove otherwise.

On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra 
wrote:

> For example, if you're looking to scale out to 1000 concurrent requests,
>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>> cores.
>
>
> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
> without any 1:1 correspondence between Worker cores and Jobs.  Cores are
> used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at most
> 1000 simultaneous Tasks, but that doesn't really tell you anything about
> how many Jobs are or can be concurrently tracked by the DAGScheduler, which
> will be apportioning the Tasks from those concurrent Jobs across the
> available Executor cores.
>
> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly  wrote:
>
>> Good stuff, Evan.  Looks like this is utilizing the in-memory
>> capabilities of FiloDB which is pretty cool.  looking forward to the
>> webcast as I don't know much about FiloDB.
>>
>> My personal thoughts here are to removed Spark from the user
>> request/response hot path.
>>
>> I can't tell you how many times i've had to unroll that architecture at
>> clients - and replace with a real database like Cassandra, ElasticSearch,
>> HBase, MySql.
>>
>> Unfortunately, Spark - and Spark Streaming, especially - lead you to
>> believe that Spark could be used as an application server.  This is not a
>> good use case for Spark.
>>
>> Remember that every job that is launched by Spark requires 1 CPU core,
>> some memory, and an available Executor JVM to provide the CPU and memory.
>>
>> Yes, you can horizontally scale this because of the distributed nature of
>> Spark, however it is not an efficient scaling strategy.
>>
>> For example, if you're looking to scale out to 1000 concurrent requests,
>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>> cores.  this is just not cost effective.
>>
>> Use Spark for what it's good for - ad-hoc, interactive, and iterative
>> (machine learning, graph) analytics.  Use an application server for what
>> it's good - managing a large amount of concurrent requests.  And use a
>> database for what it's good for - storing/retrieving data.
>>
>> And any serious production deployment will need failover, throttling,
>> back pressure, auto-scaling, and service discovery.
>>
>> While Spark supports these to varying levels of production-readiness,
>> Spark is a batch-oriented system and not meant to be put on the user
>> request/response hot path.
>>
>> For the failover, throttling, back pressure, autoscaling that i mentioned
>> above, it's worth checking out the suite of Netflix OSS - particularly
>> Hystrix, Eureka, Zuul, Karyon, etc:  http://netflix.github.io/
>>
>> Here's my github project that incorporates a lot of these:
>> https://github.com/cfregly/fluxcapacitor
>>
>> Here's a netflix Skunkworks github project that packages these up in
>> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>>
>>
>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github 
>> wrote:
>>
>>> Hi,
>>>
>>> I just wrote a blog post which might be really useful to you -- I have
>>> just
>>> benchmarked being able to achieve 700 queries per second in Spark.  So,
>>> yes,
>>> web speed SQL queries are definitely possible.   Read my new blog post:
>>>
>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>>
>>> and feel free to email me (at vel...@gmail.com) if you would like to
>>> follow
>>> up.
>>>
>>> -Evan
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>> --
>>
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
>>
>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
>
> For example, if you're looking to scale out to 1000 concurrent requests,
> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
> cores.


This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
without any 1:1 correspondence between Worker cores and Jobs.  Cores are
used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at most
1000 simultaneous Tasks, but that doesn't really tell you anything about
how many Jobs are or can be concurrently tracked by the DAGScheduler, which
will be apportioning the Tasks from those concurrent Jobs across the
available Executor cores.
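
To make that concrete, a small sketch of concurrent Jobs sharing one
SparkContext (the thread-pool size and job bodies are only illustrative; FAIR
mode keeps small jobs from queueing behind large ones):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}

object ConcurrentJobs extends App {
  // local[4] is only for the sketch; the point is cores << concurrent Jobs.
  val sc = new SparkContext(new SparkConf()
    .setAppName("concurrent-jobs")
    .setMaster("local[4]")
    .set("spark.scheduler.mode", "FAIR"))

  implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))

  // 32 concurrent Jobs; each becomes a handful of Tasks that the DAGScheduler
  // apportions across however many executor cores are actually available.
  val jobs = (1 to 32).map { i =>
    Future { sc.parallelize(1 to 1000, numSlices = 2).map(_ * i).sum() }
  }

  println(Await.result(Future.sequence(jobs), 5.minutes).sum)
  sc.stop()
}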

On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly  wrote:

> Good stuff, Evan.  Looks like this is utilizing the in-memory capabilities
> of FiloDB which is pretty cool.  looking forward to the webcast as I don't
> know much about FiloDB.
>
> My personal thoughts here are to removed Spark from the user
> request/response hot path.
>
> I can't tell you how many times i've had to unroll that architecture at
> clients - and replace with a real database like Cassandra, ElasticSearch,
> HBase, MySql.
>
> Unfortunately, Spark - and Spark Streaming, especially - lead you to
> believe that Spark could be used as an application server.  This is not a
> good use case for Spark.
>
> Remember that every job that is launched by Spark requires 1 CPU core,
> some memory, and an available Executor JVM to provide the CPU and memory.
>
> Yes, you can horizontally scale this because of the distributed nature of
> Spark, however it is not an efficient scaling strategy.
>
> For example, if you're looking to scale out to 1000 concurrent requests,
> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
> cores.  this is just not cost effective.
>
> Use Spark for what it's good for - ad-hoc, interactive, and iterative
> (machine learning, graph) analytics.  Use an application server for what
> it's good - managing a large amount of concurrent requests.  And use a
> database for what it's good for - storing/retrieving data.
>
> And any serious production deployment will need failover, throttling, back
> pressure, auto-scaling, and service discovery.
>
> While Spark supports these to varying levels of production-readiness,
> Spark is a batch-oriented system and not meant to be put on the user
> request/response hot path.
>
> For the failover, throttling, back pressure, autoscaling that i mentioned
> above, it's worth checking out the suite of Netflix OSS - particularly
> Hystrix, Eureka, Zuul, Karyon, etc:  http://netflix.github.io/
>
> Here's my github project that incorporates a lot of these:
> https://github.com/cfregly/fluxcapacitor
>
> Here's a netflix Skunkworks github project that packages these up in
> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>
>
> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github 
> wrote:
>
>> Hi,
>>
>> I just wrote a blog post which might be really useful to you -- I have
>> just
>> benchmarked being able to achieve 700 queries per second in Spark.  So,
>> yes,
>> web speed SQL queries are definitely possible.   Read my new blog post:
>>
>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>
>> and feel free to email me (at vel...@gmail.com) if you would like to
>> follow
>> up.
>>
>> -Evan
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>


Re: Can we use spark inside a web service?

2016-03-10 Thread Tristan Nixon
Very interested, Evan, thanks for the link. It has given me some food for 
thought.

I’m also in the process of building a web application which leverages Spark on
the back-end for some heavy lifting, and I would be curious about your thoughts
on my proposed architecture: I was planning on running a spark-streaming app
which listens for incoming messages on a dedicated queue and returns results on
a separate one. The RESTful web service would handle incoming requests by
putting an appropriate message on the input queue, then listen for a response
on the output queue and transform the output message into an appropriate HTTP
response. How do you think this will fare vs. interacting with the Spark job
service? I was hoping that I could minimize the time to launch Spark jobs by
keeping a streaming app running in the background.
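
A toy version of that pattern, with Spark Streaming's queueStream standing in
for the real message broker (Kafka, RabbitMQ, ...); publishResponse is a
hypothetical hook for the output queue:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueBackedService extends App {
  // local[2] is only for the sketch.
  val conf = new SparkConf().setAppName("queue-backed-service").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1))

  // In production this queue would be replaced by a receiver for the real input queue.
  val requestQueue = new mutable.Queue[RDD[String]]()
  val requests = ssc.queueStream(requestQueue)

  def publishResponse(payload: String, result: String): Unit = ()  // stub for the output queue

  requests.foreachRDD { rdd =>
    // Collecting to the driver keeps the sketch simple; large payloads would be
    // published per partition instead.
    rdd.collect().foreach(payload => publishResponse(payload, s"processed: $payload"))
  }

  ssc.start()
  ssc.awaitTermination()
}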

> On Mar 10, 2016, at 12:40 PM, velvia.github  wrote:
> 
> Hi,
> 
> I just wrote a blog post which might be really useful to you -- I have just
> benchmarked being able to achieve 700 queries per second in Spark.  So, yes,
> web speed SQL queries are definitely possible.   Read my new blog post:
> 
> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
> 
> and feel free to email me (at vel...@gmail.com) if you would like to follow
> up.
> 
> -Evan
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
Good stuff, Evan.  Looks like this is utilizing the in-memory capabilities
of FiloDB which is pretty cool.  looking forward to the webcast as I don't
know much about FiloDB.

My personal thoughts here are to remove Spark from the user
request/response hot path.

I can't tell you how many times i've had to unroll that architecture at
clients - and replace with a real database like Cassandra, ElasticSearch,
HBase, MySql.

Unfortunately, Spark - and Spark Streaming, especially - leads you to
believe that Spark could be used as an application server.  This is not a
good use case for Spark.

Remember that every job that is launched by Spark requires 1 CPU core, some
memory, and an available Executor JVM to provide the CPU and memory.

Yes, you can horizontally scale this because of the distributed nature of
Spark, however it is not an efficient scaling strategy.

For example, if you're looking to scale out to 1000 concurrent requests,
this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
cores.  this is just not cost effective.

Use Spark for what it's good for - ad-hoc, interactive, and iterative
(machine learning, graph) analytics.  Use an application server for what
it's good at - managing a large number of concurrent requests.  And use a
database for what it's good for - storing/retrieving data.

And any serious production deployment will need failover, throttling, back
pressure, auto-scaling, and service discovery.

While Spark supports these to varying levels of production-readiness, Spark
is a batch-oriented system and not meant to be put on the user
request/response hot path.

For the failover, throttling, back pressure, autoscaling that i mentioned
above, it's worth checking out the suite of Netflix OSS - particularly
Hystrix, Eureka, Zuul, Karyon, etc:  http://netflix.github.io/
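
In the same spirit - not Hystrix itself, just a plain-Scala sketch of the same
kind of guard rail: run the Spark call with a deadline and fall back to a
cached or empty answer when it is slow or fails. The timeout, pool size and
fallback are arbitrary.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.Try

object GuardedQuery {
  // A bounded pool doubles as a crude throttle on concurrent Spark work.
  private implicit val ec =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

  def withFallback[T](timeout: FiniteDuration, fallback: => T)(query: => T): T =
    Try(Await.result(Future(query), timeout)).getOrElse(fallback)
}

// e.g.  GuardedQuery.withFallback(2.seconds, fallback = Seq.empty[String]) {
//         sqlContext.sql("SELECT ...").collect().map(_.toString).toSeq
//       }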

Here's my github project that incorporates a lot of these:
https://github.com/cfregly/fluxcapacitor

Here's a netflix Skunkworks github project that packages these up in Docker
images:  https://github.com/Netflix-Skunkworks/zerotodocker


On Thu, Mar 10, 2016 at 1:40 PM, velvia.github 
wrote:

> Hi,
>
> I just wrote a blog post which might be really useful to you -- I have just
> benchmarked being able to achieve 700 queries per second in Spark.  So,
> yes,
> web speed SQL queries are definitely possible.   Read my new blog post:
>
> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>
> and feel free to email me (at vel...@gmail.com) if you would like to
> follow
> up.
>
> -Evan
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Re: Can we use spark inside a web service?

2016-03-10 Thread velvia.github
Hi,

I just wrote a blog post which might be really useful to you -- I have just
benchmarked being able to achieve 700 queries per second in Spark.  So, yes,
web speed SQL queries are definitely possible.   Read my new blog post:

http://velvia.github.io/Spark-Concurrent-Fast-Queries/

and feel free to email me (at vel...@gmail.com) if you would like to follow
up.

-Evan




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org