At least for simple queries, the DAGScheduler does not appear to be the bottleneck - we were able to schedule 700 queries per second, even though all of the scheduling is probably done from a single thread in the driver.
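To give a feel for why a single scheduling thread need not be the bottleneck at 700 queries per second, here is a minimal, Spark-free sketch: one dispatcher thread draining a queue of trivial "job submissions". The names and numbers are illustrative, not Spark internals.

```python
import queue
import threading
import time

# A single "scheduler" thread draining a queue of job submissions,
# loosely analogous to the DAGScheduler's single event loop.
submissions = queue.Queue()
done = []

def scheduler_loop():
    while True:
        job = submissions.get()
        if job is None:          # sentinel: stop the loop
            break
        done.append(job)         # stand-in for "schedule the job"

t = threading.Thread(target=scheduler_loop)
t.start()

start = time.perf_counter()
for i in range(700):             # 700 queries, as in the benchmark below
    submissions.put(i)
submissions.put(None)
t.join()
elapsed = time.perf_counter() - start

print(f"dispatched {len(done)} jobs in {elapsed:.4f}s")
```

On any modern machine this finishes in a few milliseconds, so the per-job scheduling decision itself is cheap; the real cost is in running the tasks.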
However, I did have high hopes for Sparrow. What was the reason they decided not to include it?

On Fri, Mar 11, 2016 at 1:52 AM, Hemant Bhanawat <hemant9...@gmail.com> wrote:
> Spark-jobserver is an elegant product that builds concurrency on top of
> Spark. But the current design of the DAGScheduler prevents Spark from
> becoming a truly concurrent solution for low-latency queries. The
> DAGScheduler will turn out to be a bottleneck for low-latency queries.
> The Sparrow project was an effort to make Spark more suitable for such
> scenarios, but it never made it into the Spark codebase. If Spark is to
> become a highly concurrent solution, scheduling has to be distributed.
>
> Hemant Bhanawat
> www.snappydata.io
>
> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <ch...@fregly.com> wrote:
>>
>> great discussion, indeed.
>>
>> Mark Hamstra and i spoke offline just now.
>>
>> Below is a quick recap of our discussion on how they've achieved
>> acceptable performance from Spark on the user request/response path
>> (@mark - feel free to correct/comment).
>>
>> 1) there is a big difference in request/response latency between
>> submitting a full Spark Application (heavyweight) versus having a
>> long-running Spark Application (like Spark Job Server) that submits
>> lighter-weight Jobs using a shared SparkContext. Mark is obviously
>> using the latter - a long-running Spark App.
>>
>> 2) there are some enhancements to Spark that are required to achieve
>> acceptable user request/response times. some links that Mark provided
>> are as follows:
>>
>> https://issues.apache.org/jira/browse/SPARK-11838
>> https://github.com/apache/spark/pull/11036
>> https://github.com/apache/spark/pull/11403
>> https://issues.apache.org/jira/browse/SPARK-13523
>> https://issues.apache.org/jira/browse/SPARK-13756
>>
>> Essentially, a deeper level of caching at the shuffle-file layer to
>> reduce compute and memory between queries.
>>
>> Note that Mark is running a slightly-modified version of stock Spark.
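For the long-running shared-SparkContext pattern described above, Spark's FAIR scheduler pools are the usual building block for keeping short queries from queueing behind long ones. A sketch of a `fairscheduler.xml` (the pool name and weights are illustrative, not from the thread):

```xml
<?xml version="1.0"?>
<!-- fairscheduler.xml: referenced via spark.scheduler.allocation.file -->
<allocations>
  <!-- a pool for short, latency-sensitive queries -->
  <pool name="interactive">
    <schedulingMode>FAIR</schedulingMode>
    <weight>4</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

With `spark.scheduler.mode=FAIR` set on the shared context, each request-handling thread can route its jobs into a pool via `sc.setLocalProperty("spark.scheduler.pool", "interactive")` before submitting work.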
>> (He's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm, personally, seeing more and more
>> slightly-modified versions of Spark being deployed to production to
>> work around outstanding PRs and JIRAs.
>>
>> this may not be what people want to hear, but it's a trend i'm seeing
>> lately as more and more teams customize Spark to their specific use
>> cases.
>>
>> Anyway, thanks for the good discussion, everyone! This is why we have
>> these lists, right? :)
>>
>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>>
>>> One of the premises here is that if you can restrict your workload to
>>> fewer cores - which is easier with FiloDB and careful data modeling -
>>> you can make this work for much higher concurrency and lower latency
>>> than most typical Spark use cases.
>>>
>>> The reason why it typically does not work in production is that most
>>> people are using HDFS and files. These data sources are designed for
>>> running queries and workloads on all your cores across many workers,
>>> not for filtering your workload down to only one or two cores.
>>>
>>> There is actually nothing inherent in Spark that prevents people from
>>> using it as an app server. However, the insistence on using it with
>>> HDFS is what kills concurrency. This is why FiloDB is important.
>>>
>>> I agree there are more optimized stacks for running app servers, but
>>> consider the choices that you mentioned: ES is targeted at text
>>> search; Cassandra and HBase by themselves are not fast enough for the
>>> analytical queries that the OP wants; and MySQL is great but not
>>> scalable. Probably something like VectorWise, HANA, or Vertica would
>>> work well, but those are mostly not free solutions. Druid could work
>>> too if the use case is right.
>>>
>>> Anyway, great discussion!
>>>
>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com> wrote:
>>> > you are correct, mark. i misspoke. apologies for the confusion.
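The "restrict your workload to fewer cores" premise above is usually expressed through configuration on the shared application. A sketch of the relevant `spark-defaults.conf` knobs (the values are illustrative; what is appropriate depends entirely on the workload):

```properties
# spark-defaults.conf (illustrative values)
spark.scheduler.mode          FAIR   # share the context across concurrent jobs
spark.cores.max               4      # cap total cores this app takes (standalone/Mesos)
spark.executor.cores          2      # cores per executor
spark.sql.shuffle.partitions  8      # fewer shuffle tasks for small, targeted queries
```

The point is that a query touching one or two well-chosen partitions only needs one or two task slots, so many such queries can run side by side on a modest cluster.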
>>> >
>>> > so the problem is even worse given that a typical job requires
>>> > multiple tasks/cores.
>>> >
>>> > i have yet to see this particular architecture work in production. i
>>> > would love for someone to prove otherwise.
>>> >
>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>> >>
>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>> >>> requests, this is 1000 concurrent Spark jobs. This would require a
>>> >>> cluster with 1000 cores.
>>> >>
>>> >> This doesn't make sense. A Spark Job is a driver/DAGScheduler
>>> >> concept without any 1:1 correspondence between Worker cores and
>>> >> Jobs. Cores are used to run Tasks, not Jobs. So, yes, a 1000-core
>>> >> cluster can run at most 1000 simultaneous Tasks, but that doesn't
>>> >> really tell you anything about how many Jobs are or can be
>>> >> concurrently tracked by the DAGScheduler, which will be apportioning
>>> >> the Tasks from those concurrent Jobs across the available Executor
>>> >> cores.
>>> >>
>>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com> wrote:
>>> >>>
>>> >>> Good stuff, Evan. Looks like this is utilizing the in-memory
>>> >>> capabilities of FiloDB, which is pretty cool. looking forward to
>>> >>> the webcast, as I don't know much about FiloDB.
>>> >>>
>>> >>> My personal thought here is to remove Spark from the user
>>> >>> request/response hot path.
>>> >>>
>>> >>> I can't tell you how many times i've had to unroll that
>>> >>> architecture at clients - and replace it with a real database like
>>> >>> Cassandra, ElasticSearch, HBase, or MySQL.
>>> >>>
>>> >>> Unfortunately, Spark - and Spark Streaming, especially - leads you
>>> >>> to believe that Spark can be used as an application server. This is
>>> >>> not a good use case for Spark.
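Mark's Jobs-vs-Tasks-vs-cores distinction can be illustrated without Spark at all: a fixed pool of "cores" bounds how many *tasks* run at once, while the number of *jobs* concurrently in flight is bounded only by what the scheduler is willing to track. A minimal sketch, with a thread pool standing in for executor cores (all names and numbers are illustrative):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

CORES = 4          # executor cores: bounds concurrent *tasks*
JOBS = 100         # concurrently tracked *jobs*: not bounded by cores
TASKS_PER_JOB = 3

running = 0
peak_running = 0
lock = threading.Lock()

def task(job_id, task_id):
    global running, peak_running
    with lock:
        running += 1
        peak_running = max(peak_running, running)
    # ... the actual work would happen here ...
    with lock:
        running -= 1
    return (job_id, task_id)

# One pool of CORES "cores" apportions tasks from all jobs,
# the way executors serve tasks from every concurrent job.
with ThreadPoolExecutor(max_workers=CORES) as pool:
    futures = [pool.submit(task, j, t)
               for j in range(JOBS) for t in range(TASKS_PER_JOB)]
    results = [f.result() for f in futures]

print(f"{JOBS} jobs ({len(results)} tasks) completed on {CORES} cores; "
      f"peak concurrent tasks = {peak_running}")
```

All 100 jobs complete even though at most 4 tasks ever run simultaneously - concurrency of jobs and parallelism of tasks are different quantities.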
>>> >>>
>>> >>> Remember that every job that is launched by Spark requires 1 CPU
>>> >>> core, some memory, and an available Executor JVM to provide the CPU
>>> >>> and memory.
>>> >>>
>>> >>> Yes, you can horizontally scale this because of the distributed
>>> >>> nature of Spark, but it is not an efficient scaling strategy.
>>> >>>
>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>> >>> requests, this is 1000 concurrent Spark jobs. This would require a
>>> >>> cluster with 1000 cores. this is just not cost-effective.
>>> >>>
>>> >>> Use Spark for what it's good for - ad-hoc, interactive, and
>>> >>> iterative (machine learning, graph) analytics. Use an application
>>> >>> server for what it's good for - managing a large number of
>>> >>> concurrent requests. And use a database for what it's good for -
>>> >>> storing/retrieving data.
>>> >>>
>>> >>> And any serious production deployment will need failover,
>>> >>> throttling, back pressure, auto-scaling, and service discovery.
>>> >>>
>>> >>> While Spark supports these to varying levels of
>>> >>> production-readiness, Spark is a batch-oriented system and not
>>> >>> meant to be put on the user request/response hot path.
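Whether 1000 concurrent requests really implies 1000 cores depends on per-query latency, which is the crux of the disagreement above. A back-of-the-envelope check using Little's law (the core count and latency are illustrative, not measurements from the thread):

```python
# Little's law: concurrency = throughput * latency.
# If queries are short, a modest number of cores can serve many
# outstanding requests, because each request only holds a core briefly.

cores = 16                 # task slots actually doing work
query_latency_s = 0.050    # 50 ms of core time per query (illustrative)

throughput_qps = cores / query_latency_s   # queries finished per second
print(f"{cores} cores at {query_latency_s * 1000:.0f} ms/query "
      f"sustain ~{throughput_qps:.0f} queries/second")

# 1000 outstanding requests at this throughput means each waits on the
# order of 1000 / throughput seconds - not that 1000 cores are needed.
wait_s = 1000 / throughput_qps
print(f"1000 queued requests drain in ~{wait_s:.1f} s")
```

So the "1000 requests = 1000 cores" equation only holds if each request pins a core for the entire life of the request, which is exactly what Mark argues a Job does not do.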
>>> >>>
>>> >>> For the failover, throttling, back pressure, and auto-scaling that
>>> >>> i mentioned above, it's worth checking out the suite of Netflix
>>> >>> OSS - particularly Hystrix, Eureka, Zuul, Karyon, etc:
>>> >>> http://netflix.github.io/
>>> >>>
>>> >>> Here's my github project that incorporates a lot of these:
>>> >>> https://github.com/cfregly/fluxcapacitor
>>> >>>
>>> >>> Here's a Netflix Skunkworks github project that packages these up
>>> >>> in Docker images:
>>> >>> https://github.com/Netflix-Skunkworks/zerotodocker
>>> >>>
>>> >>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <velvia.git...@gmail.com> wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I just wrote a blog post which might be really useful to you -- I
>>> >>>> have just benchmarked being able to achieve 700 queries per second
>>> >>>> in Spark. So, yes, web-speed SQL queries are definitely possible.
>>> >>>> Read my new blog post:
>>> >>>>
>>> >>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>> >>>>
>>> >>>> and feel free to email me (at vel...@gmail.com) if you would like
>>> >>>> to follow up.
>>> >>>>
>>> >>>> -Evan
>>> >>>>
>>> >>>> --
>>> >>>> View this message in context:
>>> >>>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
>>> >>>> Sent from the Apache Spark User List mailing list archive at
>>> >>>> Nabble.com.
>>> >>>> >>> >>>> >>> >>>> --------------------------------------------------------------------- >>> >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> >>>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>>> >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> >>> >>> Chris Fregly >>> >>> Principal Data Solutions Engineer >>> >>> IBM Spark Technology Center, San Francisco, CA >>> >>> http://spark.tc | http://advancedspark.com >>> >> >>> >> >>> > >>> > >>> > >>> > -- >>> > >>> > Chris Fregly >>> > Principal Data Solutions Engineer >>> > IBM Spark Technology Center, San Francisco, CA >>> > http://spark.tc | http://advancedspark.com >> >> >> >> >> -- >> >> Chris Fregly >> Principal Data Solutions Engineer >> IBM Spark Technology Center, San Francisco, CA >> http://spark.tc | http://advancedspark.com > > --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org