Spark-jobserver is an elegant product that builds concurrency on top of
Spark. But the current design of the DAGScheduler prevents Spark from
becoming a truly concurrent solution for low-latency queries: every job is
scheduled by the single DAGScheduler on the driver, so it turns into a
bottleneck as query concurrency grows. The Sparrow project was an effort to
make Spark more suitable for such scenarios, but it never made it into the
Spark codebase. If Spark is to become a highly concurrent solution,
scheduling has to be distributed.
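
As a small illustration of that funnel (my own sketch, not from Sparrow or
any of the work mentioned in this thread): even when many client threads
submit work against one shared SparkContext, every action becomes a Job
handled by the single driver-side DAGScheduler, so a latency probe like the
one below mostly measures driver-side scheduling overhead. The app name,
master, and job counts are made up for the example.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.{SparkConf, SparkContext}

object DriverSchedulingProbe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("scheduling-probe").setMaster("local[8]"))

    // Fire many trivial Jobs from concurrent client threads. The work per Job
    // is tiny, so the measured latency is dominated by the driver: each count()
    // becomes a Job event processed by the one DAGScheduler event loop.
    val latencies = (1 to 100).map { i =>
      Future {
        val t0 = System.nanoTime()
        sc.parallelize(Seq(i), 1).count()
        (System.nanoTime() - t0) / 1e6 // milliseconds
      }
    }

    val ms = Await.result(Future.sequence(latencies), 10.minutes)
    println(f"median job latency: ${ms.sorted.apply(ms.size / 2)}%.1f ms")
    sc.stop()
  }
}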

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <ch...@fregly.com> wrote:

> Great discussion, indeed.
>
> Mark Hamstra and I spoke offline just now.
>
> Below is a quick recap of our discussion of how they've achieved
> acceptable performance from Spark on the user request/response path (@Mark,
> feel free to correct/comment).
>
> 1) There is a big difference in request/response latency between
> submitting a full Spark Application (heavyweight) versus having a
> long-running Spark Application (like Spark Job Server) that submits
> lighter-weight Jobs using a shared SparkContext.  Mark is obviously using
> the latter - a long-running Spark App.  (A rough sketch of this pattern is
> below.)
>
> 2) There are some enhancements to Spark that are required to achieve
> acceptable user request/response times.  Some links that Mark provided are
> as follows:
>
>    - https://issues.apache.org/jira/browse/SPARK-11838
>    - https://github.com/apache/spark/pull/11036
>    - https://github.com/apache/spark/pull/11403
>    - https://issues.apache.org/jira/browse/SPARK-13523
>    - https://issues.apache.org/jira/browse/SPARK-13756
>
> Essentially, these add a deeper level of caching at the shuffle file layer
> to reduce compute and memory usage between queries.
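>
> Going back to point 1, here is a minimal sketch of the long-running,
> shared-SparkContext pattern (my own illustration, not Mark's or Spark Job
> Server's actual code; the pool name and the request handler are
> hypothetical):
>
> import scala.concurrent.Future
> import scala.concurrent.ExecutionContext.Implicits.global
> import org.apache.spark.{SparkConf, SparkContext}
>
> object SharedContextService {
>   // One long-lived SparkContext for the life of the service -- this is what
>   // Spark Job Server manages for you, instead of a spark-submit per request.
>   val sc = new SparkContext(
>     new SparkConf()
>       .setAppName("request-response-service")
>       .set("spark.scheduler.mode", "FAIR")) // let concurrent Jobs share executors
>
>   // Each incoming request becomes a lightweight Job on the shared context.
>   def handleRequest(requestId: String, n: Int): Future[Long] = Future {
>     // Both properties are per-thread, so setting them inside the Future is intentional.
>     sc.setLocalProperty("spark.scheduler.pool", "interactive") // hypothetical pool
>     sc.setJobGroup(requestId, s"query for request $requestId")
>     sc.parallelize(1 to n, 2).filter(_ % 2 == 0).count()
>   }
> }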
>
> Note that Mark is running a slightly-modified version of stock Spark.
>  (He's mentioned this in prior posts, as well.)
>
> And I have to say that I'm, personally, seeing more and more
> slightly-modified versions of Spark being deployed to production to
> work around outstanding PRs and JIRAs.
>
> This may not be what people want to hear, but it's a trend that I'm seeing
> lately as more and more teams customize Spark to their specific use cases.
>
> Anyway, thanks for the good discussion, everyone!  This is why we have
> these lists, right!  :)
>
>
> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan <velvia.git...@gmail.com>
> wrote:
>
>> One of the premises here is that if you can restrict your workload to
>> fewer cores - which is easier with FiloDB and careful data modeling -
>> you can make this work for much higher concurrency and lower latency
>> than most typical Spark use cases.
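>>
>> For what that can look like in practice, here's a minimal sketch in plain
>> Spark SQL (my own example, not FiloDB code; the config values, path, and
>> column names are made up, and in the FiloDB case it is the partition-key
>> filtering that actually keeps the scan down to a couple of tasks):
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>> import org.apache.spark.sql.SQLContext
>>
>> val conf = new SparkConf()
>>   .setAppName("low-latency-queries")
>>   .set("spark.cores.max", "2")              // cap the cores this app takes on a standalone cluster
>>   .set("spark.sql.shuffle.partitions", "4") // few, small shuffle partitions per query
>> val sc = new SparkContext(conf)
>> val sqlContext = new SQLContext(sc)
>>
>> // A selective predicate on the partitioning column keeps each query down to
>> // one or two tasks instead of fanning out across every core in the cluster.
>> val df = sqlContext.read.parquet("/data/events") // stand-in path for the real source
>> df.filter(df("customer_id") === "c-123").groupBy("day").count().show()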
>>
>> The reason why it typically does not work in production is that most
>> people are using HDFS and files.  These data sources are designed for
>> running queries and workloads on all your cores across many workers,
>> and not for filtering your workload down to only one or two cores.
>>
>> There is actually nothing inherent in Spark that prevents people from
>> using it as an app server.   However, the insistence on using it with
>> HDFS is what kills concurrency.   This is why FiloDB is important.
>>
>> I agree there are more optimized stacks for running app servers, but
>> the choices that you mentioned:  ES is targeted at text search;  Cass
>> and HBase by themselves are not fast enough for analytical queries
>> that the OP wants;  and MySQL is great but not scalable.   Probably
>> something like VectorWise, HANA, Vertica would work well, but those
>> are mostly not free solutions.   Druid could work too if the use case
>> is right.
>>
>> Anyways, great discussion!
>>
>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com> wrote:
>> > You are correct, Mark.  I misspoke.  Apologies for the confusion.
>> >
>> > So the problem is even worse, given that a typical job requires multiple
>> > tasks/cores.
>> >
>> > I have yet to see this particular architecture work in production.  I
>> > would love for someone to prove otherwise.
>> >
>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <m...@clearstorydata.com>
>> > wrote:
>> >>>
>> >>> For example, if you're looking to scale out to 1000 concurrent requests,
>> >>> this is 1000 concurrent Spark jobs.  This would require a cluster with
>> >>> 1000 cores.
>> >>
>> >> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
>> >> without any 1:1 correspondence between Worker cores and Jobs.  Cores are
>> >> used to run Tasks, not Jobs.  So, yes, a 1000-core cluster can run at most
>> >> 1000 simultaneous Tasks, but that doesn't really tell you anything about
>> >> how many Jobs are or can be concurrently tracked by the DAGScheduler,
>> >> which will be apportioning the Tasks from those concurrent Jobs across
>> >> the available Executor cores.
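>> >>
>> >> To make the Job/Task/core distinction concrete, here's a tiny sketch (my
>> >> own illustration, not from this thread): on a 4-core local master the
>> >> DAGScheduler happily tracks dozens of concurrent Jobs; the 4 cores only
>> >> bound how many of their Tasks run at any instant.
>> >>
>> >> import scala.concurrent.{Await, Future}
>> >> import scala.concurrent.duration._
>> >> import scala.concurrent.ExecutionContext.Implicits.global
>> >> import org.apache.spark.{SparkConf, SparkContext}
>> >>
>> >> val sc = new SparkContext(
>> >>   new SparkConf().setAppName("jobs-vs-tasks").setMaster("local[4]"))
>> >>
>> >> // 40 concurrent Jobs, each made of a single 1-partition Task: the DAGScheduler
>> >> // tracks all 40 Jobs at once, while the 4 cores run at most 4 Tasks at a time.
>> >> val results = (1 to 40).map { i =>
>> >>   Future { sc.parallelize(Seq(i), numSlices = 1).map(_ * 2).collect() }
>> >> }
>> >> Await.result(Future.sequence(results), 2.minutes)
>> >> sc.stop()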
>> >>
>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com> wrote:
>> >>>
>> >>> Good stuff, Evan.  Looks like this is utilizing the in-memory
>> >>> capabilities of FiloDB, which is pretty cool.  Looking forward to the
>> >>> webcast, as I don't know much about FiloDB.
>> >>>
>> >>> My personal thought here is to remove Spark from the user
>> >>> request/response hot path.
>> >>>
>> >>> I can't tell you how many times I've had to unroll that architecture at
>> >>> clients - and replace it with a real database like Cassandra,
>> >>> Elasticsearch, HBase, or MySQL.
>> >>>
>> >>> Unfortunately, Spark - and Spark Streaming, especially - leads you to
>> >>> believe that Spark could be used as an application server.  This is not
>> >>> a good use case for Spark.
>> >>>
>> >>> Remember that every job that is launched by Spark requires 1 CPU core,
>> >>> some memory, and an available Executor JVM to provide the CPU and memory.
>> >>>
>> >>> Yes, you can horizontally scale this because of the distributed nature
>> >>> of Spark; however, it is not an efficient scaling strategy.
>> >>>
>> >>> For example, if you're looking to scale out to 1000 concurrent requests,
>> >>> this is 1000 concurrent Spark jobs.  This would require a cluster with
>> >>> 1000 cores.  This is just not cost-effective.
>> >>>
>> >>> Use Spark for what it's good for - ad hoc, interactive, and iterative
>> >>> (machine learning, graph) analytics.  Use an application server for what
>> >>> it's good for - managing a large number of concurrent requests.  And use
>> >>> a database for what it's good for - storing/retrieving data.
>> >>>
>> >>> And any serious production deployment will need failover, throttling,
>> >>> back pressure, auto-scaling, and service discovery.
>> >>>
>> >>> While Spark supports these to varying levels of production-readiness,
>> >>> Spark is a batch-oriented system and not meant to be put on the user
>> >>> request/response hot path.
>> >>>
>> >>> For the failover, throttling, back pressure, and autoscaling that I
>> >>> mentioned above, it's worth checking out the Netflix OSS suite -
>> >>> particularly Hystrix, Eureka, Zuul, Karyon, etc.:  http://netflix.github.io/
>> >>>
>> >>> Here's my github project that incorporates a lot of these:
>> >>> https://github.com/cfregly/fluxcapacitor
>> >>>
>> >>> Here's a Netflix Skunkworks GitHub project that packages these up in
>> >>> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>> >>>
>> >>>
>> >>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <velvia.git...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I just wrote a blog post which might be really useful to you -- I have
>> >>>> just benchmarked being able to achieve 700 queries per second in Spark.
>> >>>> So, yes, web-speed SQL queries are definitely possible.   Read my new
>> >>>> blog post:
>> >>>>
>> >>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>> >>>>
>> >>>> and feel free to email me (at vel...@gmail.com) if you would like to
>> >>>> follow up.
>> >>>>
>> >>>> -Evan
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>>
>
>
>
> --
>
> Chris Fregly
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>
