The fact that a typical Job requires multiple Tasks is not a problem, but
rather an opportunity for the Scheduler to interleave the workloads of
multiple concurrent Jobs across the available cores.
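A toy sketch of that interleaving (plain Python with a thread pool standing in for Executor cores — this is an illustration of the scheduling idea, not Spark itself, and all names and sizes are made up):

```python
from concurrent.futures import ThreadPoolExecutor

CORES = 4          # stand-in for available Executor cores
JOBS = 16          # concurrent "jobs", each split into short "tasks"
TASKS_PER_JOB = 8

def task(job_id, task_id):
    # a trivially small unit of work, standing in for a Spark Task
    return job_id * TASKS_PER_JOB + task_id

with ThreadPoolExecutor(max_workers=CORES) as pool:
    # all jobs' tasks are submitted together; the pool interleaves
    # the 128 tasks across 4 workers rather than needing 16 cores
    futures = [pool.submit(task, j, t)
               for j in range(JOBS) for t in range(TASKS_PER_JOB)]
    results = [f.result() for f in futures]

assert len(results) == JOBS * TASKS_PER_JOB  # every task completed
```

The point of the sketch: the number of workers bounds how many tasks run at one instant, not how many jobs can be in flight, which is the same distinction Mark draws below between Executor cores and DAGScheduler Jobs.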

I work every day with exactly such a production architecture: Spark on the
user request/response hot path.

On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com> wrote:

> you are correct, mark.  i misspoke.  apologies for the confusion.
>
> so the problem is even worse given that a typical job requires multiple
> tasks/cores.
>
> i have yet to see this particular architecture work in production.  i
> would love for someone to prove otherwise.
>
> On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>>> cores.
>>
>>
>> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
>> without any 1:1 correspondence between Worker cores and Jobs.  Cores are
>> used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run at most
>> 1000 simultaneous Tasks, but that doesn't really tell you anything about
>> how many Jobs are or can be concurrently tracked by the DAGScheduler, which
>> will be apportioning the Tasks from those concurrent Jobs across the
>> available Executor cores.
>>
>> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com> wrote:
>>
>>> Good stuff, Evan.  Looks like this is utilizing the in-memory
>>> capabilities of FiloDB which is pretty cool.  looking forward to the
>>> webcast as I don't know much about FiloDB.
>>>
>>> My personal recommendation here is to remove Spark from the user
>>> request/response hot path.
>>>
>>> I can't tell you how many times i've had to unroll that architecture at
>>> clients - and replace it with a real database like Cassandra,
>>> Elasticsearch, HBase, or MySQL.
>>>
>>> Unfortunately, Spark - and Spark Streaming, especially - leads you to
>>> believe that Spark could be used as an application server.  This is not a
>>> good use case for Spark.
>>>
>>> Remember that every job that is launched by Spark requires 1 CPU core,
>>> some memory, and an available Executor JVM to provide the CPU and memory.
>>>
>>> Yes, you can horizontally scale this because of the distributed nature
>>> of Spark, however it is not an efficient scaling strategy.
>>>
>>> For example, if you're looking to scale out to 1000 concurrent requests,
>>> this is 1000 concurrent Spark jobs.  This would require a cluster with 1000
>>> cores.  This is just not cost-effective.
>>>
>>> Use Spark for what it's good for - ad-hoc, interactive, and iterative
>>> (machine learning, graph) analytics.  Use an application server for what
>>> it's good for - managing a large number of concurrent requests.  And use a
>>> database for what it's good for - storing/retrieving data.
>>>
>>> And any serious production deployment will need failover, throttling,
>>> back pressure, auto-scaling, and service discovery.
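A minimal sketch of the throttling/fail-fast idea mentioned here (plain Python with a semaphore gating admission — this illustrates the pattern, not Hystrix itself; the limit and names are made up):

```python
import threading

MAX_IN_FLIGHT = 2          # admission limit (illustrative)
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def try_admit(request_id):
    """Admit a request only if a slot is free; otherwise fail fast
    instead of queueing unboundedly in front of a slow backend."""
    if slots.acquire(blocking=False):
        return f"accepted {request_id}"
    return f"rejected {request_id}"

# two requests occupy both slots; the third is shed immediately
print(try_admit(1))  # accepted 1
print(try_admit(2))  # accepted 2
print(try_admit(3))  # rejected 3
slots.release()      # a request finishes, freeing a slot
print(try_admit(4))  # accepted 4
```

Shedding load at admission time like this is the core of circuit-breaker-style back pressure: the caller gets an immediate rejection it can retry or degrade on, rather than a request piling up behind an overloaded service.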
>>>
>>> While Spark supports these to varying levels of production-readiness,
>>> Spark is a batch-oriented system and not meant to be put on the user
>>> request/response hot path.
>>>
>>> For the failover, throttling, back pressure, autoscaling that i
>>> mentioned above, it's worth checking out the suite of Netflix OSS -
>>> particularly Hystrix, Eureka, Zuul, Karyon, etc:
>>> http://netflix.github.io/
>>>
>>> Here's my github project that incorporates a lot of these:
>>> https://github.com/cfregly/fluxcapacitor
>>>
>>> Here's a netflix Skunkworks github project that packages these up in
>>> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>>>
>>>
>>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <velvia.git...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I just wrote a blog post which might be really useful to you -- I have
>>>> just benchmarked achieving 700 queries per second in Spark.  So, yes,
>>>> web-speed SQL queries are definitely possible.  Read my new blog post:
>>>>
>>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>>>
>>>> and feel free to email me (at vel...@gmail.com) if you would like to
>>>> follow up.
>>>>
>>>> -Evan
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>
>>>
>>> --
>>>
>>> *Chris Fregly*
>>> Principal Data Solutions Engineer
>>> IBM Spark Technology Center, San Francisco, CA
>>> http://spark.tc | http://advancedspark.com
>>>
>>
>>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>
