The fact that a typical Job requires multiple Tasks is not a problem, but rather an opportunity for the Scheduler to interleave the workloads of multiple concurrent Jobs across the available cores.
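To make the Jobs-vs-Tasks distinction concrete, here is a toy, standalone Python sketch (not Spark code; the core, job, and task counts are invented for illustration) of how one fixed pool of cores can service the tasks of many concurrent jobs, so N concurrent jobs do not imply N dedicated cores:

```python
from concurrent.futures import ThreadPoolExecutor

CORES = 8          # toy stand-in for total executor cores in the cluster
JOBS = 100         # concurrent "jobs" tracked by the toy scheduler
TASKS_PER_JOB = 4  # each job fans out into multiple tasks

def run_task(job_id: int, task_id: int) -> int:
    # Stand-in for real work; a task occupies one "core" only while it runs.
    return job_id * TASKS_PER_JOB + task_id

# One shared pool of CORES workers interleaves tasks from all 100 jobs,
# so 100 concurrent jobs need neither 100 nor 400 dedicated cores.
with ThreadPoolExecutor(max_workers=CORES) as pool:
    futures = [
        pool.submit(run_task, j, t)
        for j in range(JOBS)
        for t in range(TASKS_PER_JOB)
    ]
    results = [f.result() for f in futures]

assert len(results) == JOBS * TASKS_PER_JOB  # all 400 tasks completed on 8 "cores"
```

The analogy to Mark's point below: the pool is the cluster's executor cores, and the scheduler's job is exactly this apportioning of tasks, not a 1:1 pinning of jobs to cores.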
I work every day with such a production architecture with Spark on the user request/response hot path.

On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com> wrote:

> You are correct, Mark. I misspoke; apologies for the confusion.
>
> So the problem is even worse, given that a typical job requires multiple tasks/cores.
>
> I have yet to see this particular architecture work in production. I would love for someone to prove otherwise.
>
> On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>>> For example, if you're looking to scale out to 1000 concurrent requests, this is 1000 concurrent Spark jobs. This would require a cluster with 1000 cores.
>>
>> This doesn't make sense. A Spark Job is a driver/DAGScheduler concept without any 1:1 correspondence between Worker cores and Jobs. Cores are used to run Tasks, not Jobs. So, yes, a 1000-core cluster can run at most 1000 simultaneous Tasks, but that doesn't really tell you anything about how many Jobs are or can be concurrently tracked by the DAGScheduler, which will be apportioning the Tasks from those concurrent Jobs across the available Executor cores.
>>
>> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com> wrote:
>>
>>> Good stuff, Evan. Looks like this is utilizing the in-memory capabilities of FiloDB, which is pretty cool. Looking forward to the webcast, as I don't know much about FiloDB.
>>>
>>> My personal thought here is to remove Spark from the user request/response hot path.
>>>
>>> I can't tell you how many times I've had to unroll that architecture at clients - and replace it with a real database like Cassandra, ElasticSearch, HBase, or MySQL.
>>>
>>> Unfortunately, Spark - and Spark Streaming especially - leads you to believe that Spark could be used as an application server. This is not a good use case for Spark.
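As an aside on the concurrent-jobs model Mark describes: within a single SparkContext, jobs submitted from multiple threads can share executor cores via Spark's FAIR scheduler (`spark.scheduler.mode=FAIR`). A minimal allocation-file sketch; the pool name, weight, and minShare values here are illustrative, not from this thread:

```xml
<!-- fairscheduler.xml: defines a pool for interactive query jobs -->
<allocations>
  <pool name="interactive">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>4</minShare>
  </pool>
</allocations>
```

A request-handling thread then opts into the pool with `sc.setLocalProperty("spark.scheduler.pool", "interactive")` before triggering an action, so its tasks are scheduled fairly against other concurrent jobs rather than FIFO behind them.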
>>> Remember that every job launched by Spark requires 1 CPU core, some memory, and an available Executor JVM to provide that CPU and memory.
>>>
>>> Yes, you can scale this horizontally because of the distributed nature of Spark; however, it is not an efficient scaling strategy.
>>>
>>> For example, if you're looking to scale out to 1000 concurrent requests, this is 1000 concurrent Spark jobs. This would require a cluster with 1000 cores. That is just not cost-effective.
>>>
>>> Use Spark for what it's good for - ad-hoc, interactive, and iterative (machine learning, graph) analytics. Use an application server for what it's good at - managing a large number of concurrent requests. And use a database for what it's good for - storing and retrieving data.
>>>
>>> Any serious production deployment will also need failover, throttling, back pressure, auto-scaling, and service discovery.
>>>
>>> While Spark supports these to varying levels of production-readiness, Spark is a batch-oriented system and not meant to be put on the user request/response hot path.
>>>
>>> For the failover, throttling, back pressure, and auto-scaling I mentioned above, it's worth checking out the Netflix OSS suite - particularly Hystrix, Eureka, Zuul, Karyon, etc.: http://netflix.github.io/
>>>
>>> Here's my GitHub project that incorporates a lot of these: https://github.com/cfregly/fluxcapacitor
>>>
>>> Here's a Netflix Skunkworks GitHub project that packages these up in Docker images: https://github.com/Netflix-Skunkworks/zerotodocker
>>>
>>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <velvia.git...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I just wrote a blog post which might be really useful to you -- I have just benchmarked being able to achieve 700 queries per second in Spark. So, yes, web-speed SQL queries are definitely possible.
>>>> Read my new blog post:
>>>>
>>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>>>
>>>> and feel free to email me (at vel...@gmail.com) if you would like to follow up.
>>>>
>>>> -Evan
>>>>
>>>> --
>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> --
>>> *Chris Fregly*
>>> Principal Data Solutions Engineer
>>> IBM Spark Technology Center, San Francisco, CA
>>> http://spark.tc | http://advancedspark.com
>
> --
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
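The failover/throttling theme Chris raises (Hystrix et al.) comes down to failing fast when a backend is struggling instead of queueing requests behind it. A toy, standalone Python sketch of the circuit-breaker idea - the class name and threshold are invented, and real Hystrix (a Java library) also supports timeouts and half-open recovery, which are omitted here:

```python
class CircuitBreaker:
    """Minimal circuit breaker: after max_failures consecutive failures,
    the circuit 'opens' and calls fail fast rather than hitting the
    struggling backend again."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.max_failures:
            # Open circuit: fail fast, don't touch the backend.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success closes the circuit again
        return result


breaker = CircuitBreaker(max_failures=2)

def flaky():
    raise IOError("backend down")

# Two real failures trip the breaker...
for _ in range(2):
    try:
        breaker.call(flaky)
    except IOError:
        pass

# ...so the third call fails fast without reaching the backend.
try:
    breaker.call(flaky)
except RuntimeError as e:
    print(e)  # -> circuit open: failing fast
```

On a Spark-backed request path, the wrapped `fn` would be the call that triggers a Spark job, letting the front end shed load instead of piling jobs onto a saturated cluster.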