At least for simple queries, the DAGScheduler does not appear to be the bottleneck - we were able to schedule 700 queries per second, even though all of the scheduling is probably done from a single thread in the driver.
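To give a feel for why a single scheduling thread need not be the bottleneck at 700 queries per second, here is a minimal, Spark-free sketch: one dispatcher thread draining a queue of trivial "job submissions". The names and numbers are illustrative, not Spark internals.

```python
import queue
import threading
import time

# A single "scheduler" thread draining a queue of job submissions,
# loosely analogous to the DAGScheduler's single event loop.
submissions = queue.Queue()
done = []

def scheduler_loop():
    while True:
        job = submissions.get()
        if job is None:          # sentinel: stop the loop
            break
        done.append(job)         # stand-in for "schedule the job"

t = threading.Thread(target=scheduler_loop)
t.start()

start = time.perf_counter()
for i in range(700):             # 700 queries, as in the benchmark below
    submissions.put(i)
submissions.put(None)
t.join()
elapsed = time.perf_counter() - start

print(f"dispatched {len(done)} jobs in {elapsed:.4f}s")
```

On any modern machine this finishes in a few milliseconds, so the per-job scheduling decision itself is cheap; the real cost is in running the tasks.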
However, I did have high hopes for Sparrow. What was the reason they decided not to include it?

On Fri, Mar 11, 2016 at 1:52 AM, Hemant Bhanawat <hemant9...@gmail.com> wrote:
> Spark-jobserver is an elegant product that builds concurrency on top of
> Spark. But the current design of the DAGScheduler prevents Spark from
> becoming a truly concurrent solution for low-latency queries. The
> DAGScheduler will turn out to be a bottleneck for low-latency queries.
> The Sparrow project was an effort to make Spark more suitable for such
> scenarios, but it never made it into the Spark codebase. If Spark is to
> become a highly concurrent solution, scheduling has to be distributed.
>
> Hemant Bhanawat
> www.snappydata.io
>
> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <ch...@fregly.com> wrote:
>>
>> great discussion, indeed.
>>
>> Mark Hamstra and i spoke offline just now.
>>
>> Below is a quick recap of our discussion on how they've achieved
>> acceptable performance from Spark on the user request/response path
>> (@mark - feel free to correct/comment).
>>
>> 1) there is a big difference in request/response latency between
>> submitting a full Spark Application (heavyweight) versus having a
>> long-running Spark Application (like Spark Job Server) that submits
>> lighter-weight Jobs using a shared SparkContext. Mark is obviously
>> using the latter - a long-running Spark App.
>>
>> 2) there are some enhancements to Spark that are required to achieve
>> acceptable user request/response times. some links that Mark provided
>> are as follows:
>>
>> https://issues.apache.org/jira/browse/SPARK-11838
>> https://github.com/apache/spark/pull/11036
>> https://github.com/apache/spark/pull/11403
>> https://issues.apache.org/jira/browse/SPARK-13523
>> https://issues.apache.org/jira/browse/SPARK-13756
>>
>> Essentially, a deeper level of caching at the shuffle-file layer to
>> reduce compute and memory between queries.
>>
>> Note that Mark is running a slightly-modified version of stock Spark.
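For the long-running shared-SparkContext pattern described above, Spark's FAIR scheduler pools are the usual building block for keeping short queries from queueing behind long ones. A sketch of a `fairscheduler.xml` (the pool name and weights are illustrative, not from the thread):

```xml
<?xml version="1.0"?>
<!-- fairscheduler.xml: referenced via spark.scheduler.allocation.file -->
<allocations>
  <!-- a pool for short, latency-sensitive queries -->
  <pool name="interactive">
    <schedulingMode>FAIR</schedulingMode>
    <weight>4</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

With `spark.scheduler.mode=FAIR` set on the shared context, each request-handling thread can route its jobs into a pool via `sc.setLocalProperty("spark.scheduler.pool", "interactive")` before submitting work.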
>> (He's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm, personally, seeing more and more
>> slightly-modified versions of Spark being deployed to production to
>> work around outstanding PRs and JIRAs.
>>
>> this may not be what people want to hear, but it's a trend i'm seeing
>> lately as more and more teams customize Spark to their specific use
>> cases.
>>
>> Anyway, thanks for the good discussion, everyone! This is why we have
>> these lists, right? :)
>>
>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>>
>>> One of the premises here is that if you can restrict your workload to
>>> fewer cores - which is easier with FiloDB and careful data modeling -
>>> you can make this work for much higher concurrency and lower latency
>>> than most typical Spark use cases.
>>>
>>> The reason why it typically does not work in production is that most
>>> people are using HDFS and files. These data sources are designed for
>>> running queries and workloads on all your cores across many workers,
>>> not for filtering your workload down to only one or two cores.
>>>
>>> There is actually nothing inherent in Spark that prevents people from
>>> using it as an app server. However, the insistence on using it with
>>> HDFS is what kills concurrency. This is why FiloDB is important.
>>>
>>> I agree there are more optimized stacks for running app servers, but
>>> consider the choices that you mentioned: ES is targeted at text
>>> search; Cassandra and HBase by themselves are not fast enough for the
>>> analytical queries that the OP wants; and MySQL is great but not
>>> scalable. Probably something like VectorWise, HANA, or Vertica would
>>> work well, but those are mostly not free solutions. Druid could work
>>> too if the use case is right.
>>>
>>> Anyway, great discussion!
>>>
>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com> wrote:
>>> > you are correct, mark. i misspoke. apologies for the confusion.
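The "restrict your workload to fewer cores" premise above is usually expressed through configuration on the shared application. A sketch of the relevant `spark-defaults.conf` knobs (the values are illustrative; what is appropriate depends entirely on the workload):

```properties
# spark-defaults.conf (illustrative values)
spark.scheduler.mode          FAIR   # share the context across concurrent jobs
spark.cores.max               4      # cap total cores this app takes (standalone/Mesos)
spark.executor.cores          2      # cores per executor
spark.sql.shuffle.partitions  8      # fewer shuffle tasks for small, targeted queries
```

The point is that a query touching one or two well-chosen partitions only needs one or two task slots, so many such queries can run side by side on a modest cluster.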
>>> >
>>> > so the problem is even worse given that a typical job requires
>>> > multiple tasks/cores.
>>> >
>>> > i have yet to see this particular architecture work in production. i
>>> > would love for someone to prove otherwise.
>>> >
>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>> >>
>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>> >>> requests, this is 1000 concurrent Spark jobs. This would require a
>>> >>> cluster with 1000 cores.
>>> >>
>>> >> This doesn't make sense. A Spark Job is a driver/DAGScheduler
>>> >> concept without any 1:1 correspondence between Worker cores and
>>> >> Jobs. Cores are used to run Tasks, not Jobs. So, yes, a 1000-core
>>> >> cluster can run at most 1000 simultaneous Tasks, but that doesn't
>>> >> really tell you anything about how many Jobs are or can be
>>> >> concurrently tracked by the DAGScheduler, which will be apportioning
>>> >> the Tasks from those concurrent Jobs across the available Executor
>>> >> cores.
>>> >>
>>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com> wrote:
>>> >>>
>>> >>> Good stuff, Evan. Looks like this is utilizing the in-memory
>>> >>> capabilities of FiloDB, which is pretty cool. looking forward to
>>> >>> the webcast, as I don't know much about FiloDB.
>>> >>>
>>> >>> My personal thought here is to remove Spark from the user
>>> >>> request/response hot path.
>>> >>>
>>> >>> I can't tell you how many times i've had to unroll that
>>> >>> architecture at clients - and replace it with a real database like
>>> >>> Cassandra, ElasticSearch, HBase, or MySQL.
>>> >>>
>>> >>> Unfortunately, Spark - and Spark Streaming, especially - leads you
>>> >>> to believe that Spark can be used as an application server. This is
>>> >>> not a good use case for Spark.
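Mark's Jobs-vs-Tasks-vs-cores distinction can be illustrated without Spark at all: a fixed pool of "cores" bounds how many *tasks* run at once, while the number of *jobs* concurrently in flight is bounded only by what the scheduler is willing to track. A minimal sketch, with a thread pool standing in for executor cores (all names and numbers are illustrative):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

CORES = 4          # executor cores: bounds concurrent *tasks*
JOBS = 100         # concurrently tracked *jobs*: not bounded by cores
TASKS_PER_JOB = 3

running = 0
peak_running = 0
lock = threading.Lock()

def task(job_id, task_id):
    global running, peak_running
    with lock:
        running += 1
        peak_running = max(peak_running, running)
    # ... the actual work would happen here ...
    with lock:
        running -= 1
    return (job_id, task_id)

# One pool of CORES "cores" apportions tasks from all jobs,
# the way executors serve tasks from every concurrent job.
with ThreadPoolExecutor(max_workers=CORES) as pool:
    futures = [pool.submit(task, j, t)
               for j in range(JOBS) for t in range(TASKS_PER_JOB)]
    results = [f.result() for f in futures]

print(f"{JOBS} jobs ({len(results)} tasks) completed on {CORES} cores; "
      f"peak concurrent tasks = {peak_running}")
```

All 100 jobs complete even though at most 4 tasks ever run simultaneously - concurrency of jobs and parallelism of tasks are different quantities.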
>>> >>>
>>> >>> Remember that every job that is launched by Spark requires 1 CPU
>>> >>> core, some memory, and an available Executor JVM to provide the CPU
>>> >>> and memory.
>>> >>>
>>> >>> Yes, you can horizontally scale this because of the distributed
>>> >>> nature of Spark, but it is not an efficient scaling strategy.
>>> >>>
>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>> >>> requests, this is 1000 concurrent Spark jobs. This would require a
>>> >>> cluster with 1000 cores. this is just not cost-effective.
>>> >>>
>>> >>> Use Spark for what it's good for - ad-hoc, interactive, and
>>> >>> iterative (machine learning, graph) analytics. Use an application
>>> >>> server for what it's good for - managing a large number of
>>> >>> concurrent requests. And use a database for what it's good for -
>>> >>> storing/retrieving data.
>>> >>>
>>> >>> And any serious production deployment will need failover,
>>> >>> throttling, back pressure, auto-scaling, and service discovery.
>>> >>>
>>> >>> While Spark supports these to varying levels of
>>> >>> production-readiness, Spark is a batch-oriented system and not
>>> >>> meant to be put on the user request/response hot path.
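Whether 1000 concurrent requests really implies 1000 cores depends on per-query latency, which is the crux of the disagreement above. A back-of-the-envelope check using Little's law (the core count and latency are illustrative, not measurements from the thread):

```python
# Little's law: concurrency = throughput * latency.
# If queries are short, a modest number of cores can serve many
# outstanding requests, because each request only holds a core briefly.

cores = 16                 # task slots actually doing work
query_latency_s = 0.050    # 50 ms of core time per query (illustrative)

throughput_qps = cores / query_latency_s   # queries finished per second
print(f"{cores} cores at {query_latency_s * 1000:.0f} ms/query "
      f"sustain ~{throughput_qps:.0f} queries/second")

# 1000 outstanding requests at this throughput means each waits on the
# order of 1000 / throughput seconds - not that 1000 cores are needed.
wait_s = 1000 / throughput_qps
print(f"1000 queued requests drain in ~{wait_s:.1f} s")
```

So the "1000 requests = 1000 cores" equation only holds if each request pins a core for the entire life of the request, which is exactly what Mark argues a Job does not do.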
>>> >>>
>>> >>> For the failover, throttling, back pressure, and auto-scaling that
>>> >>> i mentioned above, it's worth checking out the suite of Netflix
>>> >>> OSS - particularly Hystrix, Eureka, Zuul, Karyon, etc:
>>> >>> http://netflix.github.io/
>>> >>>
>>> >>> Here's my github project that incorporates a lot of these:
>>> >>> https://github.com/cfregly/fluxcapacitor
>>> >>>
>>> >>> Here's a Netflix Skunkworks github project that packages these up
>>> >>> in Docker images:
>>> >>> https://github.com/Netflix-Skunkworks/zerotodocker
>>> >>>
>>> >>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <velvia.git...@gmail.com> wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I just wrote a blog post which might be really useful to you -- I
>>> >>>> have just benchmarked being able to achieve 700 queries per second
>>> >>>> in Spark. So, yes, web-speed SQL queries are definitely possible.
>>> >>>> Read my new blog post:
>>> >>>>
>>> >>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>> >>>>
>>> >>>> and feel free to email me (at vel...@gmail.com) if you would like
>>> >>>> to follow up.
>>> >>>>
>>> >>>> -Evan
>>> >>>>
>>> >>>> --
>>> >>>> View this message in context:
>>> >>>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
>>> >>>> Sent from the Apache Spark User List mailing list archive at
>>> >>>> Nabble.com.
>>> >>>> >>> >>>> >>> >>>> --------------------------------------------------------------------- >>> >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> >>>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>>> >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> >>> >>> Chris Fregly >>> >>> Principal Data Solutions Engineer >>> >>> IBM Spark Technology Center, San Francisco, CA >>> >>> http://spark.tc | http://advancedspark.com >>> >> >>> >> >>> > >>> > >>> > >>> > -- >>> > >>> > Chris Fregly >>> > Principal Data Solutions Engineer >>> > IBM Spark Technology Center, San Francisco, CA >>> > http://spark.tc | http://advancedspark.com >> >> >> >> >> -- >> >> Chris Fregly >> Principal Data Solutions Engineer >> IBM Spark Technology Center, San Francisco, CA >> http://spark.tc | http://advancedspark.com > > --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org