Thanks for the suggestions. I agree that there isn't a magic configuration
setting, and I'm not suggesting the SQL operations themselves are flawed - I
just wanted to convey the frustration of having a non-trivial (but still
simple) Spark Streaming job perform absolutely horribly on tiny amounts of
data.

.count() is something I was adding to try to force calculation, and I agree
it might not be the best of tests.
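For reference, this is roughly how I'm taking the timings mentioned below - a
simplified sketch, not our actual code (the helper name and the DataFrame it
wraps are placeholders):

// Simplified sketch of the timing harness (placeholder names):
// wrap something that forces evaluation, e.g. DataFrame.count(), and report wall time.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// e.g. val n = timed("count on joined DataFrame") { joinedDF.count() }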
On Wed, Jun 1, 2016 at 2:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
> There isn't a magic Spark configuration setting that would account for
> multiple-second-long fixed overheads; you should be looking at maybe
> 200ms minimum for a streaming batch. 1024 Kafka topic-partitions is
> not reasonable for the volume you're talking about. Unless you have
> really extreme workloads, 32 or 64 is a better starting guess.
>
> Rather than jumping to conclusions about SQL operations being the
> problem, start from the very beginning. Read a stream of messages
> from Kafka and just do .foreach(println), at a reasonable batch size
> (say 500ms or a second), and see how that keeps up in your
> environment. Don't use take(), don't use count(), don't use print(),
> since they may have non-obvious performance implications.
>
> If that works, add on further operations one step at a time and see
> when issues arise.
>
> On Mon, May 30, 2016 at 8:45 PM, Malcolm Lockyer
> <malcolm.lock...@hapara.com> wrote:
>> Hopefully this is not off topic for this list, but I am hoping to
>> reach some people who have used Kafka + Spark before.
>>
>> We are new to Spark and are setting up our first production
>> environment, and we are hitting a speed issue that may be
>> configuration related - we have little experience in configuring
>> Spark environments.
>>
>> So we've got a Spark Streaming job that seems to take an inordinate
>> amount of time to process. I realize that without specifics it is
>> difficult to trace - however, even the most basic primitives in Spark
>> are performing horribly. The lazy nature of Spark is making it
>> difficult for me to understand what is happening - any suggestions
>> are very much appreciated.
>>
>> The environment is an MBP 2.2 i7. The Spark master is "local[*]". We
>> are using Kafka and PostgreSQL, both local. The job is designed to:
>>
>> a) grab some data from Kafka
>> b) correlate it with existing data in PostgreSQL
>> c) output data to Kafka
>>
>> I am isolating timings by calling System.nanoTime() before and after
>> something that forces calculation, for example .count() on a
>> DataFrame. It seems like every operation has a MASSIVE fixed
>> overhead, and that is stacking up, making each iteration on the RDD
>> extremely slow. Slow operations include pulling a single item from
>> the Kafka queue, running a simple query against PostgreSQL, and
>> running a Spark aggregation on an RDD with a handful of rows.
>>
>> The machine is not maxing out on memory, disk, or CPU. The machine
>> seems to be doing nothing for a high percentage of the execution
>> time. We have reproduced this behavior on two other machines, so we
>> are suspecting a configuration issue.
>>
>> As a concrete example, we have a DataFrame produced by running a JDBC
>> query by mapping over an RDD from Kafka. Calling count() (I guess
>> forcing execution) on this DataFrame when there is *1* item/row
>> (note: the SQL database is EMPTY at this point, so this is not a
>> factor) takes 4.5 seconds; calling count() when there are 10,000
>> items takes 7 seconds.
>>
>> Can anybody offer experience of something like this happening for
>> them? Any suggestions on how to understand what is going wrong?
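(Interjecting on my own message to make that concrete: the step being timed
is shaped roughly like the sketch below. This is simplified, with placeholder
topic/table/column names, not our literal code.)

// Simplified sketch of the step being timed (placeholder names throughout):
// micro-batch from Kafka -> tiny DataFrame, correlate with Postgres via the
// JDBC source, then count() to force execution.
stream.foreachRDD { rdd =>                       // `stream` is the Kafka direct stream
  import sqlContext.implicits._                  // `sqlContext` is the usual SQLContext

  val batchDF = rdd.map { case (_, value) => Tuple1(value) }.toDF("event_id")

  val existingDF = sqlContext.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")   // placeholder connection
    .option("dbtable", "events")                              // placeholder table
    .load()

  val joined = batchDF.join(existingDF, "event_id")

  val start = System.nanoTime()
  val n = joined.count()                         // this is the ~4.5 second step, even with 1 row
  println(s"count=$n in ${(System.nanoTime() - start) / 1e6} ms")
}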
>>
>> I have tried tuning the number of Kafka partitions - increasing this
>> seems to increase the concurrency and ultimately the number of things
>> processed per minute, but to get something half decent, I'm going to
>> need to run with 1024 or more partitions. Is 1024 partitions a
>> reasonable number? What do you use in your environments?
>>
>> I've tried different options for batchDuration. The calculation seems
>> to be batchDuration * Kafka partitions for the number of items per
>> iteration, but this is always still extremely slow (many per
>> iteration vs. very few doesn't seem to really improve things). Can
>> you suggest a list of the Spark configuration parameters related to
>> speed that you think are key - preferably with the values you use for
>> those parameters?
>>
>> I'd really, really appreciate any help or suggestions, as I've been
>> working on this speed issue for 3 days without success and my head is
>> starting to hurt. Thanks in advance.
>>
>> Thanks,
>>
>> --
>> Malcolm Lockyer

--
Malcolm Lockyer
M: +64 21 258 6121
Level 10, 99 Queen Street, Auckland, New Zealand
hapara.com ● @hapara_team