Re: Spark + Kafka processing trouble

2016-05-31 Thread Malcolm Lockyer
Thanks for the suggestions. I agree that there is no magic configuration setting and that the SQL options are not flawed. I just intended to explain the frustration of having a non-trivial (but still simple) Spark Streaming job running on tiny amounts of data and performing absolutely horribly.

Re: Spark + Kafka processing trouble

2016-05-31 Thread Cody Koeninger
> 500ms is I believe the minimum batch interval for Spark micro batching.

It's better to test than to believe; I've run 250ms jobs. The same applies to the comments around JDBC: why assume when you could (dis)prove? It's not like it's a lot of effort to set up a minimal job that does

Re: Spark + Kafka processing trouble

2016-05-31 Thread Mich Talebzadeh
500ms is, I believe, the minimum batch interval for Spark micro-batching. However, a JDBC call uses a Unix file descriptor and incurs a context switch, and that does have performance implications. That is irrespective of Kafka; as it happens, one is actually going through Hive JDBC. It is a classic

Re: Spark + Kafka processing trouble

2016-05-31 Thread Cody Koeninger
There isn't a magic Spark configuration setting that would account for multiple-second-long fixed overheads; you should be looking at maybe 200ms minimum for a streaming batch. 1024 Kafka topic partitions is not reasonable for the volume you're talking about. Unless you have really extreme

Re: Spark + Kafka processing trouble

2016-05-31 Thread Alonso Isidoro Roman
Mich's idea is quite good; if I were you, I would follow it.

Re: Spark + Kafka processing trouble

2016-05-30 Thread Mich Talebzadeh
How are you getting your data from the database? Are you using JDBC? Can you call the database first (assuming the same data), put the result in a temp table in Spark, cache it for the duration of the window length, and use the data from the cached table?
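
A minimal sketch of the "query once, reuse for the window" idea suggested above, written in plain Python rather than Spark so it runs anywhere. Everything in it is an assumption for illustration: sqlite3 stands in for the real database, and the class name WindowCachedLookup, the users table, and the 60-second window are hypothetical. In Spark, the analogous move would be to load the table once and cache it instead of issuing a query per record.

```python
import sqlite3
import time


class WindowCachedLookup:
    """Load a whole lookup table once and reuse it until the window expires."""

    def __init__(self, conn, query, window_seconds, clock=time.monotonic):
        self._conn = conn
        self._query = query            # must return (key, value) rows
        self._window = window_seconds  # how long a loaded copy stays valid
        self._clock = clock            # injectable so expiry can be tested
        self._loaded_at = None
        self._rows = {}
        self.db_calls = 0              # round-trip counter, illustration only

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._window:
            # One query for the whole table instead of one query per record.
            self._rows = dict(self._conn.execute(self._query).fetchall())
            self._loaded_at = now
            self.db_calls += 1

    def get(self, key, default=None):
        self._refresh_if_stale()
        return self._rows.get(key, default)


def demo():
    # In-memory stand-in for the real database (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

    lookup = WindowCachedLookup(conn, "SELECT id, name FROM users", window_seconds=60)
    events = [1, 2, 1, 1, 2, 3]  # keys arriving on the stream
    names = [lookup.get(e, "unknown") for e in events]
    return names, lookup.db_calls
```

Calling demo() resolves six events with a single database round trip; the per-record alternative would have made six. Whether this pays off depends on how stale the lookup data is allowed to be within a window.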

Re: Spark + Kafka processing trouble

2016-05-30 Thread Malcolm Lockyer
On Tue, May 31, 2016 at 3:14 PM, Darren Govoni wrote:
> Well, that could be the problem. A SQL database is essentially a big
> synchronizer. If you have a lot of Spark tasks all bottlenecking on a single
> database socket (is the database clustered or colocated with spark

Re: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni

Re: Spark + Kafka processing trouble

2016-05-30 Thread Malcolm Lockyer
On Tue, May 31, 2016 at 1:56 PM, Darren Govoni wrote:
> So you are calling a SQL query (to a single database) within a spark
> operation distributed across your workers?

Yes, but currently with very small sets of data (1-10,000) and on a single (dev) machine.

RE: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni

Spark + Kafka processing trouble

2016-05-30 Thread Malcolm Lockyer
Hopefully this is not off-topic for this list, but I am hoping to reach some people who have used Kafka + Spark before. We are new to Spark and are setting up our first production environment and hitting a speed issue that may be configuration-related - and we have little experience in configuring