Re: Spark + Kafka processing trouble

2016-05-31 Thread Malcolm Lockyer
Thanks for the suggestions. I agree that there is no magic configuration setting and that the SQL options are not flawed. I just intended to explain the frustration of having a non-trivial (but still simple) Spark Streaming job running on tiny amounts of data performing absolutely horribly.

Re: Spark + Kafka processing trouble

2016-05-31 Thread Cody Koeninger
> 500ms is I believe the minimum batch interval for Spark micro batching.

It's better to test than to believe; I've run 250ms jobs. The same applies to the comments around JDBC: why assume when you could (dis)prove? It's not like it's a lot of effort to set up a minimal job that does foreach(printl

Re: Spark + Kafka processing trouble

2016-05-31 Thread Mich Talebzadeh
500ms is, I believe, the minimum batch interval for Spark micro-batching. However, a JDBC call uses a Unix file descriptor and a context switch, and that does have a performance implication. That is irrespective of Kafka; as it happens, one is actually going through Hive JDBC. It is a classic data

Re: Spark + Kafka processing trouble

2016-05-31 Thread Cody Koeninger
There isn't a magic Spark configuration setting that would account for multiple-second-long fixed overheads; you should be looking at maybe 200ms minimum for a streaming batch. 1024 Kafka topicpartitions is not reasonable for the volume you're talking about. Unless you have really extreme workloa

Re: Spark + Kafka processing trouble

2016-05-31 Thread Alonso Isidoro Roman
Mich's idea is quite good; if I were you, I would follow it.

2016-05-31 6:37 GMT+02:00 Mich Taleb

Re: Spark + Kafka processing trouble

2016-05-30 Thread Mich Talebzadeh
How are you getting your data from the database? Are you using JDBC? Can you call the database first (assuming the same data), put the result in a temp table in Spark, cache it for the duration of the window length, and use the data from the cached table?
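The caching idea suggested here can be sketched outside Spark. The following is a minimal plain-Python illustration, not the poster's code and not a Spark API: it loads a reference table once, keeps it in memory, and goes back to the database only after a time-to-live expires (standing in for the window length). The `customers` table, the `CachedLookup` class, and the use of an in-memory `sqlite3` database in place of a JDBC source are all assumptions made for the example.

```python
import sqlite3
import time


class CachedLookup:
    """Keeps a reference table in memory and reloads it only after ttl_seconds.

    Hypothetical helper for illustration; the table and column names are assumed.
    """

    def __init__(self, conn, ttl_seconds, clock=time.monotonic):
        self._conn = conn
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry can be tested
        self._loaded_at = None
        self._rows = {}
        self.loads = 0               # how many times the database was queried

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            cur = self._conn.execute("SELECT id, name FROM customers")
            self._rows = dict(cur.fetchall())
            self._loaded_at = now
            self.loads += 1

    def get(self, key):
        self._refresh_if_stale()
        return self._rows.get(key)


def make_demo_db():
    """Builds a tiny in-memory database that stands in for the real one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "alice"), (2, "bob")]
    )
    return conn


def enrich(events, lookup):
    """Pairs each incoming customer id with its cached name (None if unknown)."""
    return [(customer_id, lookup.get(customer_id)) for customer_id in events]
```

With a 60-second time-to-live, any number of `enrich` calls inside that minute touch the database once. Whether this is appropriate depends on how stale the reference data is allowed to be, which the thread does not state.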

Re: Spark + Kafka processing trouble

2016-05-30 Thread Malcolm Lockyer
On Tue, May 31, 2016 at 3:14 PM, Darren Govoni wrote:
> Well that could be the problem. A SQL database is essentially a big synchronizer. If you have a lot of Spark tasks all bottlenecking on a single database socket (is the database clustered or colocated with Spark workers?) then you will ha

Re: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni
Original message
From: Malcolm Lockyer
Date: 05/30/2016 10:40 PM (GMT-05:00)
To: user@spark.apache.org
Subject: Re: Spark + Kafka processing trouble

On Tue, May 31, 2016 at 1:56 PM, Darren Govoni wrote:
> So you are calling a

Re: Spark + Kafka processing trouble

2016-05-30 Thread Malcolm Lockyer
On Tue, May 31, 2016 at 1:56 PM, Darren Govoni wrote:
> So you are calling a SQL query (to a single database) within a Spark operation distributed across your workers?

Yes, but with very small sets of data (1-10,000) and on a single (dev) machine right now. (sorry didn't reply to
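One way to reason about the cost of a SQL query issued from inside a per-record operation is to count round trips. The sketch below is plain Python with an in-memory `sqlite3` database standing in for the real one. It is not the poster's job and uses no Spark API; the `customers` table and the helper names are assumptions for the example. It compares one query per record with a single query per batch.

```python
import sqlite3


class CountingConnection:
    """Wraps a sqlite3 connection and counts the statements sent through it."""

    def __init__(self, conn):
        self._conn = conn
        self.queries = 0

    def execute(self, sql, params=()):
        self.queries += 1
        return self._conn.execute(sql, params)


def make_db(n_rows):
    """Builds an in-memory table with ids 1..n_rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, f"name-{i}") for i in range(1, n_rows + 1)],
    )
    return conn


def lookup_per_record(db, ids):
    """Issues one SELECT per id: N records cost N round trips."""
    result = {}
    for i in ids:
        row = db.execute(
            "SELECT name FROM customers WHERE id = ?", (i,)
        ).fetchone()
        result[i] = row[0] if row else None
    return result


def lookup_batched(db, ids):
    """Issues a single SELECT ... IN (...) for the whole batch.

    Note: SQLite caps the number of bound parameters per statement, so very
    large batches would need to be split into chunks.
    """
    unique_ids = sorted(set(ids))
    if not unique_ids:
        return {}
    placeholders = ",".join("?" for _ in unique_ids)
    rows = db.execute(
        f"SELECT id, name FROM customers WHERE id IN ({placeholders})",
        unique_ids,
    ).fetchall()
    found = dict(rows)
    return {i: found.get(i) for i in ids}
```

Both functions return the same mapping; only the number of statements differs. Against a networked database each statement also pays connection and latency costs that an in-memory stand-in does not show, so real timings would need to be measured on the actual setup.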

RE: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni
So you are calling a SQL query (to a single database) within a Spark operation distributed across your workers?

Original message
From: Malcolm Lockyer
Date: 05/30/2016 9:45 PM (GMT-05:00)
To: user@spark.apache.org
Su