Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
Yes, Tathagata, thank you. For #1, the 'need detection', one idea we're entertaining is timestamping the messages coming into the Kafka topics. The consumers would check the interval between the time they get the message and that message origination timestamp. As Kafka topics start to fill up more
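A minimal sketch of the timestamp-based "need detection" idea described above, assuming each message carries its origination time as epoch seconds. The `LagMonitor` class, its `threshold_s` and `window` parameters, and the rule of averaging over a sliding window are all hypothetical choices for illustration. None of this is part of Spark or Kafka, and it is not necessarily how the poster implemented it.

```python
import time
from collections import deque


class LagMonitor:
    """Tracks consumer lag: receive time minus message origination time."""

    def __init__(self, threshold_s, window):
        self.threshold_s = threshold_s      # mean lag (seconds) that counts as "falling behind"
        self.lags = deque(maxlen=window)    # sliding window of the most recent lags

    def record(self, origin_ts, received_ts=None):
        """Record one message; returns its lag in seconds (clamped at 0 for clock skew)."""
        if received_ts is None:
            received_ts = time.time()
        lag = max(0.0, received_ts - origin_ts)
        self.lags.append(lag)
        return lag

    def needs_more_resources(self):
        """True only when the window is full and the mean lag exceeds the threshold."""
        if len(self.lags) < self.lags.maxlen:
            return False
        return sum(self.lags) / len(self.lags) > self.threshold_s
```

Averaging over a window rather than reacting to a single late message is one way to avoid scaling up on a brief spike. Producer and consumer clocks must be reasonably synchronized for the lag values to mean anything.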

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Tathagata Das
Let me try to add some clarity to the different lines of thought going on in this thread. 1. HOW TO DETECT THE NEED FOR MORE CLUSTER RESOURCES? If there are no rate limits set up, the most reliable way to detect whether the current Spark cluster is insufficient to handle the data

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Cody Koeninger
Depends on what you're reusing multiple times (if anything). Read http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence On Wed, Jun 10, 2015 at 12:18 AM, Dmitry Goldenberg < dgoldenberg...@gmail.com> wrote: > At which point would I call cache()? I just want the runtime to s

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
o:* Evo Eftimov > *Cc:* Cody Koeninger; Andrew Or; Gerard Maas; spark users > *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic > sizes/rate of growth in Kafka or Spark's metrics? > > > > Evo, > > > > One of the ideas is to shadow the current clus

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-09 Thread Dmitry Goldenberg
At which point would I call cache()? I just want the runtime to spill to disk when necessary without me having to know when the "necessary" is. On Thu, Jun 4, 2015 at 9:42 AM, Cody Koeninger wrote: > direct stream isn't a receiver, it isn't required to cache data anywhere > unless you want it

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Cody Koeninger
direct stream isn't a receiver, it isn't required to cache data anywhere unless you want it to. If you want it, just call cache. On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg wrote: > "set the storage policy for the DStream RDDs to MEMORY AND DISK" - it > appears the storage level can be sp

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Dmitry Goldenberg
"set the storage policy for the DStream RDDs to MEMORY AND DISK" - it appears the storage level can be specified in the createStream methods but not createDirectStream... On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov wrote: > You can also try Dynamic Resource Allocation > > > > > https://spark.a

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
June 3, 2015 4:46 PM > *To:* Evo Eftimov > *Cc:* Cody Koeninger; Andrew Or; Gerard Maas; spark users > *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic > sizes/rate of growth in Kafka or Spark's metrics? > > > > Evo, > > > > One of the ideas i

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
vo Eftimov > *Cc:* Cody Koeninger; Andrew Or; Gerard Maas; spark users > *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic > sizes/rate of growth in Kafka or Spark's metrics? > > > > Evo, > > > > One of the ideas is to shadow the current cluster. This

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
ldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday, June 3, 2015 4:46 PM > *To:* Evo Eftimov > *Cc:* Cody Koeninger; Andrew Or; Gerard Maas; spark users > *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic > sizes/rate of growth in Kafka or Spark's metrics?

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Evo Eftimov
more From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] Sent: Wednesday, June 3, 2015 4:46 PM To: Evo Eftimov Cc: Cody Koeninger; Andrew Or; Gerard Maas; spark users Subject: Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
er cluster > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday, June 3, 2015 4:14 PM > *To:* Cody Koeninger > *Cc:* Andrew Or; Evo Eftimov; Gerard Maas; spark users > *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic > sizes/rate

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Evo Eftimov
Maas; spark users Subject: Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics? Would it be possible to implement Spark autoscaling somewhat along these lines? -- 1. If we sense that a new machine is needed, by watching the data lo

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
>>> Until there is free RAM, spark streaming (spark) will NOT resort to disk – >>> and of course resorting to disk from time to time (ie when there is no free >>> RAM ) and taking a performance hit from that, BUT only until there is no >>> free RAM >>> >>> >>>

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
isk from time to time (ie when there is no free >>> RAM ) and taking a performance hit from that, BUT only until there is no >>> free RAM >>> >>> >>> >>> *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] >>> *Sent:* Thursday, May

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Cody Koeninger
time (ie when there is no free >> RAM ) and taking a performance hit from that, BUT only until there is no >> free RAM >> >> >> >> *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] >> *Sent:* Thursday, May 28, 2015 2:34 PM >> *To:*

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
>> free RAM >> >> >> >> *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] >> *Sent:* Thursday, May 28, 2015 2:34 PM >> *To:* Evo Eftimov >> *Cc:* Gerard Maas; spark users >> *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Andrew Or
y until there is no > free RAM > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Thursday, May 28, 2015 2:34 PM > *To:* Evo Eftimov > *Cc:* Gerard Maas; spark users > *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic > sizes/r

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
sizes/rate of growth in Kafka or Spark's metrics? Evo, good points. On the dynamic resource allocation, I'm surmising this only works within a particular cluster setup. So it improves the usage of current cluster resources but it doesn't make the cluster itself elastic. At l

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Evo, good points. On the dynamic resource allocation, I'm surmising this only works within a particular cluster setup. So it improves the usage of current cluster resources but it doesn't make the cluster itself elastic. At least, that's my understanding. Memory + disk would be good and hopefull

FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
You can also try Dynamic Resource Allocation https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation Also re the Feedback Loop for automatic message consumption rate adjustment – there is a “dumb” solution option – simply set the storage policy for the DStrea