At what point would I call cache()? I just want the runtime to spill to disk when necessary, without me having to know when "necessary" happens.
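For concreteness, a minimal Scala sketch of the kind of call discussed in the replies below, assuming the Kafka direct stream API from the spark-streaming-kafka module; the broker address, topic name, batch interval, and processing body are placeholders, not details from this thread:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamSpillSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream-spill"), Seconds(10))

        // Placeholder broker list and topic name.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("events"))

        // createDirectStream takes no storage level argument; calling persist() on the
        // DStream sets the level used if and when its RDDs are cached, so blocks can
        // spill to disk instead of being dropped when memory runs out.
        stream.persist(StorageLevel.MEMORY_AND_DISK)

        stream.foreachRDD { rdd => println(s"batch size: ${rdd.count()}") }

        ssc.start()
        ssc.awaitTermination()
      }
    }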
On Thu, Jun 4, 2015 at 9:42 AM, Cody Koeninger <c...@koeninger.org> wrote:

> The direct stream isn't a receiver; it isn't required to cache data anywhere unless you want it to.
>
> If you want that, just call cache.
>
> On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>
>> "set the storage policy for the DStream RDDs to MEMORY AND DISK" - it appears the storage level can be specified in the createStream methods but not in createDirectStream...
>>
>> On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>
>>> You can also try Dynamic Resource Allocation:
>>> https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation
>>>
>>> Also, re the feedback loop for automatic message consumption rate adjustment - there is a "dumb" solution option: simply set the storage policy for the DStream RDDs to MEMORY AND DISK. When memory gets exhausted, Spark Streaming will resort to keeping new RDDs on disk, which will prevent it from crashing and hence losing them. Then some memory will get freed and it will switch back to RAM, and so on and so forth.
>>>
>>> Sent from Samsung Mobile
>>>
>>> -------- Original message --------
>>> From: Evo Eftimov
>>> Date: 2015/05/28 13:22 (GMT+00:00)
>>> To: Dmitry Goldenberg
>>> Cc: Gerard Maas, spark users
>>> Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?
>>>
>>> You can always spin up new boxes in the background, bring them into the cluster fold when they are fully operational, and time that with the job relaunch and parameter change.
>>>
>>> Kafka offsets are managed automatically for you by the Kafka clients, which keep them in ZooKeeper - don't worry about that as long as you shut down your job gracefully. Besides, managing the offsets explicitly is not a big deal if necessary.
>>>
>>> Sent from Samsung Mobile
>>>
>>> -------- Original message --------
>>> From: Dmitry Goldenberg
>>> Date: 2015/05/28 13:16 (GMT+00:00)
>>> To: Evo Eftimov
>>> Cc: Gerard Maas, spark users
>>> Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?
>>>
>>> Thanks, Evo. Per the last part of your comment, it sounds like we will need to implement a job manager which will be in control of starting the jobs, monitoring the status of the Kafka topic(s), shutting jobs down and marking them as ones to relaunch, scaling the cluster up/down by adding/removing machines, and relaunching the 'suspended' (shut down) jobs.
>>>
>>> I suspect that relaunching the jobs may be tricky, since it means keeping track of the starting offsets in the Kafka topic(s) from which the jobs began working.
>>>
>>> Ideally, we'd want to avoid a relaunch. The 'suspension' and relaunching of jobs, coupled with the wait for the new machines to come online, may turn out to be quite time-consuming, which will make for lengthy request times, and our requests are not asynchronous. Ideally, the currently running jobs would continue to run on the machines currently available in the cluster.
>>>
>>> In the scale-down case, the job manager would want to signal to Spark's job scheduler not to send work to the node being taken out, find out when the last job has finished running on that node, and then take the node out.
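A hedged sketch of the two bookkeeping pieces mentioned above: recording per-partition offsets from a direct stream so a relaunched job could resume from them, and stopping the context gracefully. The saveOffsets stub is hypothetical glue code; only HasOffsetRanges, OffsetRange, and StreamingContext.stop are actual Spark APIs here.

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    object RelaunchBookkeepingSketch {
      // Placeholder: a real job manager would write these somewhere durable
      // (ZooKeeper, a database, etc.) so a relaunched job can resume from them.
      def saveOffsets(ranges: Array[OffsetRange]): Unit =
        ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))

      // Must be called on the DStream returned directly by createDirectStream,
      // whose RDDs carry their Kafka offset ranges.
      def recordOffsets(stream: DStream[(String, String)]): Unit =
        stream.foreachRDD { rdd =>
          saveOffsets(rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
        }

      // Graceful stop: finish the batches already received before shutting down.
      def shutDownGracefully(ssc: StreamingContext): Unit =
        ssc.stop(stopSparkContext = true, stopGracefully = true)
    }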
>>> Scaling the cluster up and down like this is somewhat like changing the number of cylinders in a car engine while the car is running...
>>>
>>> Sounds like a great candidate for a set of enhancements in Spark...
>>>
>>> On Thu, May 28, 2015 at 7:52 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>>
>>> @DG: The key metrics should be:
>>>
>>> - Scheduling delay - its ideal state is to remain constant over time, and ideally it should be less than the duration of the micro-batch window.
>>> - The average job processing time should remain less than the micro-batch window.
>>> - Number of lost jobs - even a single lost job means that you have lost all messages for the DStream RDD processed by that job, due to the previously described Spark Streaming memory leak condition and subsequent crash - described in previous postings submitted by me.
>>>
>>> You can even go one step further and periodically issue a "get/check free memory" call to see whether it is decreasing relentlessly at a constant rate - if it touches a predetermined RAM threshold, that should be your third metric.
>>>
>>> Re the "back pressure" mechanism - this is a feedback loop mechanism, and you can implement one on your own without waiting for JIRAs and new features, whenever they might be implemented by the Spark dev team. Moreover, you can avoid using slow mechanisms such as ZooKeeper and even incorporate some machine learning in your feedback loop to make it handle the message consumption rate more intelligently and benefit from ongoing online learning. BUT this is still about voluntarily sacrificing performance in the name of keeping your system stable - it is not about scaling your system/solution.
>>>
>>> In terms of how to scale the Spark framework dynamically: even though this is not supported out of the box at the moment, I guess you can have a systems management framework dynamically spin up a few more boxes (Spark worker nodes), stop your currently running Spark Streaming job, and relaunch it with new params, e.g. more receivers, a larger number of partitions (hence tasks), more RAM per executor, etc. Obviously this will cause some temporary delay - in fact an interruption - in your processing, but if the business use case can tolerate that, then go for it.
>>>
>>> *From:* Gerard Maas [mailto:gerard.m...@gmail.com]
>>> *Sent:* Thursday, May 28, 2015 12:36 PM
>>> *To:* dgoldenberg
>>> *Cc:* spark users
>>> *Subject:* Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?
>>>
>>> Hi,
>>>
>>> tl;dr At the moment (with a BIG disclaimer *) elastic scaling of Spark Streaming processes is not supported.
>>>
>>> *Longer version.*
>>>
>>> I assume that you are talking about Spark Streaming, as the discussion is about handling Kafka streaming data.
>>>
>>> Then you have two things to consider: the streaming receivers and the Spark processing cluster.
>>>
>>> Currently, the receiving topology is static. One receiver is allocated for each DStream instantiated, and it will use 1 core in the cluster. Once the StreamingContext is started, this topology cannot be changed; therefore the number of Kafka receivers is fixed for the lifetime of your DStream.
>>>
>>> What we do is calculate the cluster capacity and use that as a fixed upper bound (with a margin) for the receiver throughput.
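A sketch of how the scheduling-delay and processing-time metrics described above can be watched from inside the application via Spark Streaming's StreamingListener; the threshold check and the reaction to it are placeholders for whatever a real feedback loop would do.

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Flags the conditions described above: scheduling delay growing past the
    // micro-batch window, or per-batch processing time exceeding the window.
    class BatchHealthListener(batchWindowMs: Long) extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        val schedulingDelay = info.schedulingDelay.getOrElse(0L)
        val processingTime = info.processingDelay.getOrElse(0L)
        if (schedulingDelay > batchWindowMs || processingTime > batchWindowMs) {
          // Placeholder reaction: a real feedback loop could notify a job manager,
          // throttle the consumption rate, or trigger a scale-up here.
          println(s"Batch falling behind: schedulingDelay=${schedulingDelay}ms, " +
            s"processingTime=${processingTime}ms, window=${batchWindowMs}ms")
        }
      }
    }

    // Registered on the streaming context, e.g.:
    //   ssc.addStreamingListener(new BatchHealthListener(batchWindowMs = 10000L))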
>>> There's work in progress to add a reactive model to the receiver, where backpressure can be applied to handle overload conditions. See https://issues.apache.org/jira/browse/SPARK-7398
>>>
>>> Once the data is received, it will be processed in a 'classical' Spark pipeline, so previous posts on Spark resource scheduling might apply.
>>>
>>> Regarding metrics, the standard metrics subsystem of Spark will report streaming job performance. Check the driver's metrics endpoint to peruse the available metrics:
>>>
>>> <driver>:<ui-port>/metrics/json
>>>
>>> -kr, Gerard.
>>>
>>> (*) Spark is a project that moves so fast that statements might be invalidated by new work every minute.
>>>
>>> On Thu, May 28, 2015 at 1:21 AM, dgoldenberg <dgoldenberg...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to understand whether there are design patterns for autoscaling Spark (adding/removing slave machines to/from the cluster) based on throughput.
>>>
>>> Assuming we can throttle Spark consumers, the respective Kafka topics we stream data from would start growing. What are some of the ways to generate metrics on the number of new messages and the rate at which they are piling up? This is perhaps more of a Kafka question; I see a pretty sparse javadoc for the Metric interface and not much else...
>>>
>>> What are some of the ways to expand/contract the Spark cluster? Someone has mentioned Mesos...
>>>
>>> I see some info on Spark metrics in the Spark monitoring guide <https://spark.apache.org/docs/latest/monitoring.html>. Do we perhaps want to implement a custom sink that would help us autoscale up or down based on the throughput?
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tp23062.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
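A minimal sketch of polling the driver metrics endpoint Gerard mentions above from an external scaling controller; the host name is a placeholder and 4040 is only the default driver UI port, which may differ per deployment.

    import scala.io.Source

    // Fetches the JSON exposed by the driver's metrics servlet.
    object MetricsEndpointPoller {
      def fetchMetricsJson(driverHost: String, uiPort: Int = 4040): String = {
        val source = Source.fromURL(s"http://$driverHost:$uiPort/metrics/json")
        try source.mkString finally source.close()
      }

      def main(args: Array[String]): Unit = {
        // An external scaling controller could parse this JSON periodically and
        // compare streaming throughput/delay values against its own thresholds.
        println(fetchMetricsJson("localhost"))
      }
    }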