This is really outside the scope of Hive and would probably be better addressed by the Spark community; however, I can say that this very much depends on your use case.
Take a look at this discussion if you haven't already: https://groups.google.com/forum/embed/#!topic/spark-users/GQoxJHAAtX4

Generally speaking, the larger the batch window, the better the overall performance, but the streaming output will be updated less frequently. You will likely run into problems setting your batch window < 0.5 sec, and/or when the batch window is shorter than the amount of time it takes to run the task.

Beyond that, the window length and sliding interval need to be multiples of the batch window, but will depend entirely on your reporting requirements. It would be perfectly reasonable to have:

batch window = 30 secs
window length = 1 hour
sliding interval = 5 mins

In that case, you'd be creating an output every 5 minutes, aggregating the data that you were collecting every 30 seconds over the previous 1-hour period.

Could you set the batch window to 5 minutes? Possibly, depending on the data source, but perhaps you are already using that source on a more frequent basis elsewhere, or maybe you only have a 1-minute buffer on the source data. There are lots of possibilities, which is why there is this flexibility and no hard-and-fast rule.

If you were trying to create continuously streaming output as fast as possible, then you would almost always set your sliding interval = batch window and then shrink the batch window as short as possible.

More documentation here: https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Thursday, May 05, 2016 4:26 AM
To: user
Subject: Re: Spark Streaming, Batch interval, Windows length and Sliding Interval settings

Any ideas/experience on this?
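The multiples constraint above (window length and sliding interval must each be a multiple of the batch window) can be checked with a bit of plain Scala. This is only an illustrative sketch using the 30 sec / 1 hour / 5 min example from the text; `isMultiple` and `validateWindow` are hypothetical helpers, not Spark APIs:

```scala
// All durations in seconds; helper names are illustrative, not Spark APIs.
def isMultiple(value: Long, base: Long): Boolean =
  base > 0 && value % base == 0

// Both the window length and the sliding interval must be
// integer multiples of the batch interval.
def validateWindow(batch: Long, window: Long, slide: Long): Boolean =
  isMultiple(window, batch) && isMultiple(slide, batch)

val batch  = 30L        // 30-second batch window
val window = 60L * 60L  // 1-hour window length
val slide  = 5L * 60L   // 5-minute sliding interval

println(validateWindow(batch, window, slide)) // prints "true": 3600 and 300 are both multiples of 30
```

With that configuration you get one aggregated output every 5 minutes, each covering the 120 batches collected over the preceding hour.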
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 4 May 2016 at 21:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

Just wanted opinions on this. In Spark Streaming, the parameter

val ssc = new StreamingContext(sparkConf, Seconds(n))

defines the batch or sample interval n for the incoming streams.

In addition there is the window length:

// window length - the duration of the window; must be a multiple of
// the batch interval n in StreamingContext(sparkConf, Seconds(n))
val windowLength = L

And finally the sliding interval:

// sliding interval - the interval at which the window operation is performed
val slidingInterval = I

OK, so the window length L must be a multiple of n, and the sliding interval has to be consistent with it to ensure that we can see the head and tail of the window.

As a heuristic approach, for a batch interval of say 10 seconds, I put the window length at 3 times that = 30 seconds and make the sliding interval = batch interval = 10 seconds. Obviously these are subjective, depending on what is being measured. However, I believe having sliding interval = batch interval makes sense?

Appreciate any views on this.

Thanks,

Dr Mich Talebzadeh
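The arithmetic behind that heuristic can be made explicit. A minimal plain-Scala sketch (illustrative values only, not Spark API) for batch = 10 s, window = 3 × batch, slide = batch:

```scala
// Illustrative arithmetic for the heuristic above (not Spark API):
// batch = 10 s, window = 3 * batch = 30 s, slide = batch = 10 s.
val batch  = 10L
val window = 3L * batch
val slide  = batch

// Each window operation aggregates this many batches...
val batchesPerWindow = window / batch // 3
// ...and a new windowed result is emitted once per sliding interval.
val outputsPerMinute = 60L / slide    // 6
```

With slide = batch, a fresh windowed result appears after every batch, which is why that setting is the natural choice when you want output as continuously as possible.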