Window Size
Hi,

The window size in Spark Streaming is time based, which means each window can contain a different number of elements. For example, suppose you have two (or more) related streams that you want to compare over a specific time interval; I am not clear how this would work. Even though the streams start running simultaneously, they may have a different number of elements in each time interval.

Below is the output for two streams that have the same total number of elements and ran simultaneously. Each line has the format (n, (mean, variance, SD)), where the left-most value n is the number of elements in that window. If we sum the element counts, the totals are the same for both streams, but we cannot compare the streams directly because they differ in window sizes and in the number of windows. Can we somehow make windows based on real-time values for both streams? Or can we make windows based on the number of elements?

Stream 1
(7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
(44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
(245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
(154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
(156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
(156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
(156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))

Stream 2
(10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
(180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
(179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
(179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
(269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
(101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))

Regards,
Laeeq
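As far as I know, the DStream API only supports time-based windows, not count-based ones, so the closest workaround is to align both streams on identical time windows. A minimal sketch of that idea follows: apply window() with the same window and slide duration to both DStreams, so that window N of each stream covers the same wall-clock interval even if the element counts differ. The hosts, ports, and the simple mean printout are illustrative placeholders, not the actual job from the question.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object AlignedWindows {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("AlignedWindows")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Placeholder sources standing in for the two real input streams.
        val stream1 = ssc.socketTextStream("host1", 9999).map(_.toDouble)
        val stream2 = ssc.socketTextStream("host2", 9999).map(_.toDouble)

        // Same window and slide duration on both streams, so window N of
        // stream1 covers the same time interval as window N of stream2.
        val w1 = stream1.window(Seconds(60), Seconds(60))
        val w2 = stream2.window(Seconds(60), Seconds(60))

        w1.foreachRDD { rdd =>
          val n = rdd.count()
          if (n > 0) println(s"stream1 window: n=$n mean=${rdd.mean()}")
        }
        w2.foreachRDD { rdd =>
          val n = rdd.count()
          if (n > 0) println(s"stream2 window: n=$n mean=${rdd.mean()}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }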
spark streaming window operations on a large window size
Hi guys,

Does Spark Streaming support window operations on a sliding window whose data is larger than the available memory? Currently we are using Kafka as input, but we could change that if needed.

Thanks,
Avi
RE: spark streaming window operations on a large window size
I don't think Spark Streaming currently supports window operations that exceed the available memory. Internally, Spark Streaming keeps all the data belonging to the effective window in memory; if memory is not enough, the BlockManager will discard blocks according to an LRU policy, so unexpected behavior can occur.

Thanks,
Jerry
Re: spark streaming window operations on a large window size
The default persistence level is MEMORY_AND_DISK, so the LRU policy will spill blocks to disk rather than drop them, and the streaming app will not fail. However, since data will constantly be read in and out of disk as windows are processed, performance won't be great. So it is best to have sufficient memory to keep all the window data in memory.

TD
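For reference, a minimal sketch (not from TD's reply) of how a persistence level can be set explicitly on a windowed DStream. The socket source and durations are placeholders, and MEMORY_AND_DISK_SER is only one option, trading extra CPU for lower memory pressure via serialized storage:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowPersistence {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WindowPersistence")
        val ssc = new StreamingContext(conf, Seconds(10))

        // 'lines' stands in for the Kafka input used in the original question.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Windowed data defaults to MEMORY_AND_DISK; a serialized level can
        // shrink the in-memory footprint at the cost of (de)serialization.
        val windowed = lines.window(Seconds(600), Seconds(60))
        windowed.persist(StorageLevel.MEMORY_AND_DISK_SER)

        windowed.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }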
Re: spark streaming window operations on a large window size
@Tathagata Das so basically you are saying it is supported out of the box, but we should expect a significant performance hit - is that right?
Re: spark streaming window operations on a large window size
Yes.
Re: spark streaming window operations on a large window size
OK - thanks a lot
Checkpointing for reduceByKeyAndWindow with a window size of 1 hour and 24 hours
Hi,

What happens if I don't specify checkpointing on a DStream that uses reduceByKeyAndWindow with no inverse function? Would it cause memory to overflow? My window sizes are 1 hour and 24 hours. I cannot provide an inverse function for this because the aggregation is based on HyperLogLog. My code looks something like the following:

    val logsByPubGeo = messages.map(_._2).filter(_.geo != Constants.UnknownGeo).map { log =>
      // Key each log line by (publisher, geo) and build a per-record aggregate.
      val key = PublisherGeoKey(log.publisher, log.geo)
      val agg = AggregationLog(
        timestamp = log.timestamp,
        sumBids = log.bid,
        imps = 1,
        uniquesHll = hyperLogLog(log.cookie.getBytes(Charsets.UTF_8))
      )
      (key, agg)
    }

    val aggLogs = logsByPubGeo.reduceByKeyAndWindow(reduceAggregationLogs, BatchDuration)

    private def reduceAggregationLogs(aggLog1: AggregationLog, aggLog2: AggregationLog) = {
      // Merge two aggregates; the HLL union is not invertible.
      aggLog1.copy(
        timestamp = math.min(aggLog1.timestamp, aggLog2.timestamp),
        sumBids = aggLog1.sumBids + aggLog2.sumBids,
        imps = aggLog1.imps + aggLog2.imps,
        uniquesHll = aggLog1.uniquesHll + aggLog2.uniquesHll
      )
    }

Please let me know. Thanks!
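For reference, a minimal sketch of wiring checkpointing up next to two reduceByKeyAndWindow calls that have no inverse function. The checkpoint path, host/port, and the simple (word, count) pairs are placeholders standing in for the original (PublisherGeoKey, AggregationLog) stream; whether checkpointing is strictly required in the no-inverse-function case is exactly the open question above.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    object WindowCheckpointSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WindowCheckpointSketch")
        val ssc = new StreamingContext(conf, Seconds(30))

        // Checkpoint directory for the streaming context; path is a placeholder.
        ssc.checkpoint("hdfs:///checkpoints/window-sketch")

        val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1L))

        // No inverse function: each window is recomputed from the batches it
        // covers, so batch data must be retained for the full window length.
        val hourly = pairs.reduceByKeyAndWindow(_ + _, Minutes(60), Minutes(5))
        val daily  = pairs.reduceByKeyAndWindow(_ + _, Minutes(24 * 60), Minutes(60))

        hourly.print()
        daily.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }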