Window Size

2014-07-01 Thread Laeeq Ahmed
Hi,

The window size in Spark Streaming is time based, which means each window can contain a different number of elements. For example, suppose you have two streams (possibly more) that are related to each other and you want to compare them over a specific time interval. I am not clear how this will work: although the streams start running simultaneously, they may have different numbers of elements in each time interval.

The following is the output for two streams that have the same total number of elements and ran simultaneously. The left-most value is the number of elements in each window. If we add up the element counts, the totals are the same for both streams, but we cannot compare the streams directly because they differ in window size and number of windows.

Can we somehow make windows based on real-time values for both streams? Or can we make windows based on the number of elements?

(n, (mean, variance, SD))

Stream 1

(7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
(44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
(245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
(154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
(156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
(156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
(156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))


Stream 2

(10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
(180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
(179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
(179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
(269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
(101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))


Regards,
Laeeq
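One workaround is to window by element count instead of by time, so both streams always produce windows with identical counts. Spark Streaming has no built-in count-based window, so this is only a sketch of the idea in plain Python (not a Spark API; `count_windows` is a hypothetical helper), producing stats in the same (n, (mean, variance, SD)) format as the output below:

```python
import math
from itertools import islice

def count_windows(stream, size):
    """Yield (n, (mean, variance, sd)) for consecutive fixed-count windows."""
    it = iter(stream)
    while True:
        window = list(islice(it, size))  # take exactly `size` elements per window
        if not window:
            break
        n = len(window)
        mean = sum(window) / n
        variance = sum((x - mean) ** 2 for x in window) / n
        yield (n, (mean, variance, math.sqrt(variance)))

# Two streams aligned by element count: every window holds exactly 4 elements,
# so window k of stream 1 is directly comparable to window k of stream 2.
s1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
s2 = [2.0, 2.0, 4.0, 4.0, 6.0, 6.0, 8.0, 8.0]
for w1, w2 in zip(count_windows(s1, 4), count_windows(s2, 4)):
    print(w1, w2)
```

With count-based windows the two streams always line up window-for-window, at the cost of windows no longer corresponding to fixed wall-clock intervals.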


spark streaming window operations on a large window size

2015-02-23 Thread avilevi3
Hi guys, 

Does Spark Streaming support window operations on a sliding window whose data is larger than the available memory?
Currently we are using Kafka as input, but we could change that if needed.

thanks
Avi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-window-operations-on-a-large-window-size-tp21764.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark streaming window operations on a large window size

2015-02-23 Thread Shao, Saisai
I don't think current Spark Streaming supports window operations that go beyond its available memory. Internally, Spark Streaming keeps all the data belonging to the effective window in memory; if memory is not enough, the BlockManager will discard blocks according to an LRU policy, so unexpected behavior may occur.

Thanks
Jerry




Re: spark streaming window operations on a large window size

2015-02-23 Thread Tathagata Das
The default persistence level is MEMORY_AND_DISK, so the LRU policy will spill blocks to disk rather than drop them, and the streaming app will not fail. However, since data will constantly be read in and out of disk as windows are processed, performance won't be great. So it is best to have sufficient memory to keep all the window data in memory.

TD
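Since spilling to disk hurts throughput, it helps to size executor memory against the window up front. A back-of-envelope sketch (the rate, window length, and record size below are assumptions for illustration, not measurements):

```python
def window_memory_bytes(records_per_sec, window_sec, bytes_per_record):
    """Rough upper bound on the memory held by one window's worth of blocks."""
    return records_per_sec * window_sec * bytes_per_record

# Assumed workload: 10,000 records/s, a 10-minute window, ~200 bytes/record.
needed = window_memory_bytes(10_000, 600, 200)
print(f"{needed / 1e9:.1f} GB")  # -> 1.2 GB
```

If the estimate exceeds the memory the executors can hold, the window will spill to disk under MEMORY_AND_DISK and throughput will degrade accordingly.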



Re: spark streaming window operations on a large window size

2015-02-23 Thread Avi Levi
@Tathagata Das so basically you are saying it is supported out of the box,
but we should expect a significant performance hit - is that right?





Re: spark streaming window operations on a large window size

2015-02-23 Thread Tathagata Das
Yes.



Re: spark streaming window operations on a large window size

2015-02-24 Thread Avi Levi
OK - thanks a lot



Checkpointing for reduceByKeyAndWindow with a window size of 1 hour and 24 hours

2017-05-30 Thread SRK
Hi,

What happens if I don't specify checkpointing on a DStream that has
reduceByKeyAndWindow with no inverse function? Would it cause memory to
overflow? My window sizes are 1 hour and 24 hours.
I cannot provide an inverse function for this as it is based on HyperLogLog.

My code looks something like the following:

  val logsByPubGeo = messages.map(_._2).filter(_.geo != Constants.UnknownGeo).map { log =>
    val key = PublisherGeoKey(log.publisher, log.geo)
    val agg = AggregationLog(
      timestamp = log.timestamp,
      sumBids = log.bid,
      imps = 1,
      uniquesHll = hyperLogLog(log.cookie.getBytes(Charsets.UTF_8))
    )
    (key, agg)
  }

  val aggLogs = logsByPubGeo.reduceByKeyAndWindow(reduceAggregationLogs, BatchDuration)

  private def reduceAggregationLogs(aggLog1: AggregationLog, aggLog2: AggregationLog) = {
    aggLog1.copy(
      timestamp = math.min(aggLog1.timestamp, aggLog2.timestamp),
      sumBids = aggLog1.sumBids + aggLog2.sumBids,
      imps = aggLog1.imps + aggLog2.imps,
      uniquesHll = aggLog1.uniquesHll + aggLog2.uniquesHll
    )
  }


Please let me know.

Thanks!
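For context on why the inverse function matters here: without one, reduceByKeyAndWindow must keep every batch still inside the window and re-reduce all of them on each slide, whereas an invertible reduce only touches the batch entering and the batch leaving. A plain-Python sketch of the two strategies (illustrative only, not Spark's implementation; integer sums stand in for the HLL merge, which is associative but has no inverse):

```python
from collections import deque

def sliding_reduce(batches, window, reduce_fn):
    """No inverse: re-reduce every buffered batch in the window on each slide."""
    buf = deque(maxlen=window)   # all batches in the window must stay buffered
    out = []
    for b in batches:
        buf.append(b)
        total = buf[0]
        for nxt in list(buf)[1:]:
            total = reduce_fn(total, nxt)
        out.append(total)
    return out

def sliding_reduce_inv(batches, window, add, subtract):
    """With an inverse: add the new batch, subtract the one that expired."""
    total, buf, out = 0, deque(), []  # 0 is the identity for the sum used here
    for b in batches:
        buf.append(b)
        total = add(total, b)
        if len(buf) > window:
            total = subtract(total, buf.popleft())  # needs an inverse; HLL has none
        out.append(total)
    return out

batches = [1, 2, 3, 4, 5]
print(sliding_reduce(batches, 3, lambda a, b: a + b))        # [1, 3, 6, 9, 12]
print(sliding_reduce_inv(batches, 3, lambda a, b: a + b,
                         lambda a, b: a - b))                # [1, 3, 6, 9, 12]
```

Because a HyperLogLog union cannot be subtracted, only the first strategy applies, which is why a 24-hour window has to keep its full history of batches reachable.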



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Checkpointing-fro-reduceByKeyAndWindow-with-a-window-size-of-1-hour-and-24-hours-tp28722.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
