Re: Spark Streaming question batch size

2014-07-01 Thread Yana Kadiyska
Are you saying that both streams come in at the same rate and you have
the same batch interval but the batch size ends up different? i.e. two
datapoints both arriving at X seconds after streaming starts end up in
two different batches? How do you define real time values for both
streams? I am trying to do something similar to you, I think -- but
I'm not clear on what your notion of time is.
My reading of your example above is that the streams just pump data in
at different rates -- the first one got 7462 points in the first batch
interval, whereas stream 2 saw 10493.

On Tue, Jul 1, 2014 at 5:34 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
 Hi,

 The window size in Spark Streaming is time based, which means each window
 can hold a different number of elements. Suppose you have two (or more)
 related streams and want to compare them over a specific time interval; I
 am not clear how that will work. Although the streams start running
 simultaneously, they might have a different number of elements in each time
 interval.

 The following is the output for two streams that have the same number of
 elements and ran simultaneously. The leftmost value is the number of
 elements in each window. If we add up the number of elements, the totals are
 the same for both streams, but we can't compare the streams window by window
 because they differ in window size and number of windows.

 Can we somehow make windows based on real time values for both streams? Or
 can we make windows based on the number of elements?

 (n, (mean, variance, SD))
 Stream 1
 (7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
 (44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
 (245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
 (154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
 (156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
 (156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
 (156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))

 Stream 2
 (10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
 (180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
 (179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
 (179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
 (269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
 (101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))

 Regards,
 Laeeq
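
As a quick check of the claim in the quoted message that both streams carry the same number of elements overall, the per-window counts can be summed. This is only a sketch in plain Python, not Spark code: the (n, mean) pairs are copied from the output above, and the helper names total_count and overall_mean are made up for illustration.

```python
# (n, mean) per window, copied from the output quoted in this thread.
stream1 = [
    (7462, 1.0535658165371238),
    (44826, 0.2546925855084064),
    (245466, 0.2857731601728941),
    (154852, 0.21907814309792514),
    (156345, 0.3075668844414613),
    (156603, 0.27785151491351234),
    (156047, 0.18130350363672296),
]
stream2 = [
    (10493, 0.5554953964547791),
    (180649, 0.21684831234050583),
    (179994, 0.22048869512317407),
    (179455, 0.20473330254938552),
    (269817, 0.16987953223480945),
    (101193, 0.21469292497504766),
]


def total_count(windows):
    """Sum the element counts over all windows of one stream."""
    return sum(n for n, _ in windows)


def overall_mean(windows):
    """Combine per-window means into one mean, weighting each by its count."""
    return sum(n * mean for n, mean in windows) / total_count(windows)


print(len(stream1), total_count(stream1))  # 7 windows, 921601 elements
print(len(stream2), total_count(stream2))  # 6 windows, 921601 elements
print(overall_mean(stream1), overall_mean(stream2))
```

The totals agree even though the window boundaries do not, so whole-stream aggregates can still be compared; only the window-by-window comparison is affected.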




Re: Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi Yana,

Yes, that is what I am saying. I need both streams to be at the same pace. I
do have timestamps for each datapoint. Tathagata Das suggested a way in an
earlier post: use a bigger window than required and fetch the data you need
from that window based on your timestamps. I was just wondering whether there
are other, cleaner ways to do it.
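
The timestamp-based idea can be sketched roughly as follows. This is plain Python rather than the Spark Streaming API, and it assumes each datapoint is a (timestamp_in_seconds, value) pair; the names bucket_by_event_time, summarize and aligned_stats are hypothetical, and the use of population variance is also an assumption. In a real job the same grouping would be applied to the contents of the oversized window.

```python
import math
from collections import defaultdict


def bucket_by_event_time(points, window_seconds):
    """Group (timestamp, value) pairs by the index of the fixed-width
    window that the datapoint's own timestamp falls into."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[int(ts // window_seconds)].append(value)
    return buckets


def summarize(values):
    """Return (n, (mean, variance, SD)) for one bucket.
    Population variance is used here; that choice is an assumption."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return n, (mean, variance, math.sqrt(variance))


def aligned_stats(stream_a, stream_b, window_seconds):
    """Summarize both streams per window, keeping only windows in which
    both streams have data, so the results line up one to one."""
    a = bucket_by_event_time(stream_a, window_seconds)
    b = bucket_by_event_time(stream_b, window_seconds)
    common = sorted(a.keys() & b.keys())
    return {w: (summarize(a[w]), summarize(b[w])) for w in common}


# Toy data: the two streams deliver different numbers of points per second.
stream_a = [(0.0, 1.0), (0.5, 3.0), (1.2, 5.0)]
stream_b = [(0.1, 2.0), (1.1, 4.0), (1.9, 6.0)]

for window, (stats_a, stats_b) in aligned_stats(stream_a, stream_b, 1.0).items():
    print(window, stats_a, stats_b)
```

Because the window index comes from the datapoint's timestamp and not from its arrival time, both streams are cut at the same boundaries even when they deliver different numbers of elements per batch.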

Regards
Laeeq
 


On Tuesday, July 1, 2014 4:23 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:

Are you saying that both streams come in at the same rate and you have
the same batch interval but the batch size ends up different? i.e. two
datapoints both arriving at X seconds after streaming starts end up in
two different batches? How do you define real time values for both
streams? I am trying to do something similar to you, I think -- but
I'm not clear on what your notion of time is.
My reading of your example above is that the streams just pump data in
at different rates -- the first one got 7462 points in the first batch
interval, whereas stream 2 saw 10493.

