Lars, can you please point to an example?

On Mar 23, 2016 2:16 AM, "Lars Albertsson" <la...@mapflat.com> wrote:
> @Jatin, I touched on that case briefly in the linked presentation.
>
> You will have to decide on a time slot size, and then aggregate slots
> to form windows. E.g. if you select a time slot of an hour, you build
> a CMS and a heavy-hitter list for the current hour slot, and start new
> ones at 00 minutes. In order to form e.g. a 12-hour window, the
> 12-hour CMS is calculated as the sum of the 12 hour-slot CMSs, and the
> 12-hour heavy hitters are the union of the hour-slot heavy hitters.
>
> Since the data structures are small, one can afford to use small time
> slots. One can also keep a long history with different combinations of
> time windows by pushing CMSs and heavy hitters out to e.g. Kafka, and
> having different stream processors aggregate different time windows
> and push results to Kafka or to lookup tables.
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> +46 70 7687109
>
>
> On Tue, Mar 22, 2016 at 1:23 PM, Jatin Kumar <jku...@rocketfuelinc.com> wrote:
> > Hello Yakubovich,
> >
> > I have been looking into a similar problem. @Lars please note that he
> > wants to maintain the top N products over a sliding window, whereas the
> > Count-Min Sketch algorithm is useful if we want to maintain a global
> > top-N products list. Please correct me if I am wrong here.
> >
> > I tried using Count-Min Sketch and realized that it doesn't suit my use
> > case, as I also wanted to maintain the top N over the last H hours.
> > Count-Min Sketch has no notion of time, so in my understanding you
> > cannot use it for that on its own.
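Lars's slot-aggregation scheme can be sketched in a few lines of framework-free Python (an editorial illustration, not code from the thread; the md5-based hashing and the `DEPTH`/`WIDTH` sizes are arbitrary stand-ins, not Algebird's implementation):

```python
import hashlib

DEPTH, WIDTH = 4, 256  # sketch dimensions; deliberately tiny for illustration

def _bucket(item, row):
    # One hash function per row, derived by salting with the row index.
    h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
    return int(h, 16) % WIDTH

def empty_cms():
    return [[0] * WIDTH for _ in range(DEPTH)]

def cms_add(cms, item, count=1):
    for row in range(DEPTH):
        cms[row][_bucket(item, row)] += count

def cms_estimate(cms, item):
    # Min over rows: counts only ever overestimate, never underestimate.
    return min(cms[row][_bucket(item, row)] for row in range(DEPTH))

def cms_sum(sketches):
    """CMSs are mergeable: the sum is point-wise addition of the counters."""
    out = empty_cms()
    for cms in sketches:
        for r in range(DEPTH):
            for c in range(WIDTH):
                out[r][c] += cms[r][c]
    return out

def window_top(slots, n):
    """Each slot is a (CMS, heavy-hitter set) pair, e.g. one per hour.
    A 12-hour window = sum of the slot CMSs + union of the slot heavy
    hitters, re-ranked by their estimated count over the whole window."""
    window_cms = cms_sum([cms for cms, _ in slots])
    candidates = set().union(*(hh for _, hh in slots))
    return sorted(candidates, key=lambda p: -cms_estimate(window_cms, p))[:n]
```

For example, a product that dominates one hour slot still wins the merged window even if another product appears in more slots, because ranking uses the summed sketch rather than per-slot positions.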
> > Yakubovich, you can try doing something like this:
> >
> > val stream = <DStream from your source>
> > // I am assuming that each entry is a comma-separated list of product IDs
> > // and product ID is a string (doesn't really matter though)
> > stream
> >   .flatMap(record => record.split(","))
> >   .map(pid => (pid, 1L))
> >   .reduceByKeyAndWindow(_ + _, _ - _, Seconds(S1), Seconds(S2))
> >   .foreachRDD(rdd => {
> >     // `rdd` here is of type (pid, count) and has the frequency of each
> >     // PID over a sliding window of S1 seconds which moves by S2 seconds
> >     // every time.
> >
> >     // Note: Ordering.compare must return an Int, not a Boolean, so the
> >     // easiest correct form is Ordering.by on the count.
> >     implicit val order: Ordering[(String, Long)] = Ordering.by(_._2)
> >
> >     val topNPidTuples = rdd.top(N)
> >     // do whatever you want here.
> >   })
> >
> >
> > --
> > Thanks
> > Jatin
> >
> > On Tue, Mar 22, 2016 at 12:11 PM, Rishi Mishra <rmis...@snappydata.io> wrote:
> >> Hi Alexy,
> >> We are also trying to solve similar problems using approximation. Would
> >> like to hear more about your usage. We can discuss this offline without
> >> boring others. :)
> >>
> >> Regards,
> >> Rishitesh Mishra,
> >> SnappyData (http://www.snappydata.io/)
> >> https://in.linkedin.com/in/rishiteshmishra
> >>
> >> On Tue, Mar 22, 2016 at 1:19 AM, Lars Albertsson <la...@mapflat.com> wrote:
> >>> Hi,
> >>>
> >>> If you can accept approximate top-N results, there is a neat solution
> >>> to this problem: use an approximate Map<K, V> structure called a
> >>> Count-Min Sketch, in combination with a list of the M top items, where
> >>> M > N. When you encounter an item not in the top M, you look up its
> >>> count in the Count-Min Sketch to determine whether it qualifies.
> >>>
> >>> You will need to break down your event stream into time windows with a
> >>> certain time unit, e.g. minutes or hours, and keep one Count-Min
> >>> Sketch for each unit.
> >>> The CMSs can be added, so you aggregate them to form your sliding
> >>> windows. You also keep a top-M (aka "heavy hitters") list for each
> >>> window.
> >>>
> >>> The data structures required are surprisingly small, and will likely
> >>> fit in memory on a single machine if it can handle the traffic
> >>> volume, so you might not need Spark at all. If you choose to use
> >>> Spark in order to benefit from windowing, be aware that Spark lumps
> >>> events into micro-batches based on processing time, not event time.
> >>>
> >>> I made a presentation on approximate counting a couple of years ago.
> >>> Slides and video here:
> >>> http://www.slideshare.net/lallea/scalable-real-time-processing-techniques-39990105
> >>> You can also search for presentations by Ted Dunning and Mikio Braun,
> >>> who have given good talks on the subject.
> >>>
> >>> There are, AFAIK, two open source implementations of Count-Min
> >>> Sketch, one of them in Algebird.
> >>>
> >>> Let me know if anything is unclear.
> >>>
> >>> Good luck, and let us know how it goes.
> >>>
> >>> Regards,
> >>>
> >>>
> >>> Lars Albertsson
> >>> Data engineering consultant
> >>> www.mapflat.com
> >>> +46 70 7687109
> >>>
> >>>
> >>> On Fri, Mar 11, 2016 at 9:09 PM, Yakubovich, Alexey
> >>> <alexey.yakubov...@searshc.com> wrote:
> >>> > Good day,
> >>> >
> >>> > I have the following task: a stream of "page views" coming to a
> >>> > Kafka topic. Each view contains a list of product IDs from a
> >>> > visited page. The task: to have the top N products in "real time".
> >>> >
> >>> > I am interested in a solution that would require minimum
> >>> > intermediate writes... So I need to build a sliding window for the
> >>> > top N products, where the product counters change dynamically and
> >>> > the window should present the top products for the specified
> >>> > period of time.
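The top-M ("heavy hitters") maintenance that Lars describes — admit a new item when its sketch-estimated count beats the smallest current heavy hitter — can be sketched with a min-heap (an editorial Python illustration; `est_count` stands in for a Count-Min Sketch lookup, which is assumed to exist elsewhere):

```python
import heapq

def update_heavy_hitters(heap, in_heap, item, est_count, m):
    """Maintain a top-M list as a min-heap of (estimated_count, item).
    `heap` is the heap list, `in_heap` the set of items currently in it,
    and `est_count` the item's count estimate from the sketch."""
    if item in in_heap:
        # Item already tracked: refresh its count by rebuilding the heap.
        heap[:] = [(c, i) for c, i in heap if i != item]
        heapq.heapify(heap)
        heapq.heappush(heap, (est_count, item))
    elif len(heap) < m:
        # Still room in the top-M list.
        heapq.heappush(heap, (est_count, item))
        in_heap.add(item)
    elif est_count > heap[0][0]:
        # New item outranks the smallest current heavy hitter: swap it in.
        _, evicted = heapq.heapreplace(heap, (est_count, item))
        in_heap.discard(evicted)
        in_heap.add(item)
```

Keeping M somewhat larger than N, as Lars suggests, gives the reported top N a buffer against the sketch's overestimation error at the boundary.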
> >>> > I believe there is no way to avoid maintaining all product
> >>> > counters in memory/storage. But at least I would like to do all
> >>> > the logic, all calculation on the fly, in memory, not spilling
> >>> > multiple RDDs from memory to disk.
> >>> >
> >>> > So I see one way of doing it:
> >>> > Take a message from Kafka and line up the elementary actions
> >>> > (increase by 1 the counter for the product PID).
> >>> > Each action will be implemented as a call to HTable.increment() //
> >>> > or, easier, with incrementColumnValue()...
> >>> > After each increment I can apply my own operation "offer" that
> >>> > would ensure that only the top N products with counters are kept
> >>> > in another HBase table (also with atomic operations).
> >>> > But there is another stream of events: decreasing product counters
> >>> > when a view expires past the length of the sliding window...
> >>> >
> >>> > So my question: does anybody know of, and can share, a piece of
> >>> > code / know-how for implementing a "sliding top-N window" better?
> >>> > If nothing is offered, I will share what I do myself.
> >>> >
> >>> > Thank you
> >>> > Alexey
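The exact counterpart to Alexey's increment/decrement design — per-slot counters, a rolling total, and eviction when a slot falls out of the window — can be sketched in memory like this (an editorial Python illustration of the approach, with plain `Counter`s standing in for the HBase tables):

```python
from collections import Counter, deque

class SlidingTopN:
    """Exact sliding-window top-N: one Counter per time slot, a rolling
    total, and eviction of expired slots. This keeps every product's
    counter in memory, as Alexey anticipates; the CMS approach discussed
    above trades exactness for much smaller state."""

    def __init__(self, window_slots):
        self.window_slots = window_slots  # window length, in slots
        self.slots = deque()              # one Counter per slot
        self.totals = Counter()           # counts over the whole window

    def advance_slot(self):
        """Open a new slot; evict the oldest once the window is full.
        Eviction is the 'decreasing counters when a view expires' step."""
        self.slots.append(Counter())
        if len(self.slots) > self.window_slots:
            expired = self.slots.popleft()
            self.totals -= expired  # Counter '-=' drops non-positive counts

    def add_view(self, product_ids):
        self.slots[-1].update(product_ids)
        self.totals.update(product_ids)

    def top(self, n):
        return self.totals.most_common(n)
```

The decrement stream Alexey worries about never has to be generated event by event: subtracting a whole expired slot's counter at once is equivalent and far cheaper, which is the same observation behind Spark's `reduceByKeyAndWindow` with an inverse function.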