Hi,

If you can accept approximate top N results, there is a neat solution
for this problem: use an approximate Map<K, V> structure called a
Count-Min Sketch, in combination with a list of the M top items, where
M > N. When you encounter an item not in the top M, you look up its
count in the Count-Min Sketch to determine whether it qualifies.
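
To make this concrete, here is a rough Scala sketch of both pieces: a
small Count-Min Sketch and a top-M candidate list that consults it. The
class names, the depth/width defaults and the use of Long product ids
are illustrative choices of mine, not a reference implementation:

import scala.util.hashing.MurmurHash3

// Minimal Count-Min Sketch for Long keys. depth and width control the
// error bounds; the defaults here are arbitrary.
class CountMinSketch(val depth: Int = 5, val width: Int = 2000) {
  private val table = Array.ofDim[Long](depth, width)

  private def bucket(row: Int, key: Long): Int = {
    val h = MurmurHash3.productHash((row, key))
    ((h % width) + width) % width
  }

  def add(key: Long, count: Long = 1L): Unit =
    (0 until depth).foreach(row => table(row)(bucket(row, key)) += count)

  // Point estimate: never underestimates the true count, may overestimate.
  def estimate(key: Long): Long =
    (0 until depth).map(row => table(row)(bucket(row, key))).min

  // Two sketches over different event sets can be added cell by cell.
  def merge(other: CountMinSketch): CountMinSketch = {
    val out = new CountMinSketch(depth, width)
    for (r <- 0 until depth; c <- 0 until width)
      out.table(r)(c) = table(r)(c) + other.table(r)(c)
    out
  }
}

// Top-M candidate list. An item outside the candidate set only enters it
// if its CMS estimate beats the current minimum.
class TopM(m: Int, cms: CountMinSketch) {
  private var candidates = Map.empty[Long, Long] // product id -> estimated count

  def offer(productId: Long): Unit = {
    cms.add(productId)
    val est = cms.estimate(productId)
    if (candidates.contains(productId) || candidates.size < m) {
      candidates += productId -> est
    } else {
      val (minId, minCount) = candidates.minBy(_._2)
      if (est > minCount) candidates = candidates - minId + (productId -> est)
    }
  }

  def candidateIds: Set[Long] = candidates.keySet

  def topN(n: Int): Seq[(Long, Long)] = candidates.toSeq.sortBy(-_._2).take(n)
}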

You will need to break your event stream into buckets of a certain time
unit, e.g. minutes or hours, and keep one Count-Min Sketch per bucket.
CMSs can be added, so you aggregate them to form your sliding windows.
You also keep a top M (a.k.a. "heavy hitters") list for each window.
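
As a sketch of that arrangement (reusing the CountMinSketch and TopM
classes from above; the one-minute buckets and the way the window is
re-ranked are choices of mine):

import scala.collection.mutable

// One sketch plus one top-M list per one-minute bucket. The sliding window
// is formed by adding together the buckets that are still inside it.
class SlidingTopN(windowMinutes: Int, m: Int) {
  private case class Bucket(cms: CountMinSketch, top: TopM)
  private val buckets = mutable.Map.empty[Long, Bucket] // minute -> bucket

  def record(productId: Long, timestampMillis: Long): Unit = {
    val minute = timestampMillis / 60000L
    val b = buckets.getOrElseUpdate(minute, {
      val cms = new CountMinSketch()
      Bucket(cms, new TopM(m, cms))
    })
    b.top.offer(productId) // offer() also increments the bucket's sketch
    // Expire buckets that have slid out of the window.
    buckets.keys.filter(_ <= minute - windowMinutes).toList.foreach(buckets.remove)
  }

  def topN(n: Int): Seq[(Long, Long)] = {
    // Add the sketches of the live buckets, then re-rank the union of their
    // candidate lists against the merged sketch.
    val window = buckets.values.map(_.cms).reduceOption(_ merge _).getOrElse(new CountMinSketch())
    val candidates = buckets.values.flatMap(_.top.candidateIds).toSet
    candidates.toSeq.map(id => id -> window.estimate(id)).sortBy(-_._2).take(n)
  }
}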

The data structures required are surprisingly small, and will likely
fit in memory on a single machine, if it can handle the traffic
volume, so you might not need Spark at all. If you choose to use Spark
in order to benefit from windowing, be aware that Spark lumps events
into micro-batches based on processing time, not event time.
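
If you do use Spark Streaming, the windowing mechanics look roughly like
this (a sketch against the DStream API; the one-hour window, one-minute
slide, and exact per-window counts are my own illustrative choices, and
remember the window boundaries follow processing time):

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// productIds: a DStream[Long] of product ids already parsed from the Kafka topic.
def printTopN(productIds: DStream[Long], n: Int): Unit = {
  val counts = productIds
    .map(id => (id, 1L))
    .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(3600), Seconds(60))

  counts.foreachRDD { rdd =>
    // Exact top N within the current window; fine when the number of distinct
    // products per window is manageable.
    rdd.top(n)(Ordering.by[(Long, Long), Long](_._2)).foreach(println)
  }
}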

I made a presentation on approximate counting a couple of years ago.
Slides and video here:
http://www.slideshare.net/lallea/scalable-real-time-processing-techniques-39990105.
You can also search for presentations by Ted Dunning and Mikio Braun,
who have given good talks on the subject.

There are AFAIK two open source implementations of Count-Min Sketch,
one of them in Algebird.
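
For reference, usage of the Algebird one looks roughly like this (I'm
going from memory on the exact API, so treat parameter order and names
as assumptions and check the Algebird version you depend on):

import com.twitter.algebird.CMSHasherImplicits._
import com.twitter.algebird.TopNCMS

// eps, delta, seed, number of heavy hitters (the M above). Values are arbitrary.
val monoid = TopNCMS.monoid[Long](0.001, 1e-6, 1, 100)

// One sketch per time bucket, built from that bucket's product ids...
val bucketA = monoid.create(Seq(1L, 2L, 2L, 3L))
val bucketB = monoid.create(Seq(2L, 3L, 3L, 3L))

// ...added together to form the sliding window.
val window = monoid.plus(bucketA, bucketB)

println(window.heavyHitters)           // candidate top products in the window
println(window.frequency(3L).estimate) // approximate count for product id 3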

Let me know if anything is unclear.

Good luck, and let us know how it goes.

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109


On Fri, Mar 11, 2016 at 9:09 PM, Yakubovich, Alexey
<alexey.yakubov...@searshc.com> wrote:
> Good day,
>
> I have the following task: a stream of “page views” coming to a Kafka topic. Each
> view contains a list of product ids from the visited page. The task: to have the
> top N products in “real time”.
>
> I am interested in a solution that would require minimal intermediate
> writes … So I need to build a sliding window for the top N products, where the
> product counters change dynamically and the window presents the top
> products for the specified period of time.
>
> I believe there is no way to avoid maintaining all product counters
> in memory/storage.  But at least I would like to do all the logic, all
> calculation on the fly, in memory, without spilling multiple RDDs from memory to
> disk.
>
> So I believe I see one way of doing it:
>    Take messages from Kafka and line up the elementary actions (increase the
> counter for the product PID by 1).
>   Each action will be implemented as a call to HTable.increment()  // or
> easier, with incrementColumnValue()…
>   After each increment I can apply my own “offer” operation to ensure
> that only the top N products and their counters are kept in another HBase table (also
> with atomic operations).
>  But there is another stream of events: decreasing product counters when a
> view falls outside the length of the sliding window….
>
> So my question: does anybody know of, or have and can share, code or know-how
> on how to implement a “sliding top N window” better?
> If nothing is offered, I will share what I end up doing myself.
>
> Thank you
> Alexey
