Re: [algogeeks] top search queries

2011-01-31 Thread snehal jain
please give your ideas for this

On Fri, Jan 21, 2011 at 2:28 PM, snehal jain learner@gmail.com wrote:

 Magy! receives a huge number of queries every day. They don’t want to
 store all of them, since many of the queries are unique (submitted
 just once by just one user). However, for caching reasons they want
 to identify the queries whose frequency exceeds s = 0.1%. How do you
 solve the problem? Remember, we don’t want to store all the queries.
 Hint: split the stream into windows and accept some error in the
 estimation.

 --
 You received this message because you are subscribed to the Google Groups
 Algorithm Geeks group.
 To post to this group, send email to algogeeks@googlegroups.com.
 To unsubscribe from this group, send email to
 algogeeks+unsubscr...@googlegroups.com.
 For more options, visit this group at
 http://groups.google.com/group/algogeeks?hl=en.






Re: [algogeeks] top search queries

2011-01-31 Thread Srikar
First approach:

0) The queries can be considered an infinite stream. Maintain a global
count of the number of queries seen so far (used to compute each query's
frequency %).
1) Maintain a min-heap (holding just the top 100 queries by frequency) plus
a hash table (where we store the frequency of each query in the heap). The
element at the top of the heap is the query with the minimum frequency.
2) When a query arrives, check whether it is present in the hash table. If
yes, update that query's frequency and reorder the heap. If no, do nothing.

Here I assume we want the top 100 queries. This can easily be extended to
include all queries having frequency > 0.1%.

Complexity:
time: O(log n) per update, plus O(n) for heap and hash-table construction,
where n = 100
space: O(n) for the heap plus O(n) for the hash table

Will come up with something better. If n is extended to, say, 100,000, the
space complexity is going to be a problem.
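
The first approach's bookkeeping can be sketched as follows. This is only a
sketch under an added assumption: the post never says how an unseen query
enters the heap, so this version evicts the current minimum when the tracker
is full and lets the newcomer inherit its count (in the style of the
space-saving heuristic). The class and method names are mine.

```python
import heapq

class TopKTracker:
    """Track the k most frequent queries in a stream (approximately)."""

    def __init__(self, k=100):
        self.k = k
        self.total = 0   # global count of all queries seen (step 0)
        self.freq = {}   # query -> count, only for the k tracked queries
        self.heap = []   # (count, query); entries may be stale (lazy deletion)

    def add(self, query):
        self.total += 1
        if query in self.freq:                      # step 2: known query
            self.freq[query] += 1
            heapq.heappush(self.heap, (self.freq[query], query))
        elif len(self.freq) < self.k:               # tracker not yet full
            self.freq[query] = 1
            heapq.heappush(self.heap, (1, query))
        else:
            # Assumption not in the post: evict the minimum-frequency
            # query and let the newcomer inherit its count.
            count, victim = self._pop_min()
            del self.freq[victim]
            self.freq[query] = count + 1
            heapq.heappush(self.heap, (count + 1, query))
        if len(self.heap) > 2 * self.k:             # compact stale entries
            self.heap = [(c, q) for q, c in self.freq.items()]
            heapq.heapify(self.heap)

    def _pop_min(self):
        # Skip stale entries whose count no longer matches the hash table.
        while self.heap:
            count, query = heapq.heappop(self.heap)
            if self.freq.get(query) == count:
                return count, query
        raise ValueError("tracker is empty")

    def heavy_hitters(self, s=0.001):
        """Tracked queries whose estimated frequency exceeds s (0.1%)."""
        if self.total == 0:
            return {}
        return {q: c for q, c in self.freq.items() if c / self.total > s}
```

Rather than reordering the heap on every update, stale heap entries are
skipped on pop and the heap is rebuilt from the hash table whenever it
grows past 2k entries, so memory stays O(k).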


Second approach:

0) Use only one min-heap. Each node holds (query, frequency %).
1) Get a query from the stream. Traverse the heap to see whether this query
is already present. If yes, recalculate its frequency and, if needed,
reorder the heap. If the query is not present in the heap, calculate its
frequency and compare it with the frequency of the node at the top of the
min-heap: if curr_query_freq > min-heap root freq, replace the root and
reorder the heap; else continue.

Time: O(log n), where n = number of queries we want to consider.
space: O(n)

Srikar



On Mon, Jan 31, 2011 at 6:57 PM, snehal jain learner@gmail.com wrote:

 please give your ideas for this


 On Fri, Jan 21, 2011 at 2:28 PM, snehal jain learner@gmail.comwrote:

 Magy! receives a huge number of queries every day. They don’t want to
 store all of them, since many of the queries are unique (submitted
 just once by just one user). However, for caching reasons they want
 to identify the queries whose frequency exceeds s = 0.1%. How do you
 solve the problem? Remember, we don’t want to store all the queries.
 Hint: split the stream into windows and accept some error in the
 estimation.






Re: [algogeeks] top search queries

2011-01-31 Thread juver++
@above: in your second approach, the worst case requires traversing the
whole heap, which takes O(n) time.
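
One standard way around that linear traversal (my addition, not something
proposed in the thread) is to pair the heap with a query-to-position map,
so membership tests are O(1) and a frequency update re-sifts a single node
in O(log n):

```python
class IndexedMinHeap:
    """Min-heap of [freq, query] plus a query -> position map."""

    def __init__(self):
        self.items = []   # heap array of [freq, query] pairs
        self.pos = {}     # query -> index of its pair in self.items

    def __contains__(self, query):
        return query in self.pos        # O(1) membership test

    def push(self, query, freq):
        self.items.append([freq, query])
        self.pos[query] = len(self.items) - 1
        self._sift_up(len(self.items) - 1)

    def increase_key(self, query, freq):
        i = self.pos[query]
        self.items[i][0] = freq
        self._sift_down(i)              # frequency grew, so it may sink

    def min(self):
        return self.items[0]            # pair with the smallest freq

    def pop_min(self):
        self._swap(0, len(self.items) - 1)
        freq, query = self.items.pop()
        del self.pos[query]
        if self.items:
            self._sift_down(0)
        return freq, query

    def _swap(self, i, j):
        self.items[i], self.items[j] = self.items[j], self.items[i]
        self.pos[self.items[i][1]] = i  # keep the position map in sync
        self.pos[self.items[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.items[i][0] < self.items[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.items)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.items[c][0] < self.items[smallest][0]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest
```

With this structure, step 1 of the second approach becomes: check
membership via the map, then either `increase_key` or compare against
`min()` and replace the root, all without scanning the heap.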




[algogeeks] top search queries

2011-01-21 Thread snehal jain
Magy! receives a huge number of queries every day. They don’t want to
store all of them, since many of the queries are unique (submitted
just once by just one user). However, for caching reasons they want
to identify the queries whose frequency exceeds s = 0.1%. How do you
solve the problem? Remember, we don’t want to store all the queries.
Hint: split the stream into windows and accept some error in the
estimation.
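
The hint matches the Lossy Counting algorithm of Manku and Motwani:
process the stream in windows (buckets) of width ceil(1/epsilon), prune
low-count entries at each window boundary, and accept up to epsilon*N of
undercounting. A sketch, where the function name and parameter defaults
are mine:

```python
import math

def lossy_counting(stream, s=0.001, epsilon=0.0001):
    """Report queries whose frequency may exceed s, with error epsilon < s."""
    width = math.ceil(1 / epsilon)   # window (bucket) size
    counts = {}                      # query -> (count, max undercount delta)
    n = 0                            # total queries seen
    bucket = 1                       # current window id
    for query in stream:
        n += 1
        if query in counts:
            c, d = counts[query]
            counts[query] = (c + 1, d)
        else:
            # A new entry may have been counted and pruned in earlier
            # windows, so it could be undercounted by up to bucket - 1.
            counts[query] = (1, bucket - 1)
        if n % width == 0:           # window boundary: prune light entries
            counts = {q: (c, d) for q, (c, d) in counts.items()
                      if c + d > bucket}
            bucket += 1
    # Every query with true frequency > s*n is reported; some queries
    # with frequency between (s - epsilon)*n and s*n may also appear.
    return {q for q, (c, d) in counts.items() if c >= (s - epsilon) * n}
```

This stores only the queries that survive pruning, which is how the
"don't store all the queries" constraint is met at the cost of the
bounded estimation error the hint allows.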
