[algogeeks] Re: 10 most repeating words

2010-10-24 Thread Chi
From Wikipedia:

> Thus the MapReduce framework transforms a list of (key, value) pairs into a
> list of values. This behavior is different from the functional programming
> map and reduce combination, which accepts a list of arbitrary values and
> returns one single value that combines all the values returned by map.

And how would that differ from using a parallel algorithm such as
Prim's or Dijkstra's?

On Oct 22, 7:11 pm, Kishen Das  wrote:
> MapReduce is the best option.
>
> For the word count, it's explained here - http://en.wikipedia.org/wiki/MapReduce
>
> Interesting thing is that the Map step can easily be made parallel.
>
> Once again I request the members of this group to go through all the
> parallel constructs. (Parallel sorting, Parallel collections, etc.)
> It's cool to optimize sequential programs, but with GPUs and ever-increasing
> cores, you should start thinking in terms of parallelizing your code.
>
> Kishen




Re: [algogeeks] Re: 10 most repeating words

2010-10-24 Thread Kishen Das
MapReduce is the best option.

For the word count, it's explained here -
http://en.wikipedia.org/wiki/MapReduce

The interesting thing is that the Map step can easily be made parallel.
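
A minimal single-machine sketch of that word-count example, with plain
Python standing in for a real MapReduce framework (the map phase is the
part a framework would actually run in parallel):

from collections import defaultdict

def map_phase(document):
    # map: emit a (word, 1) pair for every word in the input
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # reduce: combine all the values emitted for one key
    return word, sum(counts)

def word_count(documents):
    groups = defaultdict(list)          # the "shuffle" step: group by key
    for doc in documents:
        for word, one in map_phase(doc):
            groups[word].append(one)
    return dict(reduce_phase(w, c) for w, c in groups.items())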

Once again I request the members of this group to go through all the
parallel constructs. (Parallel sorting, Parallel collections, etc.)
It's cool to optimize sequential programs, but with GPUs and ever-increasing
cores, you should start thinking in terms of parallelizing your code.

Kishen

On Fri, Oct 22, 2010 at 9:24 AM, ligerdave  wrote:

> for a large file, you probably would want to use an external sort, kinda
> like a map-reduce concept. it's actually how the sort & uniq kind of
> thing works in unix/linux when you try to find some "TOP X".
>
> again, we are talking about the case where memory might not hold the
> entire file.
>
> On Oct 21, 9:35 am, "Vinay..."  wrote:
> > how do you find the 10 most repeating words in a large file in the
> > most efficient way? If it can also be done using heapsort, please
> > post your answers.




[algogeeks] Re: 10 most repeating words

2010-10-24 Thread ligerdave
@Dave
I hear ya. I'm just saying that, in general, you would want to have an
algorithm for all cases.
Of course, in this case, if RAM isn't a consideration and heapsort is
what @Vinay wants, I guess we are coming up with one like that.
Again, in general, you don't want to have one version of a program for
King James and another for something bigger, right?

BTW, do you have any new clue on the nth largest sum of two arrays? I
realized the solution I gave wasn't working for all cases. I'm on the
right track, I think; however, there is something that must be fixed,
and I'm scratching my head now. :) Let me know, man!

On Oct 22, 11:19 pm, Dave  wrote:
> @Ligerdave: Hey, the King James version of the Bible is only about
> 600,000 words. I use the Bible as an example only because it is a
> fairly large book. Maybe we are talking 10 megabytes to store the
> whole thing, seeing that there are some long words such as "Maher-
> shalal-hash-baz," a name that occurs in Isaiah 8:3. Ten megabytes
> hardly seems "large," when compared to the 4 or 8 gigabytes or more of
> RAM on many computers. Besides, you don't have to keep all of the text
> in memory, but only the distinct words and an integer count of the
> number of occurrences. For the King James Bible, this is less than
> 5,000 words, so we're talking a pretty minimal amount of memory. A
> hash table might work fine for this, or build a balanced binary tree
> of the words. After you have scanned all of the input text and
> determined the number of occurrences of each word, it is fairly easy
> to scan the word counts and pick out the ten largest.
>
> Dave




[algogeeks] Re: 10 most repeating words

2010-10-23 Thread Chi
The biggest weakness of the trie is the amount of space it wastes:
chances are there will be runs of characters with a word only at the
end, meaning a bunch of extra levels of the tree that aren't needed.
The PATRICIA trie, or radix trie, attempts to address this by allowing
the 'decision' at each node to be on something other than the first
character, so each node stores an offset and the branch options (e.g.
if the letter in position 5 is 'A', go down this branch).
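
A rough sketch of that lookup in Python (the node layout is hypothetical,
not code from any particular library): inner nodes test one character
position, and because the skipped positions are never compared, the full
key has to be verified at the leaf.

class Inner:
    def __init__(self, offset):
        self.offset = offset      # which character position to test
        self.branches = {}        # character at that position -> child

class Leaf:
    def __init__(self, word, count=0):
        self.word = word          # the full key is kept at the leaf
        self.count = count

def lookup(node, word):
    while isinstance(node, Inner):
        ch = word[node.offset] if node.offset < len(word) else ''
        node = node.branches.get(ch)
        if node is None:
            return None
    # skipped positions were never compared, so confirm the whole key
    return node if node.word == word else None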

On Oct 23, 9:13 pm, Chi  wrote:
> I've written a kart-trie in PHP. You can easily extend the payload
> yourself to count the word frequency.
>
> > http://sourceforge.net/projects/kart-trie
>
> After you build your dictionary from your large file, you can easily
> find the highest frequency by recursively searching the trie. It should
> be faster than a hash table, because the kart-trie is like an
> unbalanced binary trie. The only difference between a kart-trie and a
> radix trie, or a crit-bit trie, is that it uses a hash key to identify
> the payload. That makes it suitable for other data as well.
>
> On Oct 21, 3:35 pm, "Vinay..."  wrote:
>
> > how do you find the 10 most repeating words in a large file in the
> > most efficient way? If it can also be done using heapsort, please
> > post your answers.




[algogeeks] Re: 10 most repeating words

2010-10-23 Thread Chi
I've written a kart-trie in PHP. You can easily extend the payload
yourself to count the word frequency.

> http://sourceforge.net/projects/kart-trie

After you build your dictionary from your large file, you can easily
find the highest frequency by recursively searching the trie. It should
be faster than a hash table, because the kart-trie is like an
unbalanced binary trie. The only difference between a kart-trie and a
radix trie, or a crit-bit trie, is that it uses a hash key to identify
the payload. That makes it suitable for other data as well.
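
As a sketch of that recursive search (the node layout below is a
hypothetical stand-in, not the actual kart-trie structures):

def max_frequency(node):
    # node: dict with optional 'word'/'count' payload and
    # 'left'/'right' children, mimicking an unbalanced binary trie
    if node is None:
        return None, 0
    best = (node.get('word'), node.get('count', 0))
    for child in (node.get('left'), node.get('right')):
        word, count = max_frequency(child)
        if count > best[1]:
            best = (word, count)
    return best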


On Oct 21, 3:35 pm, "Vinay..."  wrote:
> how do you find the 10 most repeating words in a large file in the
> most efficient way? If it can also be done using heapsort, please
> post your answers.




Re: [algogeeks] Re: 10 most repeating words

2010-10-23 Thread preetika tyagi
Use a max heap of all the distinct words and order them in decreasing
order of their frequency.
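
A minimal sketch of that, assuming the word counts are already in a
dict. Python's heapq is a min heap, so negating the counts makes it
behave as a max heap:

import heapq

def most_frequent(counts, k=10):
    # counts: dict mapping word -> frequency
    heap = [(-n, w) for w, n in counts.items()]
    heapq.heapify(heap)               # O(number of distinct words)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]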

On Sat, Oct 23, 2010 at 7:31 AM, Mridul Malpani wrote:

> use 2 data structures: a TRIE and an array of words along with their
> frequencies.
>
> we can solve it in the following steps:
>
> 1) scan each word of the large file. insert the current word into the
> TRIE if not already present, and update its frequency.
>    using the TRIE, the time complexity is O(L), where L is the total
> number of characters in the file.
>
> 2) now you have built up the array of words with frequencies. the size
> of the array = n = the number of distinct words in the file.
>    get the k maximum frequencies in O(n lg k); here k = 10. (use a min
> heap of k elements to get this.)




[algogeeks] Re: 10 most repeating words

2010-10-23 Thread Mridul Malpani
use 2 data structures: a TRIE and an array of words along with their
frequencies.

we can solve it in the following steps:

1) scan each word of the large file. insert the current word into the
TRIE if not already present, and update its frequency.
    using the TRIE, the time complexity is O(L), where L is the total
number of characters in the file.

2) now you have built up the array of words with frequencies. the size
of the array = n = the number of distinct words in the file.
    get the k maximum frequencies in O(n lg k); here k = 10. (use a min
heap of k elements to get this; see the sketch below.)
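
A rough sketch of both steps, with a plain dict-of-dicts trie as a
stand-in for a real TRIE implementation (heapq.nlargest maintains the
size-k min heap internally):

import heapq

END = '#'   # sentinel key holding the count of a complete word

def build_trie(words):
    # step 1: one node per character, O(L) over all characters
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1
    return root

def frequencies(node, prefix=''):
    # walk the trie, yielding (word, count) pairs
    for key, child in node.items():
        if key == END:
            yield prefix, child
        else:
            yield from frequencies(child, prefix + key)

def top_k(words, k=10):
    # step 2: a min heap of k entries gives O(n lg k)
    return heapq.nlargest(k, frequencies(build_trie(words)),
                          key=lambda wc: wc[1])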

On Oct 23, 8:19 am, Dave  wrote:
> @Ligerdave: Hey, the King James version of the Bible is only about
> 600,000 words. I use the Bible as an example only because it is a
> fairly large book. Maybe we are talking 10 megabytes to store the
> whole thing, seeing that there are some long words such as "Maher-
> shalal-hash-baz," a name that occurs in Isaiah 8:3. Ten megabytes
> hardly seems "large," when compared to the 4 or 8 gigabytes or more of
> RAM on many computers. Besides, you don't have to keep all of the text
> in memory, but only the distinct words and an integer count of the
> number of occurrences. For the King James bible, this is less than
> 5,000 words, so we're talking a pretty minimal amount of memory. A
> hash table might work fine for this, or build a balanced binary tree
> of the words. After you have scanned all of the input text and
> determined the number of occurrences of each word, it is fairly easy
> to scan the word counts and pick out the ten largest.
>
> Dave




[algogeeks] Re: 10 most repeating words

2010-10-22 Thread Dave
@Ligerdave: Hey, the King James version of the Bible is only about
600,000 words. I use the Bible as an example only because it is a
fairly large book. Maybe we are talking 10 megabytes to store the
whole thing, seeing that there are some long words such as "Maher-
shalal-hash-baz," a name that occurs in Isaiah 8:3. Ten megabytes
hardly seems "large," when compared to the 4 or 8 gigabytes or more of
RAM on many computers. Besides, you don't have to keep all of the text
in memory, but only the distinct words and an integer count of the
number of occurrences. For the King James Bible, this is less than
5,000 words, so we're talking a pretty minimal amount of memory. A
hash table might work fine for this, or build a balanced binary tree
of the words. After you have scanned all of the input text and
determined the number of occurrences of each word, it is fairly easy
to scan the word counts and pick out the ten largest.
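
A sketch of that approach, assuming whitespace-separated words and a
hypothetical filename; only the distinct words and their counts ever
stay in memory:

from collections import Counter

counts = Counter()                        # hash table: word -> count
with open('kjv.txt') as f:                # 'kjv.txt' is a placeholder
    for line in f:                        # stream the text line by line
        counts.update(line.lower().split())
print(counts.most_common(10))             # scan counts, keep the ten largest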

Dave

On Oct 22, 9:24 am, ligerdave  wrote:
> for a large file, you probably would want to use an external sort, kinda
> like a map-reduce concept. it's actually how the sort & uniq kind of
> thing works in unix/linux when you try to find some "TOP X".
>
> again, we are talking about the case where memory might not hold the
> entire file.
>
> On Oct 21, 9:35 am, "Vinay..."  wrote:
>
> > how do you find the 10 most repeating words in a large file in the
> > most efficient way? If it can also be done using heapsort, please
> > post your answers.




[algogeeks] Re: 10 most repeating words

2010-10-22 Thread ligerdave
for a large file, you probably would want to use an external sort, kinda
like a map-reduce concept. it's actually how the sort & uniq kind of
thing works in unix/linux when you try to find some "TOP X".

again, we are talking about the case where memory might not hold the
entire file.
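
The pipeline being alluded to is essentially
"sort file | uniq -c | sort -rn | head -10". A sketch of the same idea
(in-memory here; a true external sort would merge sorted runs from disk):

from itertools import groupby

def top10_sort_uniq(words):
    # sorting brings equal words together (an external sort would do
    # the same with on-disk runs); groupby plays the role of uniq -c
    counted = [(len(list(g)), w) for w, g in groupby(sorted(words))]
    counted.sort(reverse=True)        # sort -rn
    return counted[:10]               # head -10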

On Oct 21, 9:35 am, "Vinay..."  wrote:
> how do you find the 10 most repeating words in a large file in the
> most efficient way? If it can also be done using heapsort, please
> post your answers.




[algogeeks] Re: 10 most repeating words

2010-10-22 Thread Vinay...
Can you please elaborate how a min heap helps to find the most
repeating words?

On Oct 21, 6:40 am, ashish agarwal  wrote:
> use an array of 10 and apply a min heap
>
> On Thu, Oct 21, 2010 at 7:05 PM, Vinay...  wrote:
> > how do you find the 10 most repeating words in a large file in the
> > most efficient way? If it can also be done using heapsort, please
> > post your answers.
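
A minimal sketch of what ashish is suggesting, assuming the (word,
count) pairs have already been produced: keep a min heap of at most 10
entries; its root is the smallest count in the current top 10, and any
larger count replaces it, so the heap ends up holding the 10 largest.

import heapq

def ten_most_frequent(counts):
    # counts: iterable of (word, count) pairs, e.g. dict.items()
    heap = []                              # min heap of at most 10 entries
    for word, count in counts:
        if len(heap) < 10:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:           # beats the smallest of the top 10
            heapq.heapreplace(heap, (count, word))
    return sorted(heap, reverse=True)      # most frequent first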




[algogeeks] Re: 10 most repeating words

2010-10-21 Thread LinGmnZ
hashing -> separate chaining
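
A bare-bones sketch of a separate-chaining table used as a word counter
(fixed bucket count for brevity; a real table would resize as it fills):

class ChainedCounter:
    def __init__(self, nbuckets=1024):
        self.buckets = [[] for _ in range(nbuckets)]

    def add(self, word):
        chain = self.buckets[hash(word) % len(self.buckets)]
        for entry in chain:            # colliding words share one chain
            if entry[0] == word:
                entry[1] += 1
                return
        chain.append([word, 1])

    def items(self):
        for chain in self.buckets:
            for word, count in chain:
                yield word, count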

On Oct 21, 8:35 pm, "Vinay..."  wrote:
> how do you find the 10 most repeating words in a large file in the
> most efficient way? If it can also be done using heapsort, please
> post your answers.

-- 
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to algoge...@googlegroups.com.
To unsubscribe from this group, send email to 
algogeeks+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/algogeeks?hl=en.