use 2 datstructure : TRIE and an array of words along with their
frequency.

we can solve it in following step:

1) scan each word of the large file. insert the current word into the
TRIE if not present already. update the frequency.
    using TRIE, the time complexity is O(L), where l is the total no.
of characters in file.

2) now, you have build up the array of words with frequency. the size
of array = n = no. of words in file.
    get k maximum frequency in O(nlgk). here k=10. ( use min heap of k
elements to get this.)

On Oct 23, 8:19 am, Dave <dave_and_da...@juno.com> wrote:
> @Ligerdave: Hey, the King James version of the Bible is only about
> 600,000 words. I use the Bible as an example only because it is a
> fairly large book. Maybe we are talking 10 megabytes to store the
> whole thing, seeing that there are some long words such as "Maher-
> shalal-hash-baz," a name that occurs in Isaiah 8:3. Ten megabytes
> hardly seems "large," when compared to the 4 or 8 gigabytes or more of
> RAM on many computers. Besides, you don't have to keep all of the text
> in memory, but only the distinct words and an integer count of the
> number of occurrences. For the King James bible, this is less than
> 5,000 words, so we're talking a pretty minimal amount of memory. A
> hash table might work fine for this, or build a balanced binary tree
> of the words. After you have scanned all of the input text and
> determined the number of occurrences of each word, it is fairly easy
> to scan the word counts and pick out the ten largest.
>
> Dave
>
> On Oct 22, 9:24 am, ligerdave <david.c...@gmail.com> wrote:
>
>
>
> > for a large file, you probably would want to use external sort. kinda
> > like a map-reduce concept. it's actually how sort&uniq kinda stuff
> > work in unix/linux when you try to find some "TOP X"
>
> > again, we are talking about the memory might not hold the entire file
>
> > On Oct 21, 9:35 am, "Vinay..." <vinumars...@gmail.com> wrote:
>
> > > how do u find 10 most repeating words on a large file containing words
> > > in most efficient way...if it can also be done using heapsort plz post
> > > ur answers..- Hide quoted text -
>
> > - Show quoted text -

-- 
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to algoge...@googlegroups.com.
To unsubscribe from this group, send email to 
algogeeks+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/algogeeks?hl=en.

Reply via email to