use 2 datstructure : TRIE and an array of words along with their frequency.
we can solve it in following step: 1) scan each word of the large file. insert the current word into the TRIE if not present already. update the frequency. using TRIE, the time complexity is O(L), where l is the total no. of characters in file. 2) now, you have build up the array of words with frequency. the size of array = n = no. of words in file. get k maximum frequency in O(nlgk). here k=10. ( use min heap of k elements to get this.) On Oct 23, 8:19 am, Dave <dave_and_da...@juno.com> wrote: > @Ligerdave: Hey, the King James version of the Bible is only about > 600,000 words. I use the Bible as an example only because it is a > fairly large book. Maybe we are talking 10 megabytes to store the > whole thing, seeing that there are some long words such as "Maher- > shalal-hash-baz," a name that occurs in Isaiah 8:3. Ten megabytes > hardly seems "large," when compared to the 4 or 8 gigabytes or more of > RAM on many computers. Besides, you don't have to keep all of the text > in memory, but only the distinct words and an integer count of the > number of occurrences. For the King James bible, this is less than > 5,000 words, so we're talking a pretty minimal amount of memory. A > hash table might work fine for this, or build a balanced binary tree > of the words. After you have scanned all of the input text and > determined the number of occurrences of each word, it is fairly easy > to scan the word counts and pick out the ten largest. > > Dave > > On Oct 22, 9:24 am, ligerdave <david.c...@gmail.com> wrote: > > > > > for a large file, you probably would want to use external sort. kinda > > like a map-reduce concept. it's actually how sort&uniq kinda stuff > > work in unix/linux when you try to find some "TOP X" > > > again, we are talking about the memory might not hold the entire file > > > On Oct 21, 9:35 am, "Vinay..." <vinumars...@gmail.com> wrote: > > > > how do u find 10 most repeating words on a large file containing words > > > in most efficient way...if it can also be done using heapsort plz post > > > ur answers..- Hide quoted text - > > > - Show quoted text - -- You received this message because you are subscribed to the Google Groups "Algorithm Geeks" group. To post to this group, send email to algoge...@googlegroups.com. To unsubscribe from this group, send email to algogeeks+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/algogeeks?hl=en.