<reposting since it didn't appear yesterday, apologies>

A small variation of Vishal's algorithm that eliminates the need for the
bitmap. I use a hash table of integers indexed by keyword, initialized
to all 0. I also make use of the property that in the shortest summary
the first and the last keyword each appear exactly once.

1. foreach word in document
        if word is a keyword
           hash[word]++; // by the end of this loop we have the
                         // frequency of every keyword

2. Do a preliminary check that each hash[keyword] has frequency >= 1.
If at least one has frequency 0, stop and return null to indicate that
no summary exists.

3. startp = pointer to first word, endp = pointer to last word

4. for (; startp <= endp; startp++)
        if (*startp is not a keyword)
           continue; // skip non-keywords
        else if (hash[*startp] == 1) // stop at the first keyword with frequency 1
           break;
        else
           hash[*startp]--;
   // by the end of this loop, startp points to the first word of the
   // summary, i.e. a keyword that appears exactly once in it

5. for (; startp <= endp; endp--)
        if (*endp is not a keyword)
           continue; // skip non-keywords
        else if (hash[*endp] == 1) // stop at the last keyword with frequency 1
           break;
        else
           hash[*endp]--;
   // by the end of this loop, endp points to the last word of the
   // summary, i.e. a keyword that appears exactly once in it

6. return the summary, which runs from startp to endp

Worst case is O(2N) = O(N). Also, if more than one shortest summary is
present, this will return the last one.
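
Rough C++ of the above, in case the pseudocode is ambiguous. The
tokenization into word vectors, the names and the use of unordered_map
are my choices, so treat this as a sketch rather than tested code:

#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
using namespace std;

// Returns the word-index range [start, end] of the summary,
// or (-1, -1) when some keyword is missing from the document.
pair<int, int> shortestSummary(const vector<string>& doc,
                               const vector<string>& keywords) {
        unordered_map<string, int> hash;       // keyword -> frequency
        for (const string& k : keywords) hash[k] = 0;
        for (const string& w : doc) {          // step 1: count keywords only
                auto it = hash.find(w);
                if (it != hash.end()) it->second++;
        }
        for (const string& k : keywords)       // step 2: every keyword present?
                if (hash[k] == 0) return make_pair(-1, -1);

        int startp = 0, endp = (int)doc.size() - 1;   // step 3
        for (; startp <= endp; startp++) {     // step 4: trim from the front
                auto it = hash.find(doc[startp]);
                if (it == hash.end()) continue;       // not a keyword
                if (it->second == 1) break;    // first keyword occurring once
                it->second--;
        }
        for (; startp <= endp; endp--) {       // step 5: trim from the back
                auto it = hash.find(doc[endp]);
                if (it == hash.end()) continue;       // not a keyword
                if (it->second == 1) break;    // last keyword occurring once
                it->second--;
        }
        return make_pair(startp, endp);        // step 6
}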

-Shrenik

On Oct 5, 6:34 am, MartinH <[EMAIL PROTECTED]> wrote:
> {I already posted this yesterday but it did not appear: apologies if
> it shows up twice}
>
> This looks like a hard-to-solve and nastily realistic problem. By
> nastily realistic I mean it looks like the kind of thing we might be
> asked to code by lunchtime tomorrow.
>
> As daizisheng wrote, there is nothing wrong with perfect hashing to
> identify keywords, but it may be impractical to implement. With just
> 'a'..'z' allowed in keywords, maximum data compression and a maximum
> keyword length of 10, the hash table needs 2^48 entries. With full
> 8-bit or ISO characters its size becomes completely untenable.
> Obviously not all locations represent real words, but any hash
> function that takes this into account to compress the range won't be
> O(1) and/or can't be guaranteed perfect.
>
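[To spell that estimate out: log2(26) is roughly 4.7 bits per letter,
so a 10-letter keyword needs about 47 bits at maximum compression, and
a table with a slot for every possible keyword has on the order of
2^47 to 2^48 entries - around 10^14. The arithmetic is mine, but it
supports the point.]
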
> Similar objections apply to using a hash table to assign integers to
> words. If collisions are allowed, it is not O(1). We might just as
> well have hashed in the keywords first and accepted worse than O(1).
> As Vishal suggested, a trie looks more realistic. Building the trie
> can be done in O(m), with m the total number of characters in the
> keywords. Identifying whether a document character is part of a
> keyword candidate is then O(1).
>
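A minimal trie along those lines might look like this (my sketch, with
invented names, assuming lowercase 'a'..'z' keywords as above):

#include <array>
#include <memory>
#include <string>
#include <vector>
using namespace std;

struct TrieNode {
        array<unique_ptr<TrieNode>, 26> child; // one slot per letter 'a'..'z'
        bool isKeyword = false;                // true if a keyword ends here
};

// Building the trie is O(m), m = total characters over all keywords.
unique_ptr<TrieNode> buildTrie(const vector<string>& keywords) {
        unique_ptr<TrieNode> root(new TrieNode);
        for (const string& k : keywords) {
                TrieNode* node = root.get();
                for (char c : k) {
                        unique_ptr<TrieNode>& slot = node->child[c - 'a'];
                        if (!slot) slot.reset(new TrieNode);
                        node = slot.get();
                }
                node->isKeyword = true;        // this path spells a keyword
        }
        return root;
}

// Advancing by one document character is O(1); null means
// "no keyword has this prefix".
const TrieNode* step(const TrieNode* node, char c) {
        if (node == 0 || c < 'a' || c > 'z') return 0;
        return node->child[c - 'a'].get();
}
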
> You asked for O(n) on the length of the sequence. I think this can be
> taken as n, the number of characters in the document. Fair, because
> we have to iterate through all the characters to find the positions
> of words and keywords. Anyway, this is Big O, so we can argue that
> O(n) <=> O(doc_words), i.e. n = K * doc_words, with K the average
> word length in the document.
>
> Having got O(1) for keyword identification, to get O(n) we need a
> scanning algorithm containing a simple loop that advances a pointer
> to the next current character in the document - or do we? Andrey's
> listing has a nested loop to do trimming that rechecks words already
> checked in the main loop. Also, the algorithm has a tendency to find
> longer and longer sequences with the same start word that clearly
> cannot provide a new minimum. I'm no expert, but does this still give
> O(n+m)?
>
> A simple loop can be got by maintaining an advancing list of
> candidate keyword locations. By advancing I mean that all words
> referenced from the list are as far from the start of the document as
> logically possible. Say the keywords are a,b,c,d: suppose a,b,c have
> been identified and another b is found - the candidate becomes a,c,b.
> Now a is found and the candidate becomes c,b,a. Any subsequence
> starting with the old word a must be longer than one starting with
> word c. If d is found next, completing the set, the subsequent
> candidate is b,a,d.
>
> Doing this avoids maintaining a list of all keyword locations from the
> first word found in a candidate subsequence and then using a nested
> loop to advance a pointer when a complete set is found. Deletion when
> a duplicate occurs can be done by maintaining references from relevant
> trie nodes to list elements. List -> trie node references are also
> needed to reset a node's reference to null when the tail is deleted
> after a complete set.
>
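If I read the advancing-list idea right, it can be coded with a linked
list plus a map from keyword to its list node. This is my sketch, using
a map where MartinH would hang the references off trie nodes, and it
assumes the keywords are distinct:

#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
using namespace std;

// Keeps each keyword's most recent position in a list ordered
// oldest-first. Whenever the set is complete, [front position,
// current position] is the best candidate window ending here.
pair<int, int> advancingList(const vector<string>& doc,
                             const vector<string>& keywords) {
        typedef list<pair<string, int> > List; // (keyword, latest position)
        List lru;
        unordered_map<string, List::iterator> where;
        for (const string& k : keywords) where[k] = lru.end();

        int bestStart = -1, bestEnd = -1;
        for (int i = 0; i < (int)doc.size(); i++) {
                auto w = where.find(doc[i]);
                if (w == where.end()) continue;        // not a keyword
                if (w->second != lru.end())
                        lru.erase(w->second);          // drop the stale position
                lru.push_back(make_pair(doc[i], i));   // now the most recent
                w->second = prev(lru.end());
                if (lru.size() == keywords.size()) {   // complete set found
                        int start = lru.front().second;
                        if (bestStart < 0 || i - start < bestEnd - bestStart) {
                                bestStart = start;
                                bestEnd = i;
                        }
                }
        }
        return make_pair(bestStart, bestEnd);          // (-1, -1): no window
}

Each document word costs O(1) list and map work, so the scan stays
linear apart from the keyword-identification cost.
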
> Sticker wrote:
>
> > May I draw the final conclusion that the time complexity "must"
> > involve the total length of the keywords as well as the length of
> > the input text, rather than being linear in the input text's length?
>
> Even storing all relevant keyword data in a single pass through the
> document, the case where a keyword is identified is more complex than
> the case where it is not. The actual time complexity is O(n+m+t),
> with t the number of keyword occurrences in the document. But we are
> not magicians; as far as meeting the O(n) requirement goes, we must
> assume that m and t are negligible compared to n. I could perhaps
> write a small application (in Pascal) to test this point. Meanwhile,
> what do you think?
>
> MartinH.
>
> On Oct 1, 3:20 am, Sticker <[EMAIL PROTECTED]> wrote:
>
> > Yes, that is the idea. You assume each keyword is a single
> > character and you use a map to hash the keywords and their counts.
> > Each time you extend the range to the right-hand side, you may
> > increase the counts of some found keywords, and each time you
> > shrink the range from the left-hand side, you decrease the counts
> > of the keywords that fall out of range.
>
> > For the case where keywords are not single characters, you need a
> > more complex hash. Even if keywords are single characters, you need
> > lg(number of keywords) time to locate the position of each keyword
> > in the map, which is how the STL map container is implemented.
>
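For the single-character case the lg(number of keywords) factor
disappears if the map is replaced by a flat 256-entry array. My sketch
of the usual extend/shrink window, not Sticker's or Andrey's code:

#include <utility>
using namespace std;

// Returns the half-open range [first, second) of the shortest window
// of str that contains every character of keys, or (0, 0) if none.
pair<const char*, const char*> minWindow(const char* str, const char* keys) {
        int need[256] = {0}, have[256] = {0};
        int missing = 0;               // distinct keywords not yet covered
        for (const char* k = keys; *k; k++) {
                unsigned char c = *k;
                if (!need[c]) { need[c] = 1; missing++; }
        }

        pair<const char*, const char*> best(0, 0);
        const char* left = str;
        for (const char* right = str; *right; right++) {
                unsigned char c = *right;
                if (++have[c] == need[c]) missing--;  // extend: c now covered
                while (missing == 0) {         // complete: shrink from left
                        if (!best.first ||
                            right + 1 - left < best.second - best.first)
                                best = make_pair(left, right + 1);
                        unsigned char l = *left++;
                        if (--have[l] < need[l]) missing++; // l uncovered
                }
        }
        return best;
}

Every character enters and leaves the window at most once, so the scan
is O(n) with genuinely O(1) updates.
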
> > May I draw the final conclusion that the time complexity "must"
> > involve the total length of the keywords as well as the length of
> > the input text, rather than being linear in the input text's
> > length? (Of course you can say that in the usual case the length of
> > the keywords << the length of the input text. But that assumption
> > does not remove the length of the keywords from the time complexity
> > analysis, and so is not sufficient to make the algorithm linear in
> > the length of the text.) Am I right?
>
> > On Oct 1, 4:10 am, Andrey <[EMAIL PROTECTED]> wrote:
>
> > > Check this program:
>
> > > #include <map>
> > > #include <string>
> > > #include <cstring>
> > > #include <iostream>
> > > #include <algorithm>
> > > #include <iterator>
>
> > > using namespace std;
>
> > > class Range : public pair<const char*, const char*> {
> > >   public:
> > >         size_t size() const { return second - first; }
> > >         Range(const char* f=0, const char * s=0)
> > >                 : pair<const char*, const char*>(f, s) {}
>
> > > };
>
> > > ostream & operator << (ostream & out, const Range& r) {
> > >         out<<"range size: "<<r.size()<<" [";
> > >         copy(r.first, r.second, ostream_iterator<char>(out, ""));
> > >         out<<"]";
> > >         return out;
> > > }
>
> > > typedef map<char, int> SetCnt;
>
> > > // array based implementation will have complexity O(m)
> > > bool checkSet(const SetCnt & setCnt) {
> > >         SetCnt::const_iterator sEnd = setCnt.end();
> > >         for (SetCnt::const_iterator sIt = setCnt.begin();
> > >              sIt != sEnd; ++sIt) {
> > >                 if (sIt->second == 0) {
> > >                         return false; // this keyword not in the window yet
> > >                 }
> > >         }
> > >         return true;
> > > }
>
> > > // array based implementation will have complexity O(1)
> > > bool checkAndShrink(SetCnt & setCnt, char c) {
> > >         SetCnt::iterator sIt = setCnt.find(c);
> > >         if (sIt == setCnt.end()) {
> > >                 return true; // not a keyword, safe to drop from the window
> > >         }
> > >         if (sIt->second > 1) {
> > >                 sIt->second--; // keyword occurs again later in the window
> > >                 return true;
> > >         }
> > >         return false; // would lose the last copy, stop shrinking
> > > }
>
> > > // approximate complexity is O(n*m + m),
> > > //   where n is the size of the input sequence and m is the
> > > //   number of keywords
> > > Range solve(const char * strBegin, const char * strEnd,
> > >                         const char * setBegin, const char * setEnd) {
> > >         SetCnt setCnt;
>
> > >         // complexity O(m)
> > >         for (const char * cIt = setBegin; cIt != setEnd; cIt++ ) {
> > >                 setCnt[*cIt] = 0;
> > >         }
>
> > >         bool solved = false;
> > >         Range range(strBegin, strEnd), result(range);
> > >         for (const char * sIt = strBegin; sIt != strEnd; sIt++ ) {
> > >                 SetCnt::iterator it = setCnt.find(*sIt);
> > >                 if (it != setCnt.end()) {
> > >                         it->second += 1;
> > >                         range.second = sIt + 1;
> > >                         if (checkSet(setCnt)) {
> > >                                 solved = true;
> > >                                 while (checkAndShrink(setCnt, *range.first)) {
> > >                                         range.first++;
> > >                                 }
> > >                                 if (result.size() > range.size()) {
> > >                                         result = range;
> > >                                 }
> > >                                 // Debug
> > >                                 cout<<range<<endl;
> > >                         }
> > >                 }
> > >         }
> > >         return solved ? result : Range(0, 0);
> > > }
>
> > > int main() {
> > >         const char * string = "abrebbqbcdfagbasdfbbaggqqebbcg--324c";
> > >         const char * set = "abc";
>
> > >         Range range = solve(string, string + strlen(string),
> > >                             set, set + strlen(set));
> > >         if (range.size()) {
> > >                 cout<<"The shortest subsequence is: "<<range<<endl;
> > >         } else {
> > >                 cout<<"No solution..."<<endl;
> > >         }
> > >         return 0;
> > > }

