{I already posted this yesterday but did not appear: apologies if
duplicate results}

This looks to be a hard to solve and nastily realistic problem. By
nastily realistic I mean it looks like the kind of thing we might be
asked to code by lunchtime tomorrow.

As daizisheng wrote there is nothing wrong with perfect hashing to
identify keywords, but it may be impractical to implement. With just
'a'..'z' allowed in keywords, max data compression and max keyword
length 10, the hash table needs 2^48 entries. With full 8 bit or ISO
characters its size becomes completely untenable. Obviously not all
locations represent real words but any hash function that takes this
into account to compress the range won't be O(1) and/or can't be
guaranteed perfect.

Similar objections to using a hash table to assign integers to words.
If collisions are allowed, not 0(1). Might just as well have hashed in
the keywords first and accepted worse than O(1). As Vishal suggested a
trie looks more realistic. Building the trie can be done O(m), with m
- total characters in keywords. Identifying whether a document
character is part of a keyword candidate is then O(1).

You asked for O(n) on the length of the sequence. I think this can be
taken as n, the number of characters in the document. Fair because we
have to iterate through all the characters to find the position of
words and keywords. Anyway, this is Big O so we can argue that
O(n)<=>O(doc_words) i.e. n = K(doc_words) with K average wordlength in
document.

Having got O(1) for keyword identification, to get O(n) we need a
scanning algorithm containing a simple loop that advances a pointer to
the next current character in the document, or do we? Andrey's listing
has a nested loop to do trimming that rechecks words already checked
in the main loop. Also the algorithm has a tendency to find longer and
longer sequences with the same start word that clearly cannot provide
a new minimum. I'm no expert but does this give O(n+m)?

A simple loop can be got by maintaining an advancing list of
candidate  keyword locations. By advancing I mean that all words
referenced from the list are as far from the start of the document as
logically possible. Say keywords are a,b,c,d: suppose a,b,c have been
identified and another b is found - the candidate becomes a,c,b. Now a
is found and the candidate becomes c,b,a. Any subsequence starting
with the old word a, must be longer that one starting with word c. If
d is found next, completing the set, the subsequent candidate is
b,a,d.

Doing this avoids maintaining a list of all keyword locations from the
first word found in a candidate subsequence and then using a nested
loop to advance a pointer when a complete set is found. Deletion when
a duplicate occurs can be done by maintaining references from relevant
trie nodes to list elements. List -> trie node references are also
needed to reset a node's reference to null when the tail is deleted
after a complete set.

Sticker wrote

> May I draw the final conclusion that the final time complexity "must"
> have to involve the total length of keywords and the length of the
> input text, rather than linear to the input text's length?

Even storing all relevant keyword data in a single pass through the
document, a case when a keyword is identified is more complex than
otherwise. Actual time complexity is O(n+m+t) - t number of keywords
in document. But we are not magicians, as far as meeting the O(n)
requirement we must assume that m and t are negligible compared to n.
I could perhaps do a small application (in Pascal) to test this point.
Meanwhile what do you think?

MartinH.

On Oct 1, 3:20 am, Sticker <[EMAIL PROTECTED]> wrote:
> En, it is the idea. You assume each keyword is a single character and
> you use a map to hash key words and their counts. Each time you extend
> the range to right hand side, you may increase the counts of some
> found key words and each time you shrink the range from the left hand
> side, you decrease the counts of some out-of-range key words.
>
> For the case that keywords are not single characters, you need a
> complex hash. Even if keywords are single characters, you need
> lg(number of keywords) to locate the position of each keyword in the
> map,which is implemented in the STL map container.
>
> May I draw the final conclusion that the final time complexity "must"
> have to involve the total length of keywords and the length of the
> input text, rather than linear to the input text's length? (Of course
> you can say that in usual case the length of keywords << length of
> input text. But this assumption should not remove the length of
> keywords in the time complexity analysis thus should not sufficient to
> make the algorithm works linear to the length of text.) Am I right?
>
> On Oct 1, 4:10 am, Andrey <[EMAIL PROTECTED]> wrote:
>
>
>
> > Check this program:
>
> > #include <map>
> > #include <string>
> > #include <iostream>
> > #include <algorithm>
> > #include <iterator>
>
> > using namespace std;
>
> > class Range : public pair<const char*, const char*> {
> >   public:
> >         size_t size() const { return second - first; }
> >         Range(const char* f=0, const char * s=0)
> >                 : pair<const char*, const char*>(f ,s) {}
>
> > };
>
> > ostream & operator << (ostream & out, const Range& r) {
> >         out<<"range size: "<<r.size()<<" [";
> >         copy(r.first, r.second, ostream_iterator<char>(out, ""));
> >         out<<"]";
> >         return out;
>
> > }
>
> > typedef map<char, int> SetCnt;
>
> > // array based implementation will have complexity O(m)
> > bool checkSet(const SetCnt & setCnt) {
> >         SetCnt::const_iterator sIt = setCnt.begin(), sEnd = setCnt.end();
> >         for (SetCnt::const_iterator sIt = setCnt.begin(); sIt != sEnd; +
> > +sIt ) {
> >                 if (sIt->second == 0) {
> >                         return false;
> >                 }
> >         }
> >         return true;
>
> > }
>
> > // array based implementation will have complexity O(1)
> > bool checkAndShrink(SetCnt & setCnt, char c) {
> >         SetCnt::iterator sIt = setCnt.find(c);
> >         if (sIt == setCnt.end()) {
> >                 return true;
> >         }
> >         if (sIt->second > 1) {
> >                 sIt->second--;
> >                 return true;
> >         }
> >         return false;
>
> > }
>
> > // aproximate complexity is O(n*m + m)
> > //   where n - is size of input sequence and m - is number of key
> > words.
> > Range solve(const char * strBegin, const char * strEnd,
> >                         const char * setBegin, const char * setEnd) {
> >         SetCnt setCnt;
>
> >         // complexity O(m)
> >         for (const char * cIt = setBegin; cIt != setEnd; cIt++ ) {
> >                 setCnt[*cIt] = 0;
> >         }
>
> >         bool solved = false;
> >         Range range(strBegin, strEnd), result(range);
> >         for (const char * sIt = strBegin; sIt != strEnd; sIt++ ) {
> >                 SetCnt::iterator it = setCnt.find(*sIt);
> >                 if (it != setCnt.end()) {
> >                         setCnt[*sIt] += 1;
> >                         range.second = sIt + 1;
> >                         if (checkSet(setCnt)) {
> >                                 solved = true;
> >                                 while (checkAndShrink(setCnt, 
> > *range.first)) {
> >                                         range.first++;
> >                                 }
> >                                 if (result.size() > range.size()) {
> >                                         result = range;
> >                                 }
> >                                 // Debug
> >                                 cout<<range<<endl;
> >                         }
> >                 }
> >         }
> >         return solved ? result : Range(0, 0);
>
> > }
>
> > int main() {
> >         const char * string = "abrebbqbcdfagbasdfbbaggqqebbcg--324c";
> >         const char * set = "abc";
>
> >         Range range = solve(string, string + strlen(string), set, set +
> > strlen(set));
> >         if (range.size()) {
> >                 cout<<"The shortes subsequence is: "<<range<<endl;
> >         } else {
> >                 cout<<"No solution..."<<endl;
> >         }
> >         return 0;
>
> > }- Hide quoted text -
>
> - Show quoted text -


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to algogeeks@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/algogeeks
-~----------~----~----~----~------~----~------~--~---

Reply via email to