reposting since it didn't appear yesterday, apologies
A small variation of Vishal's algorithm that eliminates the need for the
bitmap. I use a hash table of integers indexed by keyword, initialized
to all 0. I also make use of the property that in the shortest summary
the first and the last keyword each appear exactly once.
1. foreach word in document
       if (word is a keyword)
           hash[word]++;   // by the end of this loop we have the
                           // frequency of every keyword
2. Do a preliminary check that every hash[keyword] is > 0. If at least
   one has frequency 0, stop and return null to indicate that no
   summary exists.
3. startp = pointer to first word, endp = pointer to last word
4. for (; startp != endp; startp++)
       if (*startp is not a keyword)
           continue;               // non-keywords can always be trimmed
       if (hash[*startp] == 1)     // stop at a keyword whose remaining
           break;                  // frequency is 1
       hash[*startp]--;
   // by the end of this loop, startp points to the first word of the
   // summary, i.e. a keyword that appears in it exactly once
5. for (; startp != endp; endp--)
       if (*endp is not a keyword)
           continue;
       if (hash[*endp] == 1)       // stop at a keyword whose remaining
           break;                  // frequency is 1
       hash[*endp]--;
   // by the end of this loop, endp points to the last word of the summary
6. return the summary, which runs from startp to endp
Worst case is O(2N) = O(N). Also, if more than one shortest summary is
present, this will return the last one.
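For concreteness, here is a minimal C++ sketch of the trimming steps
above. The function name and the std::unordered_map/std::optional
choices are mine; the description only assumes some O(1) hash table.

    #include <cstddef>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    // Steps 1-6 above: count keyword frequencies, then trim redundant
    // words from both ends until each boundary word is a keyword whose
    // remaining frequency is 1.
    std::optional<std::pair<std::size_t, std::size_t>>
    shortestSummary(const std::vector<std::string>& doc,
                    const std::unordered_set<std::string>& keywords)
    {
        if (doc.empty()) return std::nullopt;

        std::unordered_map<std::string, int> freq;       // step 1
        for (const auto& w : doc)
            if (keywords.count(w))
                ++freq[w];

        if (freq.size() != keywords.size())              // step 2: a keyword
            return std::nullopt;                         // never occurs

        std::size_t start = 0, end = doc.size() - 1;     // step 3
        for (; start != end; ++start) {                  // step 4: trim front
            auto it = freq.find(doc[start]);
            if (it == freq.end()) continue;              // non-keyword: trimmable
            if (it->second == 1) break;                  // essential occurrence
            --it->second;
        }
        for (; start != end; --end) {                    // step 5: trim back
            auto it = freq.find(doc[end]);
            if (it == freq.end()) continue;
            if (it->second == 1) break;
            --it->second;
        }
        return std::make_pair(start, end);               // step 6
    }

Each trim loop visits every word at most once, which is where the
O(2N) bound comes from.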
-Shrenik
On Oct 5, 6:34 am, MartinH [EMAIL PROTECTED] wrote:
{I already posted this yesterday but it did not appear: apologies if
this is a duplicate}
This looks to be a hard-to-solve and nastily realistic problem. By
nastily realistic I mean it looks like the kind of thing we might be
asked to code by lunchtime tomorrow.
As daizisheng wrote, there is nothing wrong with perfect hashing to
identify keywords, but it may be impractical to implement. With just
'a'..'z' allowed in keywords, maximal data compression and a maximum
keyword length of 10, the hash table needs about 2^48 entries. With
full 8-bit or ISO character sets its size becomes completely untenable.
Obviously not all locations represent real words, but any hash function
that takes this into account to compress the range won't be O(1) and/or
can't be guaranteed perfect.
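To see where the 2^48 comes from: a maximally compressed index can
treat each of the up-to-10 positions as one of 27 values ('a'..'z' or
absent), and 27^10 ~ 2^47.6 just fits in 48 bits. A sketch, with the
function name my own:

    #include <cstdint>
    #include <string>

    // Injective index for lowercase words of length <= 10: digits 1..26
    // in base 27, with an implicit 0 for "no character", so e.g. "a" and
    // "aa" never collide. The largest value is 27^10 - 1 < 2^48.
    std::uint64_t compressedIndex(const std::string& word)
    {
        std::uint64_t idx = 0;
        for (char c : word)
            idx = idx * 27 + static_cast<std::uint64_t>(c - 'a' + 1);
        return idx;
    }

A table indexed this way is a perfect hash with O(1) lookups, but it
needs all 2^48 slots, which is exactly the size objection above.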
Similar objections apply to using a hash table to assign integers to
words. If collisions are allowed, it is not O(1); we might just as well
have hashed in the keywords first and accepted worse than O(1). As
Vishal suggested, a trie looks more realistic. Building the trie can be
done in O(m), with m the total number of characters in the keywords.
Identifying whether a document character is part of a keyword candidate
is then O(1).
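As a concrete illustration of the trie idea (the node layout and names
are mine, assuming lowercase keywords):

    #include <memory>
    #include <string>

    // One node per keyword prefix. Building the trie visits each keyword
    // character once, giving the O(m) construction cost.
    struct TrieNode {
        std::unique_ptr<TrieNode> child[26];
        bool isKeyword = false;
    };

    void insert(TrieNode& root, const std::string& keyword)
    {
        TrieNode* n = &root;
        for (char c : keyword) {
            auto& next = n->child[c - 'a'];
            if (!next) next = std::make_unique<TrieNode>();
            n = next.get();
        }
        n->isKeyword = true;
    }

    // While scanning the document, advance one node per character: O(1).
    // A null result means the current run of letters cannot be a keyword.
    const TrieNode* step(const TrieNode* n, char c)
    {
        return n ? n->child[c - 'a'].get() : nullptr;
    }

When a word ends on a node with isKeyword set, a keyword has been
identified; the per-character cost stays constant no matter how many
keywords there are.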
You asked for O(n) in the length of the sequence. I think n can be
taken as the number of characters in the document. That is fair because
we have to iterate through all the characters to find the positions of
words and keywords. Anyway, this is Big O, so we can argue that
O(n) = O(doc_words), i.e. n = K * doc_words with K the average word
length in the document.
Having got O(1) for keyword identification, to get O(n) we need a
scanning algorithm containing a simple loop that advances a pointer to
the next current character in the document - or do we? Andrey's listing
has a nested loop to do trimming that rechecks words already checked in
the main loop. The algorithm also has a tendency to find longer and
longer sequences with the same start word that clearly cannot provide a
new minimum. I'm no expert, but does this give O(n+m)?
A simple loop can be had by maintaining an advancing list of candidate
keyword locations. By advancing I mean that all words referenced from
the list are as far from the start of the document as logically
possible. Say the keywords are a,b,c,d: suppose a,b,c have been
identified and another b is found - the candidate list becomes a,c,b.
Now a is found and the candidate list becomes c,b,a. Any subsequence
starting with the old word a must be longer than one starting with word
c. If d is found next, completing the set, the subsequent candidate
list is b,a,d.
Doing this avoids maintaining a list of all keyword locations from the
first word found in a candidate subsequence and then using a nested
loop to advance a pointer when a complete set is found. Deletion when a
duplicate occurs can be done by maintaining references from the
relevant trie nodes to the list elements. List-to-trie-node references
are also needed to reset a node's reference to null when the oldest
element is deleted after a complete set. A sketch of this bookkeeping
follows.
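A minimal sketch of the advancing list, substituting a map from keyword
to list node for the trie-node references described above; the names
and containers are my own:

    #include <cstddef>
    #include <iterator>
    #include <list>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    // One simple pass. `recent` holds, in arrival order, the most recent
    // position of each keyword seen so far; `where` lets a stale position
    // be erased in O(1) when its keyword reappears, which is what keeps
    // the list "advancing".
    std::optional<std::pair<std::size_t, std::size_t>>
    shortestWindow(const std::vector<std::string>& doc,
                   const std::unordered_set<std::string>& keywords)
    {
        using Entry = std::pair<std::string, std::size_t>; // (keyword, position)
        std::list<Entry> recent;
        std::unordered_map<std::string, std::list<Entry>::iterator> where;
        std::optional<std::pair<std::size_t, std::size_t>> best;

        for (std::size_t i = 0; i < doc.size(); ++i) {
            if (!keywords.count(doc[i])) continue;         // not a keyword
            auto w = where.find(doc[i]);
            if (w != where.end()) recent.erase(w->second); // drop old position
            recent.emplace_back(doc[i], i);
            where[doc[i]] = std::prev(recent.end());
            if (recent.size() == keywords.size()) {        // complete set
                std::size_t start = recent.front().second; // oldest entry
                if (!best || i - start < best->second - best->first)
                    best = std::make_pair(start, i);
            }
        }
        return best;   // word-index window, if any complete set was seen
    }

Every keyword occurrence does O(1) hash and list work, so the scan is a
single loop with no nested trimming; only the keyword identification
cost sits on top of it.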
Sticker wrote:
  May I draw the final conclusion that the time complexity must involve
  the total length of the keywords as well as the length of the input
  text, rather than being linear in the input text's length?
Even when storing all relevant keyword data in a single pass through
the document, the case where a keyword is identified is more complex
than the case where one is not. The actual time complexity is O(n+m+t),
with t the number of keyword occurrences in the document. But we are
not magicians: as far as meeting the O(n) requirement goes, we must
assume that m is O(n).