To Vishal: My idea is similar to yours. I like to use hash table as
well. But I wonder which hash function can you use to insert and find
keywords with O(1) time? Keywords are not single characters. They are
normal words. That's basically what I am aftering.
On Sep 25, 2:11 pm, Mayur [EMAIL PROTECTED] wrote:
Another possibility is to first pre-process the keywords into
automaton-like structure (Google for Aho Corasick string matcher), and
then use it over the document. This would probably be helpful only if
the number of keywords is much smaller than the document itself..
On 9/25/07, daizisheng [EMAIL PROTECTED] wrote:
Vishal 写道:
Hash table should give you O(1) insertion and search complexity; which
is what we need, right?
There is no constraint on space complexity, I believe.
On 9/24/07, * daizisheng* [EMAIL PROTECTED]
mailto:[EMAIL PROTECTED] wrote:
the problem is you need a hash table to maintain all the keywords,:)
because they do not have to be a single characters, or you can
store them in
array, but then you need binary search,:)
Vishal 写道:
How about keeping two pointers - startp and endp.
Keep a count of frequencies of keywords between startp and endp,
both
inclusive. We can use an array / hash table for this.
Now, the minimum length substring has to start with a keyword
and has
to end with a keyword.
1. Initially startp=0 and endp=1. L = infinity
2. Increment endp till you encounter a keyword or it exceeds the
total
length. Update frequencies. Check if you have all the keywords.
(This
can be done in O(1) using a bitmap or similar). If it exceeds the
total length, QUIT.
3. If the str(startp,endp) contains all keywords and length L,
save
values of startp and endp.
4. Now, increment startp, till you get a keyword. If the
str(startp,endp) still contains all keywords, update saved values of
startp and endp.
5. Repeat step 4 till str(startp,endp) does NOT contain all
keywords.
6. Goto step 2.
The final values of startp and endp should give you the minimum
summary.
Since both values go from 0 to N-1, its O(N).
~Vishal
On 9/24/07, *daizisheng* [EMAIL PROTECTED]
mailto:[EMAIL PROTECTED]
mailto:[EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote:
I think hash method is ok, at lease in expectation way it's
O(n)
why not use it? it's very effeciently
I think there should be some worst case O(n) algorithm, but
it will be
more complex and not as effecient as the above one in practise
Sticker 写道:
Declare: this question is originally from Google. The
original form
is: given a document, how to find a shortest summary
containing all
the key words. After translated, it will be: given a
sequence,
how to
find a shortest substring that contains all the items
required.
Example: a sequence abaccdefgacel and a set of key words
a, c,
d and e. The shortest substring contianing all of the key
words is
accde. Find one of such shortest substrings. In the original
question, there is time complexity boundary O(N), where N
is the
length of the sequence.
To me, this problem is rather questionable. So far my solution
requires a hash table that gives no conflict and can locate a
key word
in O(1) time. Otherwise the time complexity will
definitely be
related
to the number of key words.
Does anyone have idea for a O(N) algorithm to solve this
problem?
yes, that's we need. but seems the starter of this thread who did not
like hash,:)
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups
Algorithm Geeks group.
To post to this group, send email to algogeeks@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/algogeeks
-~--~~~~--~~--~--~---