Hi Andrey,

On Oct 8, 7:56 pm you wrote:

> ... Enumerating of the words makes no sense.

Agreed.

> > ... As Vishal suggested a trie looks more realistic. Building the
> > trie can be done O(m), with m - total characters in keywords.
> > Identifying whether a document character is part of a keyword
> > candidate is then O(1).
>
> What is 'trie' ?
> What are you trying to do? Find keyword in input sequence? They are
> separated by spaces and punctuations, aren't they?

No, I am not trying to do that at all. The trie is built to contain
only keywords. It can then be used to answer the question for the
current character 'Is this character part of a candidate keyword?' and
do this O(1). Candidate keywords are identified initially by the trei
root returning a reference to the node representing the first
character of the keyword rather than a null. (or, if you prefer, the
node linked to the root by that character's edge)

> > You asked for O(n) on the length of the sequence...
> >                     ... this is Big O so we can argue that
> > O(n)<=>O(doc_words) i.e. n = K(doc_words) with K average wordlength in
> > document.
>
> If n is size of input sequence, d - number of words in input sequence
> and n ~ K d then
> O(n) ~ O(d).

Thanks for that.

> > ... Andrey's listing
> > has a nested loop to do trimming that rechecks words already checked
> > in the main loop. Also the algorithm has a tendency to find longer and
> > longer sequences with the same start word that clearly cannot provide
> > a new minimum.
>
> Completely wrong! the algorithm is based on the following invariant:
>   - It always keeps the shortest possible subsequence that ends at the
> current position.
> Therefore it will eventually find the shortest one in entire sequence.

Calm down. I am not asserting or even suggesting that your algorithm
does not find the shortest subsequence in all cases. Forget the nested
loop. A subsequence start pointer is needed and it is probably
irrelevant whether this is updated in the main loop or a sub-loop.

However, compiling and running the listing I got output shown below
with initial complete sequence added, subsequences offset to align,
and subsequence sizes etc. omitted:

   abrebbqbcdfagbasdfbbaggqqebbcg--324c
   abrebbqbc
          bcdfa
           cdfagb
           cdfagba
           cdfagbasdfb
           cdfagbasdfbb
           cdfagbasdfbba
           cdfagbasdfbbaggqqeb
           cdfagbasdfbbaggqqebb
                       aggqqebbc
                       aggqqebbcg--324c
   The shortest subsequence is: range size: 5 [bcdfa]

This is what I meant by longer and longer subsequences. For example,
once the subsequence cdfagb has been identified it is then impossible
that a shorter subsequence exists that starts with the same word 'c'
at the same location in the complete sequence. Logically the next
possible shorter subsequence starts with the next word 'a'. Checks
(result.size() > range.size()) on other 'c' start subsequences are
redundant.

> > I'm no expert but does this give O(n+m)?
>
> Yes, and I professed that in listing. But O(n + m) is still O(n)
>
> Where
>   n  - size of input sequence
>   m - number of keywords

O.k, still O(n), but only because redundant checks ~ m in the worst
case, or I think they do.

> > I could perhaps do a small application (in Pascal) to test this point.
> > Meanwhile what do you think?
>
> How is your program? Can we have a glimpse of that?

Thank you for asking. I was not going to spend time on this if nobody
was interested. I will make a start this evening and make sure there
is sufficient debug output that you all can criticise it without
ploughing through the source code.

MartinH.


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to algogeeks@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/algogeeks
-~----------~----~----~----~------~----~------~--~---

Reply via email to