Hi Andrey, On Oct 8, 7:56 pm you wrote:
> ... Enumerating of the words makes no sense. Agreed. > > ... As Vishal suggested a trie looks more realistic. Building the > > trie can be done O(m), with m - total characters in keywords. > > Identifying whether a document character is part of a keyword > > candidate is then O(1). > > What is 'trie' ? > What are you trying to do? Find keyword in input sequence? They are > separated by spaces and punctuations, aren't they? No, I am not trying to do that at all. The trie is built to contain only keywords. It can then be used to answer the question for the current character 'Is this character part of a candidate keyword?' and do this O(1). Candidate keywords are identified initially by the trei root returning a reference to the node representing the first character of the keyword rather than a null. (or, if you prefer, the node linked to the root by that character's edge) > > You asked for O(n) on the length of the sequence... > > ... this is Big O so we can argue that > > O(n)<=>O(doc_words) i.e. n = K(doc_words) with K average wordlength in > > document. > > If n is size of input sequence, d - number of words in input sequence > and n ~ K d then > O(n) ~ O(d). Thanks for that. > > ... Andrey's listing > > has a nested loop to do trimming that rechecks words already checked > > in the main loop. Also the algorithm has a tendency to find longer and > > longer sequences with the same start word that clearly cannot provide > > a new minimum. > > Completely wrong! the algorithm is based on the following invariant: > - It always keeps the shortest possible subsequence that ends at the > current position. > Therefore it will eventually find the shortest one in entire sequence. Calm down. I am not asserting or even suggesting that your algorithm does not find the shortest subsequence in all cases. Forget the nested loop. A subsequence start pointer is needed and it is probably irrelevant whether this is updated in the main loop or a sub-loop. However, compiling and running the listing I got output shown below with initial complete sequence added, subsequences offset to align, and subsequence sizes etc. omitted: abrebbqbcdfagbasdfbbaggqqebbcg--324c abrebbqbc bcdfa cdfagb cdfagba cdfagbasdfb cdfagbasdfbb cdfagbasdfbba cdfagbasdfbbaggqqeb cdfagbasdfbbaggqqebb aggqqebbc aggqqebbcg--324c The shortest subsequence is: range size: 5 [bcdfa] This is what I meant by longer and longer subsequences. For example, once the subsequence cdfagb has been identified it is then impossible that a shorter subsequence exists that starts with the same word 'c' at the same location in the complete sequence. Logically the next possible shorter subsequence starts with the next word 'a'. Checks (result.size() > range.size()) on other 'c' start subsequences are redundant. > > I'm no expert but does this give O(n+m)? > > Yes, and I professed that in listing. But O(n + m) is still O(n) > > Where > n - size of input sequence > m - number of keywords O.k, still O(n), but only because redundant checks ~ m in the worst case, or I think they do. > > I could perhaps do a small application (in Pascal) to test this point. > > Meanwhile what do you think? > > How is your program? Can we have a glimpse of that? Thank you for asking. I was not going to spend time on this if nobody was interested. I will make a start this evening and make sure there is sufficient debug output that you all can criticise it without ploughing through the source code. MartinH. --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Algorithm Geeks" group. To post to this group, send email to algogeeks@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/algogeeks -~----------~----~----~----~------~----~------~--~---