Re: [Boston.pm] software puzzle - extracting longest alphabetical list of phrases from a list of words

Ben Tilly Sun, 26 Oct 2008 19:59:39 -0700

On Sun, Oct 26, 2008 at 5:05 PM, Ben Tilly <[EMAIL PROTECTED]> wrote:
> On Sun, Oct 26, 2008 at 10:21 AM, Tolkin, Steve <[EMAIL PROTECTED]> wrote:
>> The following is just a problem in computer science.  It is not directly
>> related to Perl, or to my work.  I am looking for insights in how to
>> think about this.
>>
>> The input: a list of words.
>> The output: a partitioning of the input list into a longest list of
>> phrases, such that the phrases are in alphabetical order.  (Each phrase
>> is one of more consecutive words, and a word is a maximum length
>> sequence of non-space characters.)
> [...]
>
>> I presume this problem is already known to software engineering.  What
>> is its name?  (For example, other problems are solved by "connected
>> components", or "topological sort", etc.)
> [...]
>
> The algorithm at
> http://www.algorithmist.com/index.php/Longest_Increasing_Subsequence
> can be adapted to this problem.  You will need small modifications to
> take into account the fact that "foo bar" is alphabetically before
> "foo baz" even though the element "foo" is the same both times.


Hrm, returning to this problem I see that I was too fast in foisting
you off to some documentation.  From your point of view the
modification to be able to take the dataset:

foo foo foo foo foo foo foo foo foo foo

and get out the increasing subsequence

foo
foo foo
foo foo foo
foo foo foo foo

is not likely to be obvious.

Also the first element has to start the first phrase, and therefore
has to start the subsequence of interest.  That's an important
restriction.

Now you say you just want insights, so I don't want to give you the
full answer.  So I'll work to avoid that.  The hint is that for each
list element you want to know the length of the longest ascending
sequence of phrases is to starting the next phrase at that list
element, and what the previous phrase started on.  If you have this
data structure built up, then you just look through it for the longest
string of phrases to the final phrase.  Then read off you answer
backwards - you know were the final phrase of your longest sequence
starts, that tells you where the next to last one starts, then the one
before that, and so on.

The nasty complication is that you need to not only consider the list
element starting that phrase but also the minimal length of the phrase
that needs to start at that element for the phrases to be in ascending
order.  (Consider the "foo foo foo..." example to see how the next
phrase starting at an element may have a minimum length.)  So if list
element 20 can be the next phrase of length 1 or more after a sequence
of 3 phrases, or the next phrase of length 3 or more after a sequence
of 4 phrases, either of those partial sequences could go on to be your
next answer so you have to keep track of both of them.  But often you
don't have to keep track of more than one.  For instance if list
element 25 could be the start of next phrase of length 1+ after a
sequence of 5 phrases , or the start of the next phrase of length 4+
after a sequence of 5 phrases, or the start of the next phrase of
length 2+ after a sequence of 4 phrases, then you only need to track
the first possible sequence to that point.  (Unless, of course, you
wanted to be able to enumerate all of the sequences of phrases of
maximal length.  In that case you'd want to track the first and second
possibilities.  But the third can be ignored.)

I'll leave it to you to figure out a data structure that can properly
track this information, and the development of an algorithm that can
track it.  But I will tell you that there is an algorithm that is
usually O(n*n) for this problem.  (The "foo foo foo..." example is a
worst case scenario - it is slightly worse for that.  I'm too lazy to
figure it out properly, but I'd guess it to be no worse than about
O(n*n*n).)

Cheers,
Ben

_______________________________________________
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm

Re: [Boston.pm] software puzzle - extracting longest alphabetical list of phrases from a list of words

Reply via email to