The following is just a problem in computer science.  It is not directly
related to Perl, or to my work.  I am looking for insights in how to
think about this.  

The input: a list of words.
The output: a partitioning of the input list into a longest list of
phrases, such that the phrases are in alphabetical order.  (Each phrase
is one of more consecutive words, and a word is a maximum length
sequence of non-space characters.) 

The following example shows that maximizing the number of phrases may
not produce the answer a person would, but it makes the problem solvable
by an algorithm that does not have a set of allowed phrases.  If there
are two or more lists of the same length assume any one will do as the
answer.

Example input 1: atta boy catch as catch can

Example output 1: 
atta
boy
catch as
catch can

I presume this problem is already known to software engineering.  What
is its name?  (For example, other problems are solved by "connected
components", or "topological sort", etc.)  

Here are a few things I know about solving this problem: 

It has complexity at most O(2^n) because there are at most 2^n
partitions.  A brute force algorithm would start with the case of having
n partitions, where each word is its own phrase.  If this is in
alphabetical order we are done.  Otherwise try all the cases where there
are n-1 partitions, then n-2 partitions, etc.  (This algorithm would
probably be OK for lists with a reasonable number of words.  I cannot
estimate the maximum number of word or phrases it could handle on a PC.)
Is there a deterministic algorithm in a lower complexity class?

I would be happy with a heuristic approach that did pretty well.  One
possible "score" for determining whether to start a phrase at a word has
two components:
1. Its position is the list.  A lower position is better, because we
want many phrases.  This is easily precomputed by a O(n) pass over the
list.
2. Its alphabetical order in the list.  Again a lower number is better.
This can be computed one time in advance by an n log(n) sort.
Then maybe something like alpha-beta pruning (a la chess) could be used
to evaluate the best position to introduce a phrase.

Once we have a phrase starting with some string $s then all words $w to
the right of $s such that $w le $s cannot start a phrase.  The first
phrase always starts with the first word.  So we can immediately mark
words alphabetically lt this as not able to start a phrase, e.g. "as" in
the example above is lt "atta".  A heuristic approach might take
advantage of this.

Is there a greedy approach (one that never backtracks) that emits a
reasonable output?

P.S. In case anyone is interested in actually writing code to solve
this, the alphabetical order is case insensitive.   The origin of the
problem was doing a View Source on a web page that had a large drop down
list, and wanting to reconstruct the list of phrases.

P.P.S.  Congratulations to Ronald.  I predict that in 17 or 18 years he
will be helping (or nagging) Tobias with getting his college application
material done.   The time does fly by.


Steve Tolkin



_______________________________________________
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm

Reply via email to