The following is just a problem in computer science. It is not directly related to Perl, or to my work. I am looking for insights in how to think about this.
The input: a list of words. The output: a partitioning of the input list into a longest list of phrases, such that the phrases are in alphabetical order. (Each phrase is one of more consecutive words, and a word is a maximum length sequence of non-space characters.) The following example shows that maximizing the number of phrases may not produce the answer a person would, but it makes the problem solvable by an algorithm that does not have a set of allowed phrases. If there are two or more lists of the same length assume any one will do as the answer. Example input 1: atta boy catch as catch can Example output 1: atta boy catch as catch can I presume this problem is already known to software engineering. What is its name? (For example, other problems are solved by "connected components", or "topological sort", etc.) Here are a few things I know about solving this problem: It has complexity at most O(2^n) because there are at most 2^n partitions. A brute force algorithm would start with the case of having n partitions, where each word is its own phrase. If this is in alphabetical order we are done. Otherwise try all the cases where there are n-1 partitions, then n-2 partitions, etc. (This algorithm would probably be OK for lists with a reasonable number of words. I cannot estimate the maximum number of word or phrases it could handle on a PC.) Is there a deterministic algorithm in a lower complexity class? I would be happy with a heuristic approach that did pretty well. One possible "score" for determining whether to start a phrase at a word has two components: 1. Its position is the list. A lower position is better, because we want many phrases. This is easily precomputed by a O(n) pass over the list. 2. Its alphabetical order in the list. Again a lower number is better. This can be computed one time in advance by an n log(n) sort. Then maybe something like alpha-beta pruning (a la chess) could be used to evaluate the best position to introduce a phrase. Once we have a phrase starting with some string $s then all words $w to the right of $s such that $w le $s cannot start a phrase. The first phrase always starts with the first word. So we can immediately mark words alphabetically lt this as not able to start a phrase, e.g. "as" in the example above is lt "atta". A heuristic approach might take advantage of this. Is there a greedy approach (one that never backtracks) that emits a reasonable output? P.S. In case anyone is interested in actually writing code to solve this, the alphabetical order is case insensitive. The origin of the problem was doing a View Source on a web page that had a large drop down list, and wanting to reconstruct the list of phrases. P.P.S. Congratulations to Ronald. I predict that in 17 or 18 years he will be helping (or nagging) Tobias with getting his college application material done. The time does fly by. Steve Tolkin _______________________________________________ Boston-pm mailing list Boston-pm@mail.pm.org http://mail.pm.org/mailman/listinfo/boston-pm