--- Chris Devers <[EMAIL PROTECTED]> wrote:
> Do you need to handle ambiguities? For example,
> "-ough" can famously be pronounced several ways:

The way it's set up now can't deal with them, but I'm about to rewrite the thing to handle more than one segment having the same orthographic representation. Output can be ambiguous without any problems so long as a given bundle of features has only one representation. But the input needs to be unambiguous on some level, so you'd want to either assume all "-ough" are the same phoneme and that pronunciation depends on context (a bad assumption in this case), or you'd want to find some way to differentiate them (e.g. by typing your string in as IPA rather than English, or by "making up" segments like "ough{ow}", "ough{off}", etc). Generally, it should be able to be swept under the carpet of predetermined lists.

> Okay, but I still think that attacking this
> problem will be easier if you start out with
> these elements in a normal, hand-ordered list,
> and then pre-populate the keys of one or more
> hashes based on that.

My current approach sorts segments by length (long to short), then alphabetically. This allows users to list the segments in any order they please. Maybe they want to list "sch" with "ch" in their German alphabet, or maybe they want to list "sch" with "s"; I think the program should be agnostic as to the order in which they're stored in the file. The only reason I can think of for hand-ordering is if that order represents information that can't otherwise be calculated (such as ordering them by their frequency in the given language in order to speed up parsing input).

This is what I'm currently using to get from $string to @segments:

    my (@length, @letters, @segments, $letters);

    # Bucket the segments by length.
    foreach (keys %{$this->{'alphabet'}}) {
        push @{$length[length($_)-1]}, $_;
    }
    # Longest first, alphabetical within each length.
    for (reverse [EMAIL PROTECTED]) {
        push @letters, sort @{$length[$_]};
    }
    # quotemeta keeps characters like "{" in a segment literal.
    $letters = '(' . join('|', map { quotemeta } @letters) . ')';

    # Test the length, so that a remaining "0" doesn't end the loop early.
    while (length $string) {
        if ($string =~ s/^$letters//) {
            push @segments, $1;
        }
        else {
            return &error("can't parse next segment in '$string'");
        }
    }

Using substr() might speed things up, but for now I'm just hoping to get it all to work.

> I have a feeling that decomposing the string
> into an array of characters might help

In short, that's what I'm doing, only replacing the word "characters" with "segments" (which are generally one character long, but not always).

I'll look into Parse::RecDescent, but so far I seem unable to grok what I've read about it.

~wren
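
For comparison, here is a minimal sketch of how the substr() idea mentioned above might look. It is only an illustration, not what the program currently does: the function name segment_string and the small German-like alphabet are made up for the example, and it dies on unparseable input instead of calling the error routine. It keeps the same longest-match-first behaviour, but looks candidates up in a hash instead of building one large regex.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original program): split $string into
# segments, always preferring the longest segment that matches at the
# current position.
sub segment_string {
    my ($string, @alphabet) = @_;

    # Group the segments by length so the longest candidates are tried first.
    # Empty segments are skipped, since they could never advance the position.
    my %by_length;
    $by_length{ length($_) }{$_} = 1 for grep { length($_) } @alphabet;
    my @lengths = sort { $b <=> $a } keys %by_length;

    my @segments;
    my $pos = 0;
    POSITION: while ($pos < length($string)) {
        for my $len (@lengths) {
            next if $pos + $len > length($string);
            my $candidate = substr($string, $pos, $len);
            if ($by_length{$len}{$candidate}) {
                push @segments, $candidate;
                $pos += $len;
                next POSITION;
            }
        }
        # No segment of any length matched at this position.
        die "can't parse next segment in '" . substr($string, $pos) . "'\n";
    }
    return @segments;
}

# Made-up alphabet, listed in no particular order.
my @alphabet = qw(s c h ch sch u l e);
print join(' ', segment_string('schule', @alphabet)), "\n";   # prints "sch u l e"
```

Because the lookup is a plain hash test rather than a regex, segment names containing characters such as "{" (the made-up "ough{ow}" style) need no escaping, and the order in which the alphabet is listed still doesn't matter.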