Re: tricky parsing question

wren argetlahm Fri, 23 Jan 2004 17:44:49 -0800

--- Chris Devers <[EMAIL PROTECTED]> wrote:
> Do you need to handle ambiguities? For example,
> "-ough" can famously be pronounced several ways:


The way it's set up now can't deal with them, but I'm
about to rewrite the thing to handle more than one
segment having the same orthographic representation.
Output can be ambiguous without any problems so long
as a given bundle of features has only one
representation. But the input needs to be
non-ambiguous on some level, so you'd want to either
assume all "-ough" are the same phoneme and that
pronunciation depends on context (a bad assumption in
this case), or you'd want to find some way to
differentiate them (e.g. by typing your string in as
IPA rather than English, or by "making up" segments
like "ough{ow}", "ough{off}", etc.). Generally, it
should be possible to sweep this under the carpet of
predetermined lists.
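
To make the "made-up segments" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the %alphabet contents, the segment() name, and the feature labels are invented, not taken from the real program. It only shows that tagged spellings such as "ough{ow}" can sit in the same list as ordinary letters, provided the longest keys are tried first and the braces are escaped.

```perl
use strict;
use warnings;

# Hypothetical alphabet: keys are orthographic segments, values are
# placeholder feature labels. The "ough{...}" keys are invented tags.
my %alphabet = (
    'b'         => 'consonant b',
    'c'         => 'consonant c',
    'o'         => 'vowel o',
    'u'         => 'vowel u',
    'g'         => 'consonant g',
    'h'         => 'consonant h',
    'ough{ow}'  => 'as in "bough"',
    'ough{off}' => 'as in "cough"',
);

# Split $string into segments drawn from the keys of %$alphabet.
sub segment {
    my ($string, $alphabet) = @_;

    # Longest keys first, so "ough{ow}" is tried before "o".
    my @keys = sort { length($b) <=> length($a) or $a cmp $b } keys %$alphabet;

    # quotemeta() escapes the braces so they match literally.
    my $pattern = join '|', map { quotemeta($_) } @keys;

    my @segments;
    pos($string) = 0;
    while (pos($string) < length($string)) {
        # \G anchors at the current position; /c keeps pos() on failure.
        $string =~ /\G($pattern)/gc
            or die "can't parse next segment at offset " . pos($string) . "\n";
        push @segments, $1;
    }
    return @segments;
}

print join(' + ', segment('bough{ow}', \%alphabet)), "\n";
```

With this alphabet, "bough{ow}" and "cough{off}" come apart into different final segments even though the English spelling is the same.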

> Okay, but I still think that attacking this 
> problem will be easier if you start out with 
> these elements in a normal, hand-ordered list, 
> and then pre-populate the keys of one or more 
> hashes based on that.

My current approach sorts segments by length (long to
short) then alphabetically. This allows users to list
the segments in any order they please. Maybe they want
to list "sch" with "ch" in their German alphabet, or
maybe they want to list "sch" with "s"; I think the
program should be agnostic as to the order in which
they're stored in the file. The only reason I can
think of for hand-ordering is if that order represents
information that can't be otherwise calculated (such
as ordering them by their frequency in the given
language in order to speed up parsing input). This is
what I'm currently using to get from $string to
@segments:

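my (@length, @letters, @segments, $letters);
# Bucket the segments by length: $length[0] holds the one-character ones.
foreach (keys %{$this->{'alphabet'}}) {
        push @{$length[length($_)-1]}, $_;
}
# Longest bucket first, alphabetical within each bucket. The index range
# is an assumption; the archive's address scrubbing mangled the original.
for (reverse 0 .. $#length) {
        push @letters, sort @{$length[$_] || []};
}
# quotemeta() keeps segments like "." or "|" from acting as regex syntax.
$letters = '(' . join('|', map { quotemeta($_) } @letters) . ')';
# Test the length, so a string that is just "0" still gets parsed.
while (length($string)) {
        if ($string =~ s/^$letters//) {
                push @segments, $1;
        } else {
                return &error("can't parse next segment in '$string'");
        }
}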

Using substr() might speed things up, but for now I'm
just hoping to get it all to work.
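
For what it's worth, a substr() version might look roughly like the sketch below. This is an untimed guess at the idea, not the code in the program: segment_substr() and %german are hypothetical names, and whether it is actually faster than the regex would need benchmarking. It walks the string by offset and, at each position, tries the longest possible candidate first with a plain hash lookup, so no sorted list or alternation is needed.

```perl
use strict;
use warnings;

# Sketch only: segment_substr and its arguments are hypothetical names.
sub segment_substr {
    my ($string, $alphabet) = @_;

    # Length of the longest segment in the alphabet.
    my $max = 0;
    for my $key (keys %$alphabet) {
        $max = length($key) if length($key) > $max;
    }

    my @segments;
    my $pos = 0;
    my $end = length($string);

  POSITION:
    while ($pos < $end) {
        # Never look past the end of the string.
        my $longest = ($end - $pos < $max) ? $end - $pos : $max;

        # Try the longest candidate first, then shorter ones.
        for my $len (reverse 1 .. $longest) {
            my $candidate = substr($string, $pos, $len);
            if (exists $alphabet->{$candidate}) {
                push @segments, $candidate;
                $pos += $len;
                next POSITION;
            }
        }
        die "can't parse next segment at offset $pos in '$string'\n";
    }
    return @segments;
}

# Made-up German-style alphabet; only the keys matter here.
my %german = map { $_ => 1 } qw(s c h ch sch u l e);
print join('-', segment_substr('schule', \%german)), "\n";   # prints sch-u-l-e
```

Like the regex version, this is greedy: it commits to the longest match at each position and never backtracks.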

> I have a feeling that decomposing the string 
> into an array of characters might help

In short, that's what I'm doing, only replacing the
word "characters" with "segments" (which are generally
one character long, but not always). I'll look into
Parse::RecDescent, but so far I seem unable to grok
what I've read about it.

~wren
