Re: tricky parsing question

2004-02-12 Thread Matt Diephouse
Wren Argetlahm wrote:
I'm working on a linguistic module and I'm trying to
find a good way to split a string up into "segments".
I can't assume single charecter strings and want to
assume maximal segments. As an example, the word
"church" would be rendered as the list ('ch', 'u',
'r', 'ch') and wouldn't break the "ch" up smaller even
though both "c" and "h" are valid segments in English.
I have all the valid segments for a given language
stored as keys in a hash, now I just need an algorithm
to chop up a string into a list. Any ideas?
~wren
Why do you need to sort by alphabet after sorting by length? How about 
something like this:

#!/usr/bin/perl -w
use strict;
my $text = "church";
my @alphabet = qw(c h ch u r);
# make a regex with many alterations (compiled only once)
# that puts the longest characters first
my $letters = join "|" => reverse sort @alphabet;
my $regex = qr/($letters)/;
# then try to remove the first character from the string
# using the regex (it matches the longest segments first)
# and add it to @segments until our character doesn't
# match anymore
my @segments;
push @segments, $1 while $text =~ s/^$regex//;
print join "-" => @segments;

Hope this helps,

matt diephouse
-
http://matt.diephouse.com


Re: tricky parsing question

2004-02-12 Thread Paul Hoffman
Sorry for the delay; I'm catching up.

On Thu Jan 22, 2004, wren argetlahm <[EMAIL PROTECTED]> wrote:

I'm working on a linguistic module
Great!  I'm an avid would-be linguist myself.

and I'm trying to
find a good way to split a string up into "segments".
I can't assume single charecter strings and want to
assume maximal segments. As an example, the word
"church" would be rendered as the list ('ch', 'u',
'r', 'ch') and wouldn't break the "ch" up smaller even
though both "c" and "h" are valid segments in English.
I have all the valid segments for a given language
stored as keys in a hash, now I just need an algorithm
to chop up a string into a list. Any ideas?
~wren

 1  #!/usr/bin/perl -w
 2
 3  use strict;
 4
 5  my @segments = qw(a b ch d f h i j k m n r s sh t u w y);
 6
 7  my $regex_string
 8  = join( '|', sort { length($b) <=> length($a) } @segments );
 9  # --> 'ch|sh|b|d|f|h|i|j|k|m|n|r|s|a|t|u|w|y'
10
11  my $regex = qr/$regex_string/;
12
13  my $string = 'afashiwaku';
14
15  print join(' ', segments_in($string)), "\n";
16  # --> prints "a f a sh i w a k u\n"
17
18  sub segments_in {
19  my ($str) = @_;
20  return $str =~ /($regex)/g;
21  }
The magic lies in lines 8 (sort in descending order of segment length) 
and 20 (use the /g regex modifier to get a copy of all segments when 
called in list context).

Since your data is in a hash, you'll need to use keys(%myhash) in place 
of @segments in line 8.

HTH,

Paul.

--
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/


Re: tricky parsing question

2004-01-23 Thread wren argetlahm
--- Chris Devers <[EMAIL PROTECTED]> wrote:
> Do you need to handle ambiguities? For example,
> "-ough" can famously be pronounced several ways:

The way it's set up now can't deal with them, but I'm
about to rewrite the thing to handle more than one
segment having the same orthographic representation.
Output can be ambiguous without any problems so long
as a given bundle of features has only one
representation. But the input needs to be
non-ambiguous on some level, so you'd want to either
assume all "-ough" are the same phoneme and that
pronunciation depends on context (a bad assumption in
this case), or you'd want to find some way to
differentiate them (e.g. by typing your string in as
IPA rather than English, or by "making up" segments
like "ough{ow}", "ough{off}", etc).  Generally, it
should be able be swept under the carpet of
predetermined lists.

> Okay, but I still think that attacking this 
> problem will be easier if you start out with 
> these elements in a normal, hand-ordered list, 
> and then pre-populate the keys of one or more 
> hashes based on that.

My current approach sorts segments by length (long to
short) then alphabetically. This allows users to list
the segments in any order they please. Maybe they want
to list "sch" with "ch" in their German alphabet, or
maybe they want to list "sch" with "s"; I think the
program should be agnostic as to the order in which
they're stored in the file. The only reason I can
think of for hand-ordering is if that order represents
information that can't be otherwise calculated (such
as ordering them by their frequency in the given
language in order to speed up parsing input). This is
what I'm currently using to get from $string to
@segments:

my (@length, @letters, @segments, $letters);
foreach (keys %{$this->{'alphabet'}}) {
push @{$length[length($_)-1]}, $_;
}
for (reverse [EMAIL PROTECTED]) {
push @letters, sort @{$length[$_]};
}
$letters = '(' . join('|', @letters) . ')';
while ($string) {
if ($string =~ s/^$letters//) {
push @segments, $1;
} else {
return &error("can't parse next segment in
'$string'");
}
} 

Using substr() might speed things up, but for now I'm
just hoping to get it all to work.

> I have a feeling that decomposing the string 
> into an array of characters might help

In short, that's what I'm doing, only replacing the
word "characters" with "segments" (which are generally
one character long, but not always). I'll look into
Parse::RecDescent, but so far I seem unable to grok
what I've read about it.

~wren

__
Do you Yahoo!?
Yahoo! SiteBuilder - Free web site building tool. Try it!
http://webhosting.yahoo.com/ps/sb/


Re: tricky parsing question

2004-01-23 Thread Chris Devers
On Thu, 22 Jan 2004, wren argetlahm wrote:

> --- Rick Measham <[EMAIL PROTECTED]> wrote:
> > Wren, when you say 'segments' it appears you
> > mean phonemes or phonetics.
>
> Yeah, I do mean phonemes (or something like it). The
> module is language independent, but I'll check those
> modules out.

That's probably a good approach. Even if one of the existing linguistics
modules isn't quite right, if it's good OO code you should be able to
sub-class it to add the functionality you need.

It seems like there are a lot of subtleties to be considered; standing on
the shoulders of others who have worked on this may save a lot of pain.

> --- Chris Devers <[EMAIL PROTECTED]> wrote:
> > Your definition of "segment" here is vague; is
> > it safe to ignore that and just accept that a
> > canonical list of each language's 'segments' is
> > a static thing that is already stored as hash
> > keys?
>
> By "segment" I mean the smallest charecter or sequence
> of charecters that has a regular pronunciation. But
> yes, it's safe to ignore that and assume there's a
> canonical list of "segments" already in memory.

Do you need to handle ambiguities? For example, "-ough" can famously be
pronounced several ways:

bough -> 'bow'
cough -> 'koff'
dough -> 'doh'
rough -> 'ruff'
tough -> 'tuff'

And my copy of /usr/share/dict/words also has words I don't know:

hough -> 'hock'  [seems to be synonymous with 'hock', meaning
 a bone joint, used by butchers for pork]
jough -> I can't find a definition or pronounciation
lough -> 'lock'  [same as 'loch', as in 'Loch Ness Monster']
sough -> 'soh'   [synonymous with 'field that is farmed' or 'groan']
wough -> I can't find a definition or pronunciation

So that gives at least four or five ways to say '-ough', and maybe more.

Does your code need to handle such things? Or can we, again, assume that
this has been swept under the carpet of predetermined lists?


> I am indeed associating the segments with values,
> hence storing them as keys in a hash.

Okay, but I still think that attacking this problem will be easier if you
start out with these elements in a normal, hand-ordered list, and then
pre-populate the keys of one or more hashes based on that. So, making this
up and not intending this to be a perfect or complete approach to things:

my ( @en_segs, @fr_segs, @de_segs, [],
 %en_hash, %fr_hash, %de_hash, [] );

# populate the arrays with predetermined lists
@en_segs = qw [   # English
ough ious ion [] #  four & three letter segments
ch sh th []  #  two letter segments
x y z#  one letter segments
];
@fr_segs = qw[ [] ];  # French, repeat as above with English
@de_segs = qw[ [] ];  # German, and so on

# convert those lists to hashes
# probably a more idiomatic way to do this, but whatever:
foreach $key @en_segs {
%en_hash{$_} = "";
}

# this might be the more idiomatic way? this is untested...
%fr_hash{$_} = "" foreach @fr_segs;
%de_hash{$_} = "" foreach @de_segs; # etc for other languages

At that point, you've got the data stored twice, and can begin working:

my ( $string, $max_seg, $offset, $cur_str );
$string = 'supercalafragalisticixpyaladocious';
$max_seg = length( $en_segs[0] );
# ^ because the array is hand sorted, you know that the
#   first element will always have the longest segment
$offset = 0;

$cur_str = substr( $string, $offset, $max_seg);

(If it's not obvious, I'm making this up as I type -- improvements are
welcome.) From this point, you need to "walk the string", getting
substrings into $cur_str, then looking for the longest part of that
substring that exists in @en_segs, then incrementing $offset based on the
longest match you found, moving to the next $cur_str, and repeat until the
string is exhausted. Obviously, the last line or two up there ought to be
wrapped in a while loop or something to make this work.

I have a feeling that grep may help with the array lookups, but I can't
think of how to phrase the line[s] that would do it.

I have a feeling that decomposing the string into an array of characters
might help (maybe with grep, etc), but then I have a feeling that doing
that would be treating this too much like a C program, and Perl shouldn't
have to stoop to parsing strings the way C does.

I still have a feeling that Parse::RecDescent would make all of this a lot
easier, but I'm not the one to walk you through using that module. This
really does seem like the sort of problem that RecDescent is best at
though, so it's worth looking up some of Damian Conway's documentation for
the module. If you can get your head around it, it's probably *way* more
effective than most any other approach anyone could suggest.



-- 
Chris Devers


Re: tricky parsing question

2004-01-23 Thread Wiggins d Anconia


> On Thu, 22 Jan 2004, wren argetlahm wrote:
> 



> 
> Maybe Parse::RecDescent? Maybe I'm over-thinking this...
> 
> 

This is what I thought of immediately, an old but excellent article
maybe a good place to start:

http://search.cpan.org/src/DCONWAY/Parse-RecDescent-1.94/tutorial/tutorial.html

http://danconia.org


Re: tricky parsing question

2004-01-23 Thread David Cantrell
On Thu, Jan 22, 2004 at 09:28:30PM -0800, wren argetlahm wrote:
> --- Bill Stephenson <[EMAIL PROTECTED]> wrote:
> > You need to get a book on regex's.
> I know the solution lies in regex's

I don't.  I expect the code would be a lot clearer and considerably
quicker if you pull your strings apart using index(), rindex() and
substr().

-- 
David Cantrell |  Degenerate  | http://www.cantrell.org.uk/david

If you have received this email in error, please add some nutmeg
and egg whites, whisk, and place in a warm oven for 40 minutes.


Re: tricky parsing question

2004-01-23 Thread Bill Stephenson
Well, both the problem and the project are way over my head, but it 
looks to me like something that will only be solved with brute force.

I think if your string is split into words and the segments are sorted 
longest to shortest, alphabetically, then your could sort words by the 
first letter in the first segment and move onto the next letter in the 
next segment and exit the loop for that word when the last segment has 
been identified. Does that make any sense?

Shawn McKinley wrote the regex gizmo at perlhelp.com. You might send 
him a note and ask him what he thinks. He lurks on the macosx list, but 
seldom reads every message. He's pretty good with regex's and your 
problem is certainly an interesting one.

You can reach him at [EMAIL PROTECTED]

Let me know what you come up with.

Kindest Regards,

Bill Stephenson

On Jan 22, 2004, at 11:28 PM, wren argetlahm wrote:

--- Bill Stephenson <[EMAIL PROTECTED]> wrote:
You need to get a book on regex's.
I know the solution lies in regex's, the problem is
that I can't quite figure out a generic enough way of
doing it. The problem is for a module and so the list
of valid segments is user defined. I guess I could do
something like:
$segs = '('. join('|', @segs) .')';
$string =~ s/^$segs//;
$first_seg = $1;
But I'd have to sort @segs somehow so that the longest
segments come first, and since alphabets can have many
many different segments, I worry about memory issues.
--- Bill Stephenson <[EMAIL PROTECTED]> wrote:
Perl.com has the best available, "Mastering
Regular Expressions" is what you want.
Sounds like a formidable task though. For some
additional help with your regex you can play
with a tool posted on the "perlhelp.com" web
site. Go to "Resources" and look for the
"Regular Expression Explanation Generator".
Thanks, I'll have to check those out sometime.

--- Rick Measham <[EMAIL PROTECTED]> wrote:
Wren, when you say 'segments' it appears you
mean phonemes or phonetics.
Yeah, I do mean phonemes (or something like it). The
module is language independent, but I'll check those
modules out.
--- Chris Devers <[EMAIL PROTECTED]> wrote:
Your definition of "segment" here is vague; is
it safe to ignore that and just accept that a
canonical list of each language's 'segments' is
a static thing that is already stored as hash
keys?
By "segment" I mean the smallest charecter or sequence
of charecters that has a regular pronunciation. But
yes, it's safe to ignore that and assume there's a
cannonical list of "segments" already in memory.
I am indeed associating the segments with values,
hence storing them as keys in a hash. Also, by storing
them that way, if I'm trying to find the values
associated with a given segment, I can quickly find it
by $all_segments{$segment_in_question} rather than
needing to do a for or foreach loop over an array of
an estimated 15..50 items.
The loop based off the longest element thing sounds
like a good idea, I'll see if I can get it to work.
For those who wonder what on earth I'm up to... it's
an OO module for autosegmental phonology. In short you
feed the object a string and an "alphabet" which maps
segments to values ("d" has +voicing, +dental,
-vocalic, etc) and it creates an array of hashes (or
hash of arrays) where the index is the sequence number
of the segment in the string, and where the key is the
name of the "tier" (voicing, dental, vocalic, etc).
Then there'll be ways to muck around with the object
ala phonetic rules. Then there'll be a method to tie
all of the tiers back together into a single string
(per the alphabet) and spit it back out.
__
Do you Yahoo!?
Yahoo! SiteBuilder - Free web site building tool. Try it!
http://webhosting.yahoo.com/ps/sb/




Re: tricky parsing question

2004-01-23 Thread Bill Stephenson
wren,

You need to get a book on regex's. Perl.com has the best available, 
"Mastering Regular Expressions" is what you want.

Sounds like a formidable task though. For some additional help with 
your regex you can play with a tool posted on the "perlhelp.com" web 
site. Go to "Resources" and look for the "Regular Expression 
Explanation Generator".

Kindest Regards,

Bill Stephenson

On Jan 22, 2004, at 8:21 PM, wren argetlahm wrote:

I'm working on a linguistic module and I'm trying to
find a good way to split a string up into "segments".
I can't assume single charecter strings and want to
assume maximal segments. As an example, the word
"church" would be rendered as the list ('ch', 'u',
'r', 'ch') and wouldn't break the "ch" up smaller even
though both "c" and "h" are valid segments in English.
I have all the valid segments for a given language
stored as keys in a hash, now I just need an algorithm
to chop up a string into a list. Any ideas?
~wren

__
Do you Yahoo!?
Yahoo! SiteBuilder - Free web site building tool. Try it!
http://webhosting.yahoo.com/ps/sb/




Re: tricky parsing question

2004-01-22 Thread wren argetlahm
--- Bill Stephenson <[EMAIL PROTECTED]> wrote:
> You need to get a book on regex's.

I know the solution lies in regex's, the problem is
that I can't quite figure out a generic enough way of
doing it. The problem is for a module and so the list
of valid segments is user defined. I guess I could do
something like:

$segs = '('. join('|', @segs) .')';
$string =~ s/^$segs//;
$first_seg = $1;

But I'd have to sort @segs somehow so that the longest
segments come first, and since alphabets can have many
many different segments, I worry about memory issues.

--- Bill Stephenson <[EMAIL PROTECTED]> wrote:
> Perl.com has the best available, "Mastering 
> Regular Expressions" is what you want.
> 
> Sounds like a formidable task though. For some
> additional help with your regex you can play 
> with a tool posted on the "perlhelp.com" web 
> site. Go to "Resources" and look for the 
> "Regular Expression Explanation Generator".

Thanks, I'll have to check those out sometime.

--- Rick Measham <[EMAIL PROTECTED]> wrote:
> Wren, when you say 'segments' it appears you 
> mean phonemes or phonetics.

Yeah, I do mean phonemes (or something like it). The
module is language independent, but I'll check those
modules out.

--- Chris Devers <[EMAIL PROTECTED]> wrote:
> Your definition of "segment" here is vague; is 
> it safe to ignore that and just accept that a 
> canonical list of each language's 'segments' is 
> a static thing that is already stored as hash 
> keys?

By "segment" I mean the smallest charecter or sequence
of charecters that has a regular pronunciation. But
yes, it's safe to ignore that and assume there's a
cannonical list of "segments" already in memory.

I am indeed associating the segments with values,
hence storing them as keys in a hash. Also, by storing
them that way, if I'm trying to find the values
associated with a given segment, I can quickly find it
by $all_segments{$segment_in_question} rather than
needing to do a for or foreach loop over an array of
an estimated 15..50 items.

The loop based off the longest element thing sounds
like a good idea, I'll see if I can get it to work.

For those who wonder what on earth I'm up to... it's
an OO module for autosegmental phonology. In short you
feed the object a string and an "alphabet" which maps
segments to values ("d" has +voicing, +dental,
-vocalic, etc) and it creates an array of hashes (or
hash of arrays) where the index is the sequence number
of the segment in the string, and where the key is the
name of the "tier" (voicing, dental, vocalic, etc).
Then there'll be ways to muck around with the object
ala phonetic rules. Then there'll be a method to tie
all of the tiers back together into a single string
(per the alphabet) and spit it back out.

__
Do you Yahoo!?
Yahoo! SiteBuilder - Free web site building tool. Try it!
http://webhosting.yahoo.com/ps/sb/


Re: tricky parsing question

2004-01-22 Thread Chris Devers
On Thu, 22 Jan 2004, wren argetlahm wrote:

> I'm working on a linguistic module and I'm trying to
> find a good way to split a string up into "segments".

Your definition os "segment" here is vague; is it safe to ignore that and
just accept that a canonical list of each language's 'segments' is a
static thing that is already stored as hash keys? 

And for that matter, why a hash? Are you associating values of some kind
with each segment key? If you're not, this could be easier to solve with
plain arrays, since the list of elements can be manually determined:

  @english_segs = qw[ ch sh th ... x y z ];

Or whatever. This way, the common ones can be frontloaded, which may
speed things up a bit.


I bet there's a clever solution to a problem like this in the Perl
Algorithms ("Wolf") book, but I'd have to poke around to find it -- it's
probably presented as the solution to a different problem. 

At a guess, I think you want a loop based on the length of the longest
pre-determined element. Hence, if the longest element is three or four
letters (maybe you count the 'ion' in words like 'traction', or the 'ious'
in words like 'serious' [1]), then you can look at the string in chunks of
that many letters, looking for the longest possible match in your elements
list, then push back whatever is left over after you make a match and
start over again with the next chunk of three or four letters. 


I think I'm starting to describe how to implement the regex engine here :/

Maybe Parse::RecDescent? Maybe I'm over-thinking this...



[1] My copy of /usr/share/dict/words has 75 words with 'ious', but only
one word with 'iou[^s]', so I'm guessing that 'ious' might be taken
as a single entity for your purposes.



-- 
Chris Devers



Re: tricky parsing question

2004-01-22 Thread Rick Measham
On 23 Jan 2004, at 01:21 pm, wren argetlahm wrote:

I'm working on a linguistic module and I'm trying to
find a good way to split a string up into "segments".
I can't assume single charecter strings and want to
assume maximal segments. As an example, the word
"church" would be rendered as the list ('ch', 'u',
'r', 'ch') and wouldn't break the "ch" up smaller even
though both "c" and "h" are valid segments in English.
I have all the valid segments for a given language
stored as keys in a hash, now I just need an algorithm
to chop up a string into a list. Any ideas?
Wren, when you say 'segments' it appears you mean phonemes or phonetics.

CPAN has several modules that may help you:

Lingua::Phoneme uses the Moby Pronounciation Dictionery to find the 
phonemes.

Text::Metaphone also deals with phonemes and will return 'Church' as 
'XRX' meaning 'ch', 'r', 'ch'. Unfortunately it returns the 'ch' in 
'Character' as an 'X' also.

And that, of course, is the most difficult part. English is such a 
hodge-podge of hacks from other languages the understanding it via 
algorithms is very very hard.

Cheers!
Rick


Rick Measham
Senior Designer and Developer
Printaform Pty Ltd
Tel: (03) 9850 3255
Fax: (03) 9850 3277
http://www.printaform.com.au
http://www.printsupply.com.au
vcard: http://www.printaform.com.au/staff/rickm.vcf


tricky parsing question

2004-01-22 Thread wren argetlahm
I'm working on a linguistic module and I'm trying to
find a good way to split a string up into "segments".
I can't assume single charecter strings and want to
assume maximal segments. As an example, the word
"church" would be rendered as the list ('ch', 'u',
'r', 'ch') and wouldn't break the "ch" up smaller even
though both "c" and "h" are valid segments in English.
I have all the valid segments for a given language
stored as keys in a hash, now I just need an algorithm
to chop up a string into a list. Any ideas?

~wren

__
Do you Yahoo!?
Yahoo! SiteBuilder - Free web site building tool. Try it!
http://webhosting.yahoo.com/ps/sb/