Re: tricky parsing question
Wren Argetlahm wrote: I'm working on a linguistic module and I'm trying to find a good way to split a string up into "segments". I can't assume single charecter strings and want to assume maximal segments. As an example, the word "church" would be rendered as the list ('ch', 'u', 'r', 'ch') and wouldn't break the "ch" up smaller even though both "c" and "h" are valid segments in English. I have all the valid segments for a given language stored as keys in a hash, now I just need an algorithm to chop up a string into a list. Any ideas? ~wren Why do you need to sort by alphabet after sorting by length? How about something like this: #!/usr/bin/perl -w use strict; my $text = "church"; my @alphabet = qw(c h ch u r); # make a regex with many alterations (compiled only once) # that puts the longest characters first my $letters = join "|" => reverse sort @alphabet; my $regex = qr/($letters)/; # then try to remove the first character from the string # using the regex (it matches the longest segments first) # and add it to @segments until our character doesn't # match anymore my @segments; push @segments, $1 while $text =~ s/^$regex//; print join "-" => @segments; Hope this helps, matt diephouse - http://matt.diephouse.com
Re: tricky parsing question
Sorry for the delay; I'm catching up. On Thu Jan 22, 2004, wren argetlahm <[EMAIL PROTECTED]> wrote: I'm working on a linguistic module Great! I'm an avid would-be linguist myself. and I'm trying to find a good way to split a string up into "segments". I can't assume single charecter strings and want to assume maximal segments. As an example, the word "church" would be rendered as the list ('ch', 'u', 'r', 'ch') and wouldn't break the "ch" up smaller even though both "c" and "h" are valid segments in English. I have all the valid segments for a given language stored as keys in a hash, now I just need an algorithm to chop up a string into a list. Any ideas? ~wren 1 #!/usr/bin/perl -w 2 3 use strict; 4 5 my @segments = qw(a b ch d f h i j k m n r s sh t u w y); 6 7 my $regex_string 8 = join( '|', sort { length($b) <=> length($a) } @segments ); 9 # --> 'ch|sh|b|d|f|h|i|j|k|m|n|r|s|a|t|u|w|y' 10 11 my $regex = qr/$regex_string/; 12 13 my $string = 'afashiwaku'; 14 15 print join(' ', segments_in($string)), "\n"; 16 # --> prints "a f a sh i w a k u\n" 17 18 sub segments_in { 19 my ($str) = @_; 20 return $str =~ /($regex)/g; 21 } The magic lies in lines 8 (sort in descending order of segment length) and 20 (use the /g regex modifier to get a copy of all segments when called in list context). Since your data is in a hash, you'll need to use keys(%myhash) in place of @segments in line 8. HTH, Paul. -- Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan [EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/
Re: tricky parsing question
--- Chris Devers <[EMAIL PROTECTED]> wrote: > Do you need to handle ambiguities? For example, > "-ough" can famously be pronounced several ways: The way it's set up now can't deal with them, but I'm about to rewrite the thing to handle more than one segment having the same orthographic representation. Output can be ambiguous without any problems so long as a given bundle of features has only one representation. But the input needs to be non-ambiguous on some level, so you'd want to either assume all "-ough" are the same phoneme and that pronunciation depends on context (a bad assumption in this case), or you'd want to find some way to differentiate them (e.g. by typing your string in as IPA rather than English, or by "making up" segments like "ough{ow}", "ough{off}", etc). Generally, it should be able be swept under the carpet of predetermined lists. > Okay, but I still think that attacking this > problem will be easier if you start out with > these elements in a normal, hand-ordered list, > and then pre-populate the keys of one or more > hashes based on that. My current approach sorts segments by length (long to short) then alphabetically. This allows users to list the segments in any order they please. Maybe they want to list "sch" with "ch" in their German alphabet, or maybe they want to list "sch" with "s"; I think the program should be agnostic as to the order in which they're stored in the file. The only reason I can think of for hand-ordering is if that order represents information that can't be otherwise calculated (such as ordering them by their frequency in the given language in order to speed up parsing input). This is what I'm currently using to get from $string to @segments: my (@length, @letters, @segments, $letters); foreach (keys %{$this->{'alphabet'}}) { push @{$length[length($_)-1]}, $_; } for (reverse [EMAIL PROTECTED]) { push @letters, sort @{$length[$_]}; } $letters = '(' . join('|', @letters) . ')'; while ($string) { if ($string =~ s/^$letters//) { push @segments, $1; } else { return &error("can't parse next segment in '$string'"); } } Using substr() might speed things up, but for now I'm just hoping to get it all to work. > I have a feeling that decomposing the string > into an array of characters might help In short, that's what I'm doing, only replacing the word "characters" with "segments" (which are generally one character long, but not always). I'll look into Parse::RecDescent, but so far I seem unable to grok what I've read about it. ~wren __ Do you Yahoo!? Yahoo! SiteBuilder - Free web site building tool. Try it! http://webhosting.yahoo.com/ps/sb/
Re: tricky parsing question
On Thu, 22 Jan 2004, wren argetlahm wrote: > --- Rick Measham <[EMAIL PROTECTED]> wrote: > > Wren, when you say 'segments' it appears you > > mean phonemes or phonetics. > > Yeah, I do mean phonemes (or something like it). The > module is language independent, but I'll check those > modules out. That's probably a good approach. Even if one of the existing linguistics modules isn't quite right, if it's good OO code you should be able to sub-class it to add the functionality you need. It seems like there are a lot of subtleties to be considered; standing on the shoulders of others who have worked on this may save a lot of pain. > --- Chris Devers <[EMAIL PROTECTED]> wrote: > > Your definition of "segment" here is vague; is > > it safe to ignore that and just accept that a > > canonical list of each language's 'segments' is > > a static thing that is already stored as hash > > keys? > > By "segment" I mean the smallest charecter or sequence > of charecters that has a regular pronunciation. But > yes, it's safe to ignore that and assume there's a > canonical list of "segments" already in memory. Do you need to handle ambiguities? For example, "-ough" can famously be pronounced several ways: bough -> 'bow' cough -> 'koff' dough -> 'doh' rough -> 'ruff' tough -> 'tuff' And my copy of /usr/share/dict/words also has words I don't know: hough -> 'hock' [seems to be synonymous with 'hock', meaning a bone joint, used by butchers for pork] jough -> I can't find a definition or pronounciation lough -> 'lock' [same as 'loch', as in 'Loch Ness Monster'] sough -> 'soh' [synonymous with 'field that is farmed' or 'groan'] wough -> I can't find a definition or pronunciation So that gives at least four or five ways to say '-ough', and maybe more. Does your code need to handle such things? Or can we, again, assume that this has been swept under the carpet of predetermined lists? > I am indeed associating the segments with values, > hence storing them as keys in a hash. Okay, but I still think that attacking this problem will be easier if you start out with these elements in a normal, hand-ordered list, and then pre-populate the keys of one or more hashes based on that. So, making this up and not intending this to be a perfect or complete approach to things: my ( @en_segs, @fr_segs, @de_segs, [], %en_hash, %fr_hash, %de_hash, [] ); # populate the arrays with predetermined lists @en_segs = qw [ # English ough ious ion [] # four & three letter segments ch sh th [] # two letter segments x y z# one letter segments ]; @fr_segs = qw[ [] ]; # French, repeat as above with English @de_segs = qw[ [] ]; # German, and so on # convert those lists to hashes # probably a more idiomatic way to do this, but whatever: foreach $key @en_segs { %en_hash{$_} = ""; } # this might be the more idiomatic way? this is untested... %fr_hash{$_} = "" foreach @fr_segs; %de_hash{$_} = "" foreach @de_segs; # etc for other languages At that point, you've got the data stored twice, and can begin working: my ( $string, $max_seg, $offset, $cur_str ); $string = 'supercalafragalisticixpyaladocious'; $max_seg = length( $en_segs[0] ); # ^ because the array is hand sorted, you know that the # first element will always have the longest segment $offset = 0; $cur_str = substr( $string, $offset, $max_seg); (If it's not obvious, I'm making this up as I type -- improvements are welcome.) From this point, you need to "walk the string", getting substrings into $cur_str, then looking for the longest part of that substring that exists in @en_segs, then incrementing $offset based on the longest match you found, moving to the next $cur_str, and repeat until the string is exhausted. Obviously, the last line or two up there ought to be wrapped in a while loop or something to make this work. I have a feeling that grep may help with the array lookups, but I can't think of how to phrase the line[s] that would do it. I have a feeling that decomposing the string into an array of characters might help (maybe with grep, etc), but then I have a feeling that doing that would be treating this too much like a C program, and Perl shouldn't have to stoop to parsing strings the way C does. I still have a feeling that Parse::RecDescent would make all of this a lot easier, but I'm not the one to walk you through using that module. This really does seem like the sort of problem that RecDescent is best at though, so it's worth looking up some of Damian Conway's documentation for the module. If you can get your head around it, it's probably *way* more effective than most any other approach anyone could suggest. -- Chris Devers
Re: tricky parsing question
> On Thu, 22 Jan 2004, wren argetlahm wrote: > > > Maybe Parse::RecDescent? Maybe I'm over-thinking this... > > This is what I thought of immediately, an old but excellent article maybe a good place to start: http://search.cpan.org/src/DCONWAY/Parse-RecDescent-1.94/tutorial/tutorial.html http://danconia.org
Re: tricky parsing question
On Thu, Jan 22, 2004 at 09:28:30PM -0800, wren argetlahm wrote: > --- Bill Stephenson <[EMAIL PROTECTED]> wrote: > > You need to get a book on regex's. > I know the solution lies in regex's I don't. I expect the code would be a lot clearer and considerably quicker if you pull your strings apart using index(), rindex() and substr(). -- David Cantrell | Degenerate | http://www.cantrell.org.uk/david If you have received this email in error, please add some nutmeg and egg whites, whisk, and place in a warm oven for 40 minutes.
Re: tricky parsing question
Well, both the problem and the project are way over my head, but it looks to me like something that will only be solved with brute force. I think if your string is split into words and the segments are sorted longest to shortest, alphabetically, then your could sort words by the first letter in the first segment and move onto the next letter in the next segment and exit the loop for that word when the last segment has been identified. Does that make any sense? Shawn McKinley wrote the regex gizmo at perlhelp.com. You might send him a note and ask him what he thinks. He lurks on the macosx list, but seldom reads every message. He's pretty good with regex's and your problem is certainly an interesting one. You can reach him at [EMAIL PROTECTED] Let me know what you come up with. Kindest Regards, Bill Stephenson On Jan 22, 2004, at 11:28 PM, wren argetlahm wrote: --- Bill Stephenson <[EMAIL PROTECTED]> wrote: You need to get a book on regex's. I know the solution lies in regex's, the problem is that I can't quite figure out a generic enough way of doing it. The problem is for a module and so the list of valid segments is user defined. I guess I could do something like: $segs = '('. join('|', @segs) .')'; $string =~ s/^$segs//; $first_seg = $1; But I'd have to sort @segs somehow so that the longest segments come first, and since alphabets can have many many different segments, I worry about memory issues. --- Bill Stephenson <[EMAIL PROTECTED]> wrote: Perl.com has the best available, "Mastering Regular Expressions" is what you want. Sounds like a formidable task though. For some additional help with your regex you can play with a tool posted on the "perlhelp.com" web site. Go to "Resources" and look for the "Regular Expression Explanation Generator". Thanks, I'll have to check those out sometime. --- Rick Measham <[EMAIL PROTECTED]> wrote: Wren, when you say 'segments' it appears you mean phonemes or phonetics. Yeah, I do mean phonemes (or something like it). The module is language independent, but I'll check those modules out. --- Chris Devers <[EMAIL PROTECTED]> wrote: Your definition of "segment" here is vague; is it safe to ignore that and just accept that a canonical list of each language's 'segments' is a static thing that is already stored as hash keys? By "segment" I mean the smallest charecter or sequence of charecters that has a regular pronunciation. But yes, it's safe to ignore that and assume there's a cannonical list of "segments" already in memory. I am indeed associating the segments with values, hence storing them as keys in a hash. Also, by storing them that way, if I'm trying to find the values associated with a given segment, I can quickly find it by $all_segments{$segment_in_question} rather than needing to do a for or foreach loop over an array of an estimated 15..50 items. The loop based off the longest element thing sounds like a good idea, I'll see if I can get it to work. For those who wonder what on earth I'm up to... it's an OO module for autosegmental phonology. In short you feed the object a string and an "alphabet" which maps segments to values ("d" has +voicing, +dental, -vocalic, etc) and it creates an array of hashes (or hash of arrays) where the index is the sequence number of the segment in the string, and where the key is the name of the "tier" (voicing, dental, vocalic, etc). Then there'll be ways to muck around with the object ala phonetic rules. Then there'll be a method to tie all of the tiers back together into a single string (per the alphabet) and spit it back out. __ Do you Yahoo!? Yahoo! SiteBuilder - Free web site building tool. Try it! http://webhosting.yahoo.com/ps/sb/
Re: tricky parsing question
wren, You need to get a book on regex's. Perl.com has the best available, "Mastering Regular Expressions" is what you want. Sounds like a formidable task though. For some additional help with your regex you can play with a tool posted on the "perlhelp.com" web site. Go to "Resources" and look for the "Regular Expression Explanation Generator". Kindest Regards, Bill Stephenson On Jan 22, 2004, at 8:21 PM, wren argetlahm wrote: I'm working on a linguistic module and I'm trying to find a good way to split a string up into "segments". I can't assume single charecter strings and want to assume maximal segments. As an example, the word "church" would be rendered as the list ('ch', 'u', 'r', 'ch') and wouldn't break the "ch" up smaller even though both "c" and "h" are valid segments in English. I have all the valid segments for a given language stored as keys in a hash, now I just need an algorithm to chop up a string into a list. Any ideas? ~wren __ Do you Yahoo!? Yahoo! SiteBuilder - Free web site building tool. Try it! http://webhosting.yahoo.com/ps/sb/
Re: tricky parsing question
--- Bill Stephenson <[EMAIL PROTECTED]> wrote: > You need to get a book on regex's. I know the solution lies in regex's, the problem is that I can't quite figure out a generic enough way of doing it. The problem is for a module and so the list of valid segments is user defined. I guess I could do something like: $segs = '('. join('|', @segs) .')'; $string =~ s/^$segs//; $first_seg = $1; But I'd have to sort @segs somehow so that the longest segments come first, and since alphabets can have many many different segments, I worry about memory issues. --- Bill Stephenson <[EMAIL PROTECTED]> wrote: > Perl.com has the best available, "Mastering > Regular Expressions" is what you want. > > Sounds like a formidable task though. For some > additional help with your regex you can play > with a tool posted on the "perlhelp.com" web > site. Go to "Resources" and look for the > "Regular Expression Explanation Generator". Thanks, I'll have to check those out sometime. --- Rick Measham <[EMAIL PROTECTED]> wrote: > Wren, when you say 'segments' it appears you > mean phonemes or phonetics. Yeah, I do mean phonemes (or something like it). The module is language independent, but I'll check those modules out. --- Chris Devers <[EMAIL PROTECTED]> wrote: > Your definition of "segment" here is vague; is > it safe to ignore that and just accept that a > canonical list of each language's 'segments' is > a static thing that is already stored as hash > keys? By "segment" I mean the smallest charecter or sequence of charecters that has a regular pronunciation. But yes, it's safe to ignore that and assume there's a cannonical list of "segments" already in memory. I am indeed associating the segments with values, hence storing them as keys in a hash. Also, by storing them that way, if I'm trying to find the values associated with a given segment, I can quickly find it by $all_segments{$segment_in_question} rather than needing to do a for or foreach loop over an array of an estimated 15..50 items. The loop based off the longest element thing sounds like a good idea, I'll see if I can get it to work. For those who wonder what on earth I'm up to... it's an OO module for autosegmental phonology. In short you feed the object a string and an "alphabet" which maps segments to values ("d" has +voicing, +dental, -vocalic, etc) and it creates an array of hashes (or hash of arrays) where the index is the sequence number of the segment in the string, and where the key is the name of the "tier" (voicing, dental, vocalic, etc). Then there'll be ways to muck around with the object ala phonetic rules. Then there'll be a method to tie all of the tiers back together into a single string (per the alphabet) and spit it back out. __ Do you Yahoo!? Yahoo! SiteBuilder - Free web site building tool. Try it! http://webhosting.yahoo.com/ps/sb/
Re: tricky parsing question
On Thu, 22 Jan 2004, wren argetlahm wrote: > I'm working on a linguistic module and I'm trying to > find a good way to split a string up into "segments". Your definition os "segment" here is vague; is it safe to ignore that and just accept that a canonical list of each language's 'segments' is a static thing that is already stored as hash keys? And for that matter, why a hash? Are you associating values of some kind with each segment key? If you're not, this could be easier to solve with plain arrays, since the list of elements can be manually determined: @english_segs = qw[ ch sh th ... x y z ]; Or whatever. This way, the common ones can be frontloaded, which may speed things up a bit. I bet there's a clever solution to a problem like this in the Perl Algorithms ("Wolf") book, but I'd have to poke around to find it -- it's probably presented as the solution to a different problem. At a guess, I think you want a loop based on the length of the longest pre-determined element. Hence, if the longest element is three or four letters (maybe you count the 'ion' in words like 'traction', or the 'ious' in words like 'serious' [1]), then you can look at the string in chunks of that many letters, looking for the longest possible match in your elements list, then push back whatever is left over after you make a match and start over again with the next chunk of three or four letters. I think I'm starting to describe how to implement the regex engine here :/ Maybe Parse::RecDescent? Maybe I'm over-thinking this... [1] My copy of /usr/share/dict/words has 75 words with 'ious', but only one word with 'iou[^s]', so I'm guessing that 'ious' might be taken as a single entity for your purposes. -- Chris Devers
Re: tricky parsing question
On 23 Jan 2004, at 01:21 pm, wren argetlahm wrote: I'm working on a linguistic module and I'm trying to find a good way to split a string up into "segments". I can't assume single charecter strings and want to assume maximal segments. As an example, the word "church" would be rendered as the list ('ch', 'u', 'r', 'ch') and wouldn't break the "ch" up smaller even though both "c" and "h" are valid segments in English. I have all the valid segments for a given language stored as keys in a hash, now I just need an algorithm to chop up a string into a list. Any ideas? Wren, when you say 'segments' it appears you mean phonemes or phonetics. CPAN has several modules that may help you: Lingua::Phoneme uses the Moby Pronounciation Dictionery to find the phonemes. Text::Metaphone also deals with phonemes and will return 'Church' as 'XRX' meaning 'ch', 'r', 'ch'. Unfortunately it returns the 'ch' in 'Character' as an 'X' also. And that, of course, is the most difficult part. English is such a hodge-podge of hacks from other languages the understanding it via algorithms is very very hard. Cheers! Rick Rick Measham Senior Designer and Developer Printaform Pty Ltd Tel: (03) 9850 3255 Fax: (03) 9850 3277 http://www.printaform.com.au http://www.printsupply.com.au vcard: http://www.printaform.com.au/staff/rickm.vcf
tricky parsing question
I'm working on a linguistic module and I'm trying to find a good way to split a string up into "segments". I can't assume single charecter strings and want to assume maximal segments. As an example, the word "church" would be rendered as the list ('ch', 'u', 'r', 'ch') and wouldn't break the "ch" up smaller even though both "c" and "h" are valid segments in English. I have all the valid segments for a given language stored as keys in a hash, now I just need an algorithm to chop up a string into a list. Any ideas? ~wren __ Do you Yahoo!? Yahoo! SiteBuilder - Free web site building tool. Try it! http://webhosting.yahoo.com/ps/sb/