I think that if your string is split into words and the segments are sorted longest to shortest, then alphabetically, you could sort words by the first letter in the first segment, move on to the next letter in the next segment, and exit the loop for that word when the last segment has been identified. Does that make any sense?
Shawn McKinley wrote the regex gizmo at perlhelp.com. You might send him a note and ask him what he thinks. He lurks on the macosx list, but seldom reads every message. He's pretty good with regex's and your problem is certainly an interesting one.
You can reach him at [EMAIL PROTECTED]
Let me know what you come up with.
Kindest Regards,
Bill Stephenson
On Jan 22, 2004, at 11:28 PM, wren argetlahm wrote:
--- Bill Stephenson <[EMAIL PROTECTED]> wrote: You need to get a book on regex's.
I know the solution lies in regex's, the problem is that I can't quite figure out a generic enough way of doing it. The problem is for a module and so the list of valid segments is user defined. I guess I could do something like:
    $segs = '(' . join('|', @segs) . ')';
    $string =~ s/^$segs//;
    $first_seg = $1;
But I'd have to sort @segs somehow so that the longest segments come first, and since alphabets can have many many different segments, I worry about memory issues.
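One way the sorting concern could be handled is sketched below. It assumes the user's segments are plain strings; the example segment list and the helper name split_segments are hypothetical and not part of the module. The list is sorted once by descending length, each segment is passed through quotemeta in case it contains regex metacharacters, and the pattern is compiled once with qr//. For a list of a few dozen short segments, the sorted copy and the compiled pattern should stay small.

```perl
use strict;
use warnings;

# Hypothetical segment list; in the module this would be user defined.
my @segs = ('t', 'h', 'th', 's', 'sh', 'a', 'i');

# Longest segments first, so "th" is tried before "t".
# quotemeta protects segments that contain regex metacharacters.
my $alternation = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) or $a cmp $b } @segs;
my $seg_re = qr/($alternation)/;

# split_segments is a hypothetical helper name.
sub split_segments {
    my ($string) = @_;
    my @found;
    pos($string) = 0;
    # \G anchors each match where the previous one ended;
    # /c keeps pos() in place when a match finally fails.
    while ($string =~ /\G$seg_re/gc) {
        push @found, $1;
    }
    die "No valid segment at offset " . pos($string) . " in '$string'\n"
        if pos($string) < length($string);
    return @found;
}

print join('-', split_segments('thish')), "\n";   # prints "th-i-sh"
```

Unlike the s/^$segs// version, this walks the string with \G and leaves it unmodified, which may or may not suit the module.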
--- Bill Stephenson <[EMAIL PROTECTED]> wrote: Perl.com has the best available, "Mastering Regular Expressions" is what you want.
Sounds like a formidable task though. For some additional help with your regex you can play with a tool posted on the "perlhelp.com" web site. Go to "Resources" and look for the "Regular Expression Explanation Generator".
Thanks, I'll have to check those out sometime.
--- Rick Measham <[EMAIL PROTECTED]> wrote: Wren, when you say 'segments' it appears you mean phonemes or phonetics.
Yeah, I do mean phonemes (or something like it). The module is language independent, but I'll check those modules out.
--- Chris Devers <[EMAIL PROTECTED]> wrote: Your definition of "segment" here is vague; is it safe to ignore that and just accept that a canonical list of each language's 'segments' is a static thing that is already stored as hash keys?
By "segment" I mean the smallest character or sequence of characters that has a regular pronunciation. But yes, it's safe to ignore that and assume there's a canonical list of "segments" already in memory.
I am indeed associating the segments with values, hence storing them as keys in a hash. Also, by storing them that way, if I'm trying to find the values associated with a given segment, I can quickly find it by $all_segments{$segment_in_question} rather than needing to do a for or foreach loop over an array of an estimated 15..50 items.
The loop based off the longest element thing sounds like a good idea, I'll see if I can get it to work.
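As a rough sketch of what such a loop might look like when the segments are already hash keys: at each position, try the longest possible substring first and shrink until a key matches. The contents of %all_segments and the helper name tokenize are made up for illustration; only the idea of looking segments up by hash key comes from the discussion above.

```perl
use strict;
use warnings;

# Hypothetical alphabet: keys are segments, values are feature hashes.
my %all_segments = (
    t  => { voicing => '-', dental => '+' },
    d  => { voicing => '+', dental => '+' },
    th => { voicing => '-', dental => '+' },
    a  => { voicing => '+', vocalic => '+' },
);

# Length of the longest segment, computed once.
my $max_len = 0;
for my $seg (keys %all_segments) {
    $max_len = length($seg) if length($seg) > $max_len;
}

# tokenize is a hypothetical helper name.
sub tokenize {
    my ($string) = @_;
    my @found;
    my $pos = 0;
    POSITION: while ($pos < length($string)) {
        # Try the longest candidate first, then shorter ones.
        for (my $len = $max_len; $len >= 1; $len--) {
            my $candidate = substr($string, $pos, $len);
            next unless length($candidate) == $len;   # ran past the end
            if (exists $all_segments{$candidate}) {
                push @found, $candidate;
                $pos += $len;
                next POSITION;
            }
        }
        die "No valid segment at offset $pos in '$string'\n";
    }
    return @found;
}

print join('-', tokenize('thadat')), "\n";   # prints "th-a-d-a-t"
```

This needs no sorting at all, since each position costs at most $max_len hash lookups.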
For those who wonder what on earth I'm up to... it's an OO module for autosegmental phonology. In short, you feed the object a string and an "alphabet" which maps segments to values ("d" has +voicing, +dental, -vocalic, etc.) and it creates an array of hashes (or hash of arrays) where the index is the sequence number of the segment in the string, and where the key is the name of the "tier" (voicing, dental, vocalic, etc.). Then there'll be ways to muck around with the object à la phonetic rules. Then there'll be a method to tie all of the tiers back together into a single string (per the alphabet) and spit it back out.
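To make the data structure concrete, here is a minimal sketch of the array-of-hashes variant. The three-segment alphabet, the feature values for "a" and "t", the devoicing "rule", and the helper name signature are all invented for illustration; the actual module's interface may look nothing like this.

```perl
use strict;
use warnings;

# Hypothetical alphabet mapping segments to tier values.
my %alphabet = (
    d => { voicing => '+', dental => '+', vocalic => '-' },
    t => { voicing => '-', dental => '+', vocalic => '-' },
    a => { voicing => '+', dental => '-', vocalic => '+' },
);

# Assume the string has already been split into segments.
my @segments = qw(d a d);

# Array of hashes: index = position in the string, key = tier name.
# Each hash is copied so that rules do not modify the alphabet itself.
my @tiers = map { +{ %{ $alphabet{$_} } } } @segments;

# A made-up rule: devoice the final segment.
$tiers[-1]{voicing} = '-';

# signature is a hypothetical helper: a canonical string for a feature set.
sub signature {
    my ($features) = @_;
    return join ',', map { "$_=$features->{$_}" } sort keys %$features;
}

# Reverse lookup from feature set back to segment.
my %by_signature;
$by_signature{ signature($alphabet{$_}) } = $_ for keys %alphabet;

# Tie the tiers back together; '?' marks a feature set not in the alphabet.
my $output = join '', map { $by_signature{ signature($_) } // '?' } @tiers;
print "$output\n";   # prints "dat"
```

The reverse lookup assumes no two segments in the alphabet share an identical feature set; if they did, the later one would silently win.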
