Re: tricky parsing question

Bill Stephenson Fri, 23 Jan 2004 05:18:24 -0800

Well, both the problem and the project are way over my head, but it looks to me like something that will only be solved with brute force.

I think if your string is split into words and the segments are sorted longest to shortest, then alphabetically, you could sort words by the first letter in the first segment, move on to the next letter in the next segment, and exit the loop for that word when the last segment has been identified. Does that make any sense?

Shawn McKinley wrote the regex gizmo at perlhelp.com. You might send him a note and ask him what he thinks. He lurks on the macosx list, but seldom reads every message. He's pretty good with regex's and your problem is certainly an interesting one.

You can reach him at [EMAIL PROTECTED]

Let me know what you come up with.

Kindest Regards,

Bill Stephenson

On Jan 22, 2004, at 11:28 PM, wren argetlahm wrote:

--- Bill Stephenson <[EMAIL PROTECTED]> wrote:

You need to get a book on regex's.


I know the solution lies in regex's, the problem is
that I can't quite figure out a generic enough way of
doing it. The problem is for a module and so the list
of valid segments is user defined. I guess I could do
something like:

$segs = '('. join('|', @segs) .')';
$string =~ s/^$segs//;
$first_seg = $1;

But I'd have to sort @segs somehow so that the longest
segments come first, and since alphabets can have many
many different segments, I worry about memory issues.
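
The longest-first idea above can be sketched roughly as follows. This is only an illustration under assumptions: the segment list, the sub name split_segments, and the sample word are made up for the example and are not from the module under discussion. It relies on Perl trying alternatives left to right, so sorting the list longest first makes the longest segment win at each position, and quotemeta keeps any regex metacharacters in user-defined segments literal.

```perl
use strict;
use warnings;

# Hypothetical user-defined segment list; in practice this might be the
# keys of the alphabet hash.
my @segs = ('t', 'h', 'th', 'sh', 's', 'a', 'i', 'n', 'ng', 'g');

# Sort longest first (ties broken alphabetically) and escape each segment
# so regex metacharacters inside a segment are matched literally.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b } @segs;
my $seg_re = qr/($alternation)/;

# Split a string into segments, taking the longest listed segment that
# matches at the current position. Dies if no segment matches there.
sub split_segments {
    my ($string) = @_;
    my @found;
    pos($string) = 0;
    while (pos($string) < length($string)) {
        if ($string =~ /\G$seg_re/gc) {
            push @found, $1;
        }
        else {
            die "No segment matches at offset " . pos($string) . " in '$string'\n";
        }
    }
    return @found;
}

print join('-', split_segments('thing')), "\n";   # prints "th-i-ng"
```

For a list of a few dozen short segments, the sort and the compiled pattern should both stay small. Note that this sketch is greedy: it never backs up to try a shorter segment if a later position fails to match.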

--- Bill Stephenson <[EMAIL PROTECTED]> wrote:

Perl.com has the best available, "Mastering
Regular Expressions" is what you want.

Sounds like a formidable task though. For some
additional help with your regex you can play
with a tool posted on the "perlhelp.com" web
site. Go to "Resources" and look for the
"Regular Expression Explanation Generator".

Thanks, I'll have to check those out sometime.

--- Rick Measham <[EMAIL PROTECTED]> wrote:

Wren, when you say 'segments' it appears you
mean phonemes or phonetics.


Yeah, I do mean phonemes (or something like it). The
module is language independent, but I'll check those
modules out.

--- Chris Devers <[EMAIL PROTECTED]> wrote:

Your definition of "segment" here is vague; is
it safe to ignore that and just accept that a
canonical list of each language's 'segments' is
a static thing that is already stored as hash
keys?


By "segment" I mean the smallest character or sequence
of characters that has a regular pronunciation. But
yes, it's safe to ignore that and assume there's a
canonical list of "segments" already in memory.

I am indeed associating the segments with values,
hence storing them as keys in a hash. Also, by storing
them that way, if I'm trying to find the values
associated with a given segment, I can quickly find it
by $all_segments{$segment_in_question} rather than
needing to do a for or foreach loop over an array of
an estimated 15..50 items.

The loop based off the longest element thing sounds
like a good idea, I'll see if I can get it to work.

For those who wonder what on earth I'm up to... it's
an OO module for autosegmental phonology. In short you
feed the object a string and an "alphabet" which maps
segments to values ("d" has +voicing, +dental,
-vocalic, etc) and it creates an array of hashes (or
hash of arrays) where the index is the sequence number
of the segment in the string, and where the key is the
name of the "tier" (voicing, dental, vocalic, etc).
Then there'll be ways to muck around with the object
à la phonetic rules. Then there'll be a method to tie
all of the tiers back together into a single string
(per the alphabet) and spit it back out.
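
A minimal sketch of the array-of-hashes layout described above, assuming the string has already been split into segments. Everything named here (the three-entry %alphabet, to_tiers, to_string, same_features) is hypothetical and only meant to show the shape of the data, not the module's actual interface.

```perl
use strict;
use warnings;

# Hypothetical alphabet: segment => { tier name => value }.
my %alphabet = (
    d => { voicing => '+', dental => '+', vocalic => '-' },
    t => { voicing => '-', dental => '+', vocalic => '-' },
    a => { voicing => '+', dental => '-', vocalic => '+' },
);

# Build an array of hashes: index = position in the string, key = tier.
sub to_tiers {
    my ($alphabet, @segments) = @_;
    my @tiers;
    for my $seg (@segments) {
        die "Unknown segment '$seg'\n" unless exists $alphabet->{$seg};
        push @tiers, { %{ $alphabet->{$seg} } };   # shallow copy per position
    }
    return @tiers;
}

# True if two tier hashes have the same keys and the same values.
sub same_features {
    my ($x, $y) = @_;
    return 0 unless scalar(keys %$x) == scalar(keys %$y);
    for my $k (keys %$x) {
        return 0 unless exists $y->{$k} && $x->{$k} eq $y->{$k};
    }
    return 1;
}

# Tie the tiers back together: find the segment whose values match exactly.
sub to_string {
    my ($alphabet, @tiers) = @_;
    my $out = '';
    for my $feat (@tiers) {
        my ($seg) = grep { same_features($alphabet->{$_}, $feat) }
                    sort keys %$alphabet;
        die "No segment has these tier values\n" unless defined $seg;
        $out .= $seg;
    }
    return $out;
}

my @tiers = to_tiers(\%alphabet, qw(d a t a));
$tiers[0]{voicing} = '-';                      # toy rule: devoice position 0
print to_string(\%alphabet, @tiers), "\n";     # prints "tata"
```

Copying each segment's hash means a rule can change one position without touching the alphabet itself. The reverse lookup here is a linear scan, which seems reasonable for an alphabet of a few dozen segments.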
