On Thu, 22 Jan 2004, wren argetlahm wrote:

> I'm working on a linguistic module and I'm trying to
> find a good way to split a string up into "segments".

Your definition of "segment" here is vague; is it safe to ignore that and
just assume that each language's canonical list of 'segments' is static
and already stored as hash keys?

And for that matter, why a hash? Are you associating values of some kind
with each segment key? If you're not, this could be easier to solve with
plain arrays, since the list of elements can be manually determined:

  @english_segs = qw[ ch sh th ... x y z ];

Or whatever. This way, the common ones can be frontloaded, which may
speed things up a bit.


I bet there's a clever solution to a problem like this in Mastering
Algorithms with Perl (the "Wolf" book), but I'd have to poke around to
find it -- it's probably presented as the solution to a different problem.

At a guess, I think you want a loop based on the length of the longest
pre-determined element. For example, if the longest element is three or
four letters (maybe you count the 'ion' in words like 'traction', or the
'ious' in words like 'serious' [1]), then you can look at the string in
chunks of that many letters, looking for the longest possible match in
your elements list. Once you make a match, push back whatever is left
over and start again with the next chunk of three or four letters.
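
A rough, untuned sketch of that loop might look like the following. The
segment list and the name segment_greedy() are made up for illustration;
the real inventory would come from your own data. It also assumes that
any character not in the list should be passed through as a one-letter
segment, which may not be what you want.

```perl
use strict;
use warnings;

# Hypothetical inventory, for illustration only.
my @segs = qw(ch sh th ious ion a c e i n o r s t u);
my %is_seg = map { $_ => 1 } @segs;

# The longest known segment sets the size of the chunk we look at.
my $max_len = 0;
for my $seg (@segs) {
    $max_len = length($seg) if length($seg) > $max_len;
}

sub segment_greedy {
    my ($word) = @_;
    my @out;
    my $pos = 0;
    while ($pos < length($word)) {
        my $remaining = length($word) - $pos;
        my $try = $max_len < $remaining ? $max_len : $remaining;
        my $matched;
        # Try the longest candidate first, then shrink the chunk.
        for (my $len = $try; $len >= 1; $len--) {
            my $chunk = substr($word, $pos, $len);
            if ($is_seg{$chunk}) {
                $matched = $chunk;
                last;
            }
        }
        # Assumption: an unknown character becomes its own segment,
        # which also guarantees that the loop always advances.
        $matched = substr($word, $pos, 1) unless defined $matched;
        push @out, $matched;
        $pos += length($matched);
    }
    return @out;
}

print join('-', segment_greedy('traction')), "\n";
```

Note that this version wants the hash after all, but only for quick
membership tests, not for any values attached to the keys.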


I think I'm starting to describe how to implement the regex engine here :/
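
If so, it may be simpler to let the regex engine do the work. Perl tries
alternatives left to right and takes the first one that matches at a
given position, so sorting the list longest-first should give the
longest-match behaviour. Again just a sketch: the list and the name
segment_re() are invented, and the trailing '.' is my assumption that
anything unrecognized falls through as a single character.

```perl
use strict;
use warnings;

# Hypothetical inventory of multi-letter segments, for illustration only.
my @multi = qw(ch sh th ious ion);

# Longest alternatives first, so 'ious' is tried before 'ion'.
my $alt = join '|',
          map  { quotemeta($_) }
          sort { length($b) <=> length($a) } @multi;

# Assumption: any other single character is a segment by itself.
my $seg_re = qr/$alt|./s;

sub segment_re {
    my ($word) = @_;
    # With no capture groups, m//g in list context returns every match.
    my @out = $word =~ /$seg_re/g;
    return @out;
}

print join('-', segment_re('church')), "\n";
```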

Maybe Parse::RecDescent? Maybe I'm over-thinking this...



[1] My copy of /usr/share/dict/words has 75 words with 'ious', but only
    one word with 'iou[^s]', so I'm guessing that 'ious' might be taken
    as a single entity for your purposes.



-- 
Chris Devers
