On Thu Jan 22, 2004, wren argetlahm <[EMAIL PROTECTED]> wrote:
> I'm working on a linguistic module
Great! I'm an avid would-be linguist myself.
> and I'm trying to find a good way to split a string up into "segments". I can't assume single-character strings and want to assume maximal segments. As an example, the word "church" would be rendered as the list ('ch', 'u', 'r', 'ch') and wouldn't break the "ch" up smaller even though both "c" and "h" are valid segments in English. I have all the valid segments for a given language stored as keys in a hash, now I just need an algorithm to chop up a string into a list. Any ideas?
> ~wren
     1  #!/usr/bin/perl -w
     2
     3  use strict;
     4
     5  my @segments = qw(a b ch d f h i j k m n r s sh t u w y);
     6
     7  my $regex_string
     8      = join( '|', sort { length($b) <=> length($a) } @segments );
     9  # --> 'ch|sh|b|d|f|h|i|j|k|m|n|r|s|a|t|u|w|y'
    10
    11  my $regex = qr/$regex_string/;
    12
    13  my $string = 'afashiwaku';
    14
    15  print join(' ', segments_in($string)), "\n";
    16  # --> prints "a f a sh i w a k u\n"
    17
    18  sub segments_in {
    19      my ($str) = @_;
    20      return $str =~ /($regex)/g;
    21  }
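The magic lies in lines 8 and 20. Line 8 sorts the segments in descending order of length; Perl tries the alternatives in a regex from left to right, so putting the longer segments first means 'sh' is tried before 's'. Line 20 uses the /g regex modifier, which in list context returns a copy of every segment matched. Note that any characters that don't match a segment are silently skipped.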
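To see why the sort matters, here is a minimal sketch that segments "church" twice, once with the alternation in its given order and once with the longest segments first. The short segment list (which includes both 'c' and 'ch') and the variable names are made up for illustration; they are not from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical segment list that contains both 'c' and 'ch'.
my @segments = qw(c h ch u r);

# Alternation in the order given: 'c' is tried before 'ch'.
my $unsorted = join '|', @segments;

# Alternation with the longest segments first: 'ch' is tried before 'c'.
my $sorted = join '|', sort { length($b) <=> length($a) } @segments;

# In list context, /g returns every captured segment.
my @naive   = 'church' =~ /($unsorted)/g;
my @maximal = 'church' =~ /($sorted)/g;

print "unsorted: @naive\n";      # unsorted: c h u r c h
print "sorted:   @maximal\n";    # sorted:   ch u r ch
```

Only the sorted version gives the maximal segments asked for.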
Since your data is in a hash, you'll need to use keys(%myhash) in place of @segments in line 8.
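For what it's worth, the hash-based variant might look something like the sketch below. The hash name %segment_for is my invention, and the two extras are assumptions about what you might want rather than anything you asked for: quotemeta() in case a segment ever contains a regex metacharacter, and a check that dies if part of the string isn't covered by any known segment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical hash of valid segments; only the keys matter here.
my %segment_for = map { $_ => 1 } qw(a b ch d f h i j k m n r s sh t u w y);

# Longest keys first, so that 'sh' is tried before 's'.
# quotemeta() escapes any regex metacharacters in a key.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %segment_for;
my $regex = qr/$alternation/;

# Returns the list of segments, or dies if gluing the segments back
# together does not reproduce the input (i.e. something was skipped).
sub segments_in {
    my ($str) = @_;
    my @found = $str =~ /($regex)/g;
    die "can't fully segment '$str'\n" if join('', @found) ne $str;
    return @found;
}

print join(' ', segments_in('afashiwaku')), "\n";    # a f a sh i w a k u
```

If you would rather skip unknown characters quietly, as the original script does, just drop the die line.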
HTH,
Paul.
-- Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan [EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/