Re: tricky parsing question

Paul Hoffman Thu, 12 Feb 2004 09:28:43 -0800

Sorry for the delay; I'm catching up.

On Thu Jan 22, 2004, wren argetlahm <[EMAIL PROTECTED]> wrote:

I'm working on a linguistic module

Great! I'm an avid would-be linguist myself.

and I'm trying to
find a good way to split a string up into "segments".
I can't assume single character strings and want to
assume maximal segments. As an example, the word
"church" would be rendered as the list ('ch', 'u',
'r', 'ch') and wouldn't break the "ch" up smaller even
though both "c" and "h" are valid segments in English.
I have all the valid segments for a given language
stored as keys in a hash, now I just need an algorithm
to chop up a string into a list. Any ideas?

~wren


 1  #!/usr/bin/perl -w
 2
 3  use strict;
 4
 5  my @segments = qw(a b ch d f h i j k m n r s sh t u w y);
 6
 7  my $regex_string
 8      = join( '|', sort { length($b) <=> length($a) } @segments );
 9      # --> 'ch|sh|a|b|d|f|h|i|j|k|m|n|r|s|t|u|w|y' (order within each length may vary)
10
11  my $regex = qr/$regex_string/;
12
13  my $string = 'afashiwaku';
14
15  print join(' ', segments_in($string)), "\n";
16      # --> prints "a f a sh i w a k u\n"
17
18  sub segments_in {
19      my ($str) = @_;
20      return $str =~ /($regex)/g;
21  }

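The magic lies in line 8 (sort in descending order of segment length, so that the alternation tries 'ch' and 'sh' before 'h' and 's') and line 20 (use the /g regex modifier, which returns a list of all matched segments when the match is evaluated in list context).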
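
To see why the sort matters, here is a small sketch using the "church" example from the question. The segment list is made up for the demonstration (it adds 'c', which my list above doesn't have), so treat it as an illustration rather than real data. Perl takes the first alternative that matches at a given position, so the order of the alternatives decides which segment wins.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical inventory that includes 'c', so "church" can be split two ways.
my @segments = qw(c ch h r u);

# Alternatives are tried left to right, so list order decides which one wins.
my $unsorted = join '|', @segments;    # 'c|ch|h|r|u'
my $sorted   = join '|', sort { length($b) <=> length($a) } @segments;

my $word = 'church';

# In list context, /g with one capture group returns every captured segment.
my @naive   = $word =~ /($unsorted)/g;
my @maximal = $word =~ /($sorted)/g;

print "unsorted: @naive\n";      # unsorted: c h u r c h
print "sorted:   @maximal\n";    # sorted:   ch u r ch
```

Only the relative order of segments with different lengths matters here; segments of the same length can appear in any order.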

Since your data is in a hash, you'll need to use keys(%myhash) in place of @segments in line 8.
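
Something like the following might do for the hash version. The contents of %myhash are invented for the example, and the quotemeta step is my own addition, in case a segment ever contains a character that is special in a regex (an apostrophe is harmless, but '.' or '?' would not be). Note also that characters matching no segment are skipped silently, so the last few lines show one possible way to check for that.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical segment table; only the keys are used here.
my %myhash = map { $_ => 1 } qw(a b c ch d f h i j k m n r s sh t u w y);

# Longest segments first, each one escaped so it is matched literally.
my $regex_string = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) }
    keys %myhash;

my $regex = qr/$regex_string/;

sub segments_in {
    my ($str) = @_;
    return $str =~ /($regex)/g;
}

print join(' ', segments_in('church')), "\n";    # ch u r ch

# If the pieces don't add up to the input, something was skipped.
my $input = 'chu-rch';
my @found = segments_in($input);
warn "unrecognized characters in '$input'\n"
    if join('', @found) ne $input;
```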

HTH,

Paul.

--
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/
