I'm working on a linguistic module and I'm trying to find a good way to split a string up into "segments". I can't assume single-character segments, and I want to assume maximal segments. As an example, the word "church" would be rendered as the list ('ch', 'u', 'r', 'ch'); the "ch" wouldn't be broken up any further, even though both "c" and "h" are valid segments in English. I have all the valid segments for a given language stored as keys in a hash; now I just need an algorithm to chop up a string into a list. Any ideas?
Wren, when you say 'segments' it appears you mean phonemes or phonetics.
CPAN has several modules that may help you:
Lingua::Phoneme uses the Moby Pronunciation Dictionary to find the phonemes.
Text::Metaphone also deals with phonemes and will return 'Church' as 'XRX' meaning 'ch', 'r', 'ch'. Unfortunately it returns the 'ch' in 'Character' as an 'X' also.
And that, of course, is the most difficult part. English is such a hodge-podge of hacks from other languages that understanding it via algorithms is very, very hard.
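That said, if all you need is to chop a string into the longest segments found in your hash, a greedy longest-match scan is one simple way to do it. Here is a minimal sketch in Python rather than Perl (the loop should port directly to a hash lookup with substr). The function name `segment` and the small `english` inventory are made up for illustration, and the sketch assumes the scan never needs to backtrack: it takes the longest match at each position and gives up if nothing matches.

```python
def segment(word, segments):
    """Split word into maximal segments using a greedy longest-match scan.

    `segments` is any container of valid segment strings (the keys of
    your hash). Raises ValueError if some position matches no segment.
    """
    # The longest key bounds how far ahead we ever need to look.
    max_len = max((len(s) for s in segments), default=0)
    result = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in segments:
                result.append(piece)
                i += size
                break
        else:
            # No candidate of any length matched at this position.
            raise ValueError(f"no valid segment at position {i} in {word!r}")
    return result


# Hypothetical inventory, just enough for the example word.
english = {"ch", "c", "h", "u", "r"}
print(segment("church", english))  # ['ch', 'u', 'r', 'ch']
```

Note that this works purely on spelling, so it will still treat the "ch" in "character" as one segment, which is the same problem described above.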
Cheers! Rick
Rick Measham Senior Designer and Developer
Printaform Pty Ltd Tel: (03) 9850 3255 Fax: (03) 9850 3277 http://www.printaform.com.au http://www.printsupply.com.au vcard: http://www.printaform.com.au/staff/rickm.vcf