Re: tricky parsing question

Rick Measham Thu, 22 Jan 2004 20:06:38 -0800

On 23 Jan 2004, at 01:21 pm, wren argetlahm wrote:

I'm working on a linguistic module and I'm trying to
find a good way to split a string up into "segments".
I can't assume single character strings and want to
assume maximal segments. As an example, the word
"church" would be rendered as the list ('ch', 'u',
'r', 'ch') and wouldn't break the "ch" up smaller even
though both "c" and "h" are valid segments in English.
I have all the valid segments for a given language
stored as keys in a hash, now I just need an algorithm
to chop up a string into a list. Any ideas?

Wren, when you say 'segments' it appears you mean phonemes or other phonetic units.

CPAN has several modules that may help you:

Lingua::Phoneme uses the Moby Pronunciation Dictionary to find the phonemes.

Text::Metaphone also deals with phonemes and will return 'Church' as 'XRX' meaning 'ch', 'r', 'ch'. Unfortunately it returns the 'ch' in 'Character' as an 'X' also.

And that, of course, is the most difficult part. English is such a hodge-podge of hacks from other languages that understanding it via algorithms is very hard.
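
That said, if all you need is the purely orthographic split you describe (valid segments as hash keys, longest one wins), a greedy longest-match scan may be enough. Here is a rough sketch in Python rather than Perl, just to show the idea; the function name segment and the tiny English set are made up for illustration and are not taken from any module mentioned above.

```python
def segment(word, segments):
    """Split word into valid segments, always taking the longest match first."""
    # Length of the longest known segment bounds how far ahead we look.
    longest = max(map(len, segments), default=0)
    result = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate at this position, then shrink.
        for size in range(min(longest, len(word) - pos), 0, -1):
            candidate = word[pos:pos + size]
            if candidate in segments:
                result.append(candidate)
                pos += size
                break
        else:
            # No segment of any length matched here.
            raise ValueError(f"no valid segment at position {pos} in {word!r}")
    return result


# Hypothetical segment inventory; a real one would come from your hash.
english = {"ch", "c", "h", "u", "r"}
print(segment("church", english))  # ['ch', 'u', 'r', 'ch']
```

One caveat: this never backtracks, so it can fail on a string that does have a valid split. With the segments 'ab', 'a' and 'bc', the string 'abc' is taken as 'ab' and then gets stuck on 'c'. If your inventories have that kind of overlap you would need backtracking or dynamic programming instead.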

Cheers!
Rick


Rick Measham
Senior Designer and Developer

Printaform Pty Ltd
Tel: (03) 9850 3255
Fax: (03) 9850 3277
http://www.printaform.com.au
http://www.printsupply.com.au
vcard: http://www.printaform.com.au/staff/rickm.vcf
