A long time ago I prototyped a word uncompounder for Dutch.
Though it worked, it was far from elegant and supported only Dutch.

Earlier this week I found a more elegant solution, able to uncompound
words like 'langetermijnplanning' into 'lange termijn planning'.

In Dutch there are 4 possible compounding insertions: none (word+word), 
an s (word+s+word), a dash (word+-+word) and the combination (word+s-+word).
In theory, the number of parts in a compound is not limited in any way.
Generally, uncompounding works well with parts of at least 5 chars. 
Shorter parts lead to wrongly uncompounded words. Some parts of shorter 
length are still safe to use though (e.g. jazz).
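
To make the rules above concrete, here is a minimal sketch in Python. It
is not the prototype code itself, only an illustration under stated
assumptions: the tiny LEXICON, the SHORT_WHITELIST and the names is_part
and uncompound are all hypothetical, and a real implementation would load
its word list from a dictionary.

```python
from typing import List, Optional

# Hypothetical toy lexicon; a real one would be loaded from a dictionary.
LEXICON = {"lange", "termijn", "planning", "station", "plein", "jazz", "muziek"}
# Short parts that are considered safe despite the length limit (assumption).
SHORT_WHITELIST = {"jazz"}
# Parts shorter than this tend to produce wrong splits.
MIN_PART_LEN = 5
# The four Dutch insertions: none, s, dash, and s followed by a dash.
LINKERS = ("", "s", "-", "s-")


def is_part(candidate: str) -> bool:
    """A part must be a known word and either long enough or whitelisted."""
    if candidate not in LEXICON:
        return False
    return len(candidate) >= MIN_PART_LEN or candidate in SHORT_WHITELIST


def uncompound(word: str) -> Optional[List[str]]:
    """Return the parts of a compound, or None if no split is found."""
    word = word.lower()
    if is_part(word):
        return [word]
    # Try every prefix as the first part, then recurse on the remainder
    # after stripping one of the allowed insertions.
    for end in range(1, len(word)):
        head = word[:end]
        if not is_part(head):
            continue
        rest = word[end:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = uncompound(rest[len(linker):])
            if tail is not None:
                return [head] + tail
    return None


print(uncompound("langetermijnplanning"))  # ['lange', 'termijn', 'planning']
print(uncompound("stationsplein"))         # ['station', 'plein']
print(uncompound("jazzmuziek"))            # ['jazz', 'muziek']
```

Because the recursion places no limit on depth, the number of parts is
unbounded, and supporting another language would mostly mean swapping
LEXICON, LINKERS and MIN_PART_LEN.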

Now my question: what about other languages?
- Is your language compounding or not?
- Are there special situations when compounding, like letters changing
  at the concatenation point?
- Which concatenation insertions are there in your language?
- Which part of the compound is semantically the essence of the word?
  (langetermijnplanning, 'long term plan', is mostly a plan; 'term' and
  'long' are specifiers)

Once I know a bit more, I could try to adjust the prototype code to
support multiple languages by design.

Thanks in advance,

Ruud

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
