Re: Analysis/tokenization of compound words

Jonathan O'Connor Tue, 19 Sep 2006 09:51:43 -0700

Otis,
I can't offer you any practical advice, but as a student of German, I can tell you that beginners find it difficult to read German compound words and split them properly. The larger your vocabulary, the easier it is. The whole topic sounds like an AI problem.
A possible algorithm for German (no idea if this would also work for English or agglutinative languages like Turkish) might be:
1. Search for the whole word in the dictionary. If found, stop.
2. Split the word into syllables (this might be another AI project too).
3. Join adjacent syllables together and see if they make words in the dictionary.
4. If all the syllables are used in known words, then you have success.
5. A heuristic to use is to create words that are as long as possible.

E.g. "Balletttänzerin" (Balletttaenzerin if you can't read umlauts).
Syllables: "Ball", "ett", "taenz", "er", "in"
Joining the syllables, we see that "Ball" is in our dictionary, but "etttaenzerin", "etttaenzer", "etttaenz" and "ett" are not. So on we go:
"Ballett" is in our dictionary, and so is "taenzerin". Note that if we went for the short words first, we could split it into: Ballett | taenzer | in.
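
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the syllables are taken as given (step 2 is not solved here), the dictionary is a plain in-memory set of lowercase words, and the names SyllableCompoundSplitter, split and splitFrom are made up for this example. It tries the longest join of syllables first and backs up when the remainder cannot be covered by dictionary words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class SyllableCompoundSplitter {

    // Returns the dictionary words that make up the compound, or an empty
    // list if the syllables cannot all be used in known words.
    static List<String> split(List<String> syllables, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        return splitFrom(syllables, 0, dictionary, parts) ? parts : new ArrayList<>();
    }

    private static boolean splitFrom(List<String> syllables, int start,
                                     Set<String> dictionary, List<String> parts) {
        if (start == syllables.size()) {
            return true; // step 4: every syllable has been used in a known word
        }
        // Step 5: prefer long words, so try the longest join first.
        // At start == 0 the first candidate is the whole word (step 1).
        for (int end = syllables.size(); end > start; end--) {
            String candidate = String.join("", syllables.subList(start, end));
            if (dictionary.contains(candidate)) {
                parts.add(candidate);
                if (splitFrom(syllables, end, dictionary, parts)) {
                    return true;
                }
                parts.remove(parts.size() - 1); // dead end, try a shorter join
            }
        }
        return false;
    }
}
```

With the syllables "ball", "ett", "taenz", "er", "in" and a toy dictionary containing "ball", "ballett", "taenzer", "taenzerin" and "in", split first accepts "ballett" (the longer joins are not in the dictionary) and then "taenzerin". If "taenzerin" is missing from the dictionary, the same call falls back to ballett | taenzer | in.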

As usual, it's an interesting project with no 100% perfect solution. Best of luck.
Jonathan O'Connor
XCOM Dublin
Otis Gospodnetic <[EMAIL PROTECTED]>


19/09/2006 17:21

Please respond to: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)? I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input one character at a time, looking for a word match in the dictionary after each processed character. Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as a set of tokens at the same position. However, somehow this doesn't strike me as a very smart and fast approach. What are some better approaches? If anyone has implemented anything that deals with this problem, I'd love to hear about it.

Thanks,
Otis
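
For reference, the dictionary-scan idea in the question could look something like the sketch below. It is plain Java and uses no Lucene classes; DictionarySubwordScanner, subwords and the minLength cutoff are hypothetical names introduced here. It also assumes that every start offset is tried, which goes a little beyond the description and is one reason the approach is not fast: the number of dictionary lookups grows quadratically with the token length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class DictionarySubwordScanner {

    // Collects every dictionary word found inside the token, in order of
    // start offset. A real tokenizer would emit these as tokens at the
    // same position as the original token.
    static List<String> subwords(String token, Set<String> dictionary, int minLength) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (int start = 0; start < lower.length(); start++) {
            // Grow the candidate one character at a time and check the
            // dictionary after each added character.
            for (int end = start + minLength; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }
}
```

Called with "CompoundWordLikeThis", a dictionary holding "compound", "word", "like" and "this", and a minimum length of 3, the method returns those four words in that order.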

