First of all, I should probably congratulate the Germans on this list -- Dirk Nowitzki's outstanding performance during this year's NBA finals will become part of basketball history. As a Pole, I admit I'm really freaking jealous.
Now... back to the subject. A number of people have recently expressed an interest in a decompounding engine for German (we talked about it during Berlin Buzzwords, among other occasions). I did some research on the subject (even though I don't speak the language):

- a few commercial products exist (usually paired with morphological analyzers) and their quality seems to be very good; TAGH (http://tagh.com) is one example;
- research papers on the subject also exist, including a readily available project by Torsten Marek that uses FSTs to model the probabilities of word links; unfortunately its evaluation data set seems to be skewed and is not usable;
- Daniel Naber maintains the jWordSplitter project on SourceForge; it is a greedy heuristic backed by a static morphological dictionary and works surprisingly well in practice (we cannot measure its quality due to the lack of a proper evaluation data set -- see below).

In the past few days I've played with a number of resources containing German words and n-grams (Google n-grams, the dictionaries in LanguageTool and jWordSplitter, the deWaC corpus) and my gut feeling is that a "perfect" solution is not possible, but something that works in the large majority of cases is achievable with a heuristic much like the one implemented in jWordSplitter. The advantage of this approach is that we don't need a full-blown POS dictionary or deep contextual disambiguation (and we can handle unknown words to some degree). The disadvantage is that there will be errors resulting from ambiguities and simplifying assumptions.

As a start I have (re)implemented a naive heuristic that splits compounds based on a dictionary of surface forms and a predefined set of glue morphemes (the dictionary is under CC-SA: http://creativecommons.org/licenses/by-sa/3.0/, which seems to be accepted by Apache based on this post: http://www.apache.org/legal/resolved.html#cc-sa). A rough sketch of the idea is at the end of this e-mail.

But in order to develop and improve it further, we REALLY need a "gold standard" file: something that includes known compound splits and serves as the benchmark we refer to when trying new algorithms or ideas.

And here comes your part: if you are a speaker of German and would like to help, you are more than welcome to. The project is currently hosted on GitHub: https://github.com/dweiss/compound-splitter. The test file is in src/test/resources/test-compounds.utf8 and the README contains instructions on adding new test cases. You can either fork the project on GitHub or e-mail your compounds back to me, whatever works for you. I don't expect full consensus among humans as to which splits are legitimate and which are invalid, so you can also review or comment on the existing test cases.

If you're looking for inspiration on where to find compounds to tag and split, Google n-grams is your friend. I added a google-ngrams.bycount file that lists surface words with counts aggregated between 1980 and 2008 or so. Pick a spot on that list and decompound, decompound :)

If you wish to do something else, there is another file called morphy-google-intersect.20000, which contains words from the Google list that are not present in Morphy (the German dictionary we use for decompounding). Lots of these are foreign words, but there is a fair share of German words (and, hint, hint, compounds) that are simply newer or inflected in unusual ways.

Let's see where we can take this.
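PS. Since I mention the dictionary-plus-glue-morphemes idea above, here is a rough sketch of what such a greedy splitter looks like. To be clear, this is NOT the code in the repository and not jWordSplitter, just an illustration: the tiny hard-coded dictionary, the glue list and the class name are made up for the example (the real thing would load Morphy surface forms instead).

import java.util.*;

/**
 * Illustrative sketch of a greedy, dictionary-driven compound splitter:
 * cover the input word with known surface forms, optionally bridged by a
 * small set of glue morphemes (Fugenelemente such as "s", "es", "en").
 */
public class NaiveCompoundSplitter {
  private final Set<String> dictionary;
  private final List<String> glueMorphemes = Arrays.asList("s", "es", "en", "n", "e", "er");
  private static final int MIN_PART_LENGTH = 3;

  public NaiveCompoundSplitter(Set<String> dictionary) {
    this.dictionary = dictionary;
  }

  /** Returns the parts of a greedy split, or the word itself if no split was found. */
  public List<String> split(String word) {
    List<String> parts = new ArrayList<>();
    if (splitFrom(word.toLowerCase(Locale.GERMAN), 0, parts)) {
      return parts;
    }
    return Collections.singletonList(word);
  }

  /** Greedily tries the longest dictionary prefix first, backtracking on failure. */
  private boolean splitFrom(String word, int start, List<String> parts) {
    if (start == word.length()) {
      return parts.size() > 1; // only accept genuine compounds
    }
    for (int end = word.length(); end - start >= MIN_PART_LENGTH; end--) {
      String candidate = word.substring(start, end);
      if (dictionary.contains(candidate)) {
        parts.add(candidate);
        if (splitFrom(word, end, parts)) {
          return true;
        }
        // Allow an optional glue morpheme between parts.
        for (String glue : glueMorphemes) {
          if (word.startsWith(glue, end) && splitFrom(word, end + glue.length(), parts)) {
            return true;
          }
        }
        parts.remove(parts.size() - 1);
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Toy dictionary of surface forms, purely for the example.
    Set<String> dict = new HashSet<>(Arrays.asList("staat", "schuld", "recht"));
    NaiveCompoundSplitter splitter = new NaiveCompoundSplitter(dict);
    System.out.println(splitter.split("Staatsschuldenrecht")); // -> [staat, schuld, recht]
  }
}

The first complete cover found wins, which is exactly where the ambiguity errors mentioned above come from -- and why a gold standard file is needed to measure how often that actually hurts.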
Dawid