On 25.01.2011 21:01, Gary Gregory wrote:
Hi All:
I just found a data set that I would like to integrate with [codec] to test the
language package:
http://sourceforge.net/projects/familynamephon/
The test data file contains 837K German names (37MB) in a text file and
encodings in Cham (?) phonetics, Cologne phonetics, Metaphone, and Soundex.
I have no idea how long it would take to run our language encoders against
this file, so I imagine making it an optional unit test. How do you do THAT in
Maven?
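One possible approach, sketched here under assumptions rather than taken from
[codec]: gate the test on a system property that names the extracted data file,
and skip the test when the property is absent or the file is not readable. The
class name BigDataTestGate and the property codec.bigdata.file are hypothetical.
In a JUnit 4 test the result could be passed to org.junit.Assume.assumeTrue(...)
so that the test is reported as skipped instead of failed.

```java
import java.io.File;

/**
 * Hypothetical helper that decides whether an optional, data-heavy test
 * should run. Nothing here is part of commons-codec.
 */
public final class BigDataTestGate {

    /** Hypothetical system property holding the path to the name data file. */
    public static final String DATA_FILE_PROPERTY = "codec.bigdata.file";

    private BigDataTestGate() {
        // static helper, no instances
    }

    /**
     * Returns the data file when the property is set and points to a
     * readable regular file, otherwise null.
     */
    public static File dataFileOrNull() {
        String path = System.getProperty(DATA_FILE_PROPERTY);
        if (path == null || path.trim().isEmpty()) {
            return null;
        }
        File file = new File(path);
        return (file.isFile() && file.canRead()) ? file : null;
    }

    /** True when the optional test has the data it needs to run. */
    public static boolean enabled() {
        return dataFileOrNull() != null;
    }
}
```

With this in place the default build would skip the test, and something like
"mvn test -Dcodec.bigdata.file=/path/to/names.txt" would turn it on, assuming
Surefire forwards the property to the test JVM (if it does not, a
systemPropertyVariables entry in the POM may be needed).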
The data is covered (I think, I do not read German) by this license:
http://www.opendatacommons.org/licenses/odbl/1.0/
Being a native German speaker, I can confirm that the license is indeed
the Open Database License, which can be found at the URL you provided.
Cham phonetics seems to be a special algorithm for encoding names. [1]
contains more background information about it (unfortunately also in
German). According to this page, the name stems from a region in Bavaria.
You can find a PHP implementation of this algorithm in [2].
HTH
Oliver
[1] http://www.genealogie-konzepte.net/chamer-phonetik
[2] http://www.genealogie-konzepte.net/chamer-phonetik/implementierung
Thoughts?
Gary Gregory
Senior Software Engineer
Rocket Software
3340 Peachtree Road, Suite 820 * Atlanta, GA 30326 * USA
Tel: +1.404.760.1560
Email: ggreg...@seagullsoftware.com
Web: http://www.seagull.rocketsoftware.com/