I am about to create an enhancement engine that recognizes street names. More precisely: names of streets in Germany as they appear in German-language texts.
These are possible approaches:

1. Using simple heuristics such as these: Everything beginning with a capital letter and ending with "str.", "strasse" or "straße" or " Str." etc. is a street name. Similar for "Ring", "Allee" etc. And if a blank and an integer or something like "5B" follows, that can be considered to be the corresponding street number.

2. Create an OpenNLP NameFinder model for the "OpenNLP Custom NER Model Engine". Creating such a model seems to require a lot of data: "The training data should contain at least 15000 sentences to create a model which performs well."
   See: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind

Can such models be created without training data? Are there other suggestions?

Cheers,
Andreas
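P.S. To make approach 1 concrete, here is a rough sketch of the heuristic as a regular expression. It is written in Python only to keep it short; the same pattern idea should carry over to java.util.regex inside an engine. The names FUSED, SEPARATE, STREET and find_streets are made up for this illustration, the suffix list only covers the examples mentioned above, and restricting names to a single capitalized word is a simplifying assumption.

```python
import re

# Suffixes fused onto the name, e.g. "Hauptstraße", "Goethestr."
# (illustrative list taken from the examples above, not exhaustive)
FUSED = r"(?:straße|strasse|str\.|ring|allee)"
# The same words as a separate capitalized token, e.g. "Berliner Str."
SEPARATE = r"(?:Straße|Strasse|Str\.|Ring|Allee)"

STREET = re.compile(
    r"\b(?P<name>[A-ZÄÖÜ][a-zäöüß]+"          # one capitalized word ...
    + r"(?:" + FUSED + r"|\s" + SEPARATE + r"))"  # ... ending in or followed by a suffix
    + r"(?!\w)"                                # suffix must end the word ("Straßenbahn" is rejected)
    + r"(?:\s(?P<number>\d+[A-Za-z]?)\b)?"     # optional house number such as "12" or "5B"
)


def find_streets(text):
    """Return (street name, house number or None) for each heuristic match."""
    return [(m.group("name"), m.group("number")) for m in STREET.finditer(text)]


if __name__ == "__main__":
    print(find_streets("Ich wohne in der Goethestr. 5B, nicht am Berliner Ring 12."))
```

This will produce false positives: any capitalized word that happens to end in "ring" or "allee" (for example "Hering") is reported as a street, and multi-word or hyphenated names are not handled at all. That is exactly the weakness that makes me wonder whether a trained model would be worth the effort.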