Hi Andreas,

We need training data to build OpenNLP models. One approach that will
work is to use heuristics (regex etc., i.e. your Step 1) to build a crude
enhancement engine. You can start working with that engine and correct
its output through user feedback. That feedback, along with the
annotations, can then be used as training data to build an OpenNLP model
(Step 2).
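
As a rough illustration of Step 1, here is a minimal Python sketch of the
heuristics from your mail. The suffix list, the names STREET_RE and
find_streets, and the sample sentence are my own assumptions, and the
pattern will produce false positives (e.g. "Hering" ends in "ring"), which
is exactly what the user feedback is meant to correct.

```python
import re

# Crude heuristic: a capitalised word that ends in a street suffix, or a
# capitalised word followed by a separate suffix word, optionally followed
# by a blank and a house number such as "12" or "5B".
# The suffix list is illustrative, not complete.
STREET_RE = re.compile(
    r"""
    \b
    (?P<street>
        [A-ZÄÖÜ][a-zäöüß]+
        (?:
            (?:str\.|strasse|straße|ring|allee)       # suffix glued to the word
          |
            [ ](?:Str\.|Strasse|Straße|Ring|Allee)    # suffix as a separate word
        )
    )
    (?!\w)                                            # suffix must end the word
    (?:[ ](?P<number>\d+[A-Za-z]?)\b)?                # optional house number
    """,
    re.VERBOSE,
)


def find_streets(text):
    """Return a list of (street, number) tuples; number is None if absent."""
    return [(m.group("street"), m.group("number"))
            for m in STREET_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Das Büro liegt in der Berliner Str. 12, nicht in der Hauptstraße 5B."
    for street, number in find_streets(sample):
        print(street, number)
```

The match offsets (m.start(), m.end()) are what you would turn into
annotations in the engine; they are left out here to keep the sketch short.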

If you can find a German annotated corpus, that would be great. Datasets
built for other use cases often contain addresses as well, like this one:
http://vocabulary.wolterskluwer.de/ (see the court thesaurus RDF,
specifically the street-address property).

You can use such datasets to kickstart the enhancement engine. Do review
the license information, as some of them may be available for research
purposes only.
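
To show how street names pulled from such a dataset could feed Step 2, here
is a small Python sketch that marks known names in a sentence using the
OpenNLP name finder training format (one whitespace-tokenized sentence per
line, with <START:type> ... <END> around each name). The function name
annotate, the type label "street" and the sample names are assumptions for
illustration only; the input sentence is assumed to be tokenized already.

```python
import re


def annotate(sentence, street_names):
    """Wrap every known street name in OpenNLP name finder markers.

    `sentence` is assumed to be whitespace-tokenized already.
    `street_names` is a hypothetical list extracted from a dataset.
    """
    if not street_names:
        return sentence
    # Longest names first, so "Berliner Allee" wins over a plain "Allee".
    names = sorted(street_names, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
    )
    return pattern.sub(lambda m: f"<START:street> {m.group(0)} <END>", sentence)


if __name__ == "__main__":
    known = ["Berliner Allee", "Allee"]  # sample values, not from a real dataset
    print(annotate("Das Gericht sitzt in der Berliner Allee 5 .", known))
```

Lines produced this way still need a manual check before training, since a
plain lookup cannot tell a street name from the same words used otherwise.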

Regards,
Anuj


On Mon, Apr 21, 2014 at 4:52 PM, Andreas Kuckartz <a.kucka...@ping.de> wrote:

> I am about to create an enhancement engine to recognize street names.
> More precisely: Names of streets in Germany contained in German language
> texts.
>
> These are possible approaches:
>
> 1. Using simple heuristics such as these:
>
> Everything beginning with a capital letter and ending with "str.",
> "strasse" or "straße" or " Str." etc. is a street name. Similar for
> "Ring", "Allee" etc. And if a blank and an integer or something like
> "5B" follows that can be considered to be the corresponding street number.
>
> 2. Create an OpenNLP NameFinder model for the "OpenNLP Custom NER Model
> Engine".
>
> Creating such a model seems to require a lot of data:
>
> "The training data should contain at least 15000 sentences to create a
> model which performs well."
> See:
>
> http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
>
> Can such models be created without training data?
> Are there other suggestions?
>
> Cheers,
> Andreas
>
