I'm working on NameFinder too. How can I determine the right parameters (iterations, cutoff and feature generation) for my use case? Are there any guidelines?
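(For context, iterations and cutoff are usually supplied to the trainer through a training-parameters file, e.g. via the TokenNameFinderTrainer tool's -params option. A minimal sketch of such a file; the values below are placeholders to experiment with, not tuned recommendations:)

```
Algorithm=MAXENT
Iterations=100
Cutoff=5
```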

Thanks,
    Riccardo

On 18/01/2012 22:15, Benson Margulies wrote:
Here at Basis, we train an English Uppercase model by just uppercasing
our training data. The accuracy degrades notably, but the model is
still useful. If the real use case is some sort of peculiar text (such
as cables or something) you probably won't be happy until you tag a
training corpus of the actual data involved.
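The uppercasing step described above can be sketched as a small transform over each line of the training file; the annotation tags must be left alone so the training format stays valid. Class and method names here are hypothetical, a sketch rather than Basis's actual tooling:

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class UppercaseTrainingData {

    // OpenNLP name annotations like <START:date> and <END> must be left
    // untouched, or the training-format parser would see different tags.
    private static final Pattern TAG = Pattern.compile("<(START(:[^>]+)?|END)>");

    // Uppercase every whitespace-separated token except the annotation tags.
    static String uppercaseLine(String line) {
        return Arrays.stream(line.split(" "))
                .map(tok -> TAG.matcher(tok).matches() ? tok : tok.toUpperCase())
                .collect(Collectors.joining(" "));
    }
}
```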

On Wed, Jan 18, 2012 at 4:14 PM, mark meiklejohn
<[email protected]>  wrote:
Hi Jörn,

Thanks for your quick response.

Primarily the language is English, probably more American than European.

Domain-wise, the NER is 'date' related; otherwise, the input data is domain
independent. The current implementation/model for NER date detection is very
good; it is the odd edge case, such as lower-case days, that causes problems.

I could probably go to the lengths of writing a regex for it, but it would
be better to have an NLP solution, as those components are already scanning the
input texts.
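(Such a regex fallback could be as small as a case-insensitive day-name pattern. A sketch, assuming Java and hypothetical names; the NER model would remain the primary detector:)

```java
import java.util.regex.Pattern;

public class DayMatcher {

    // Case-insensitive day-of-week pattern: a narrow fallback for edge
    // cases like lower-case day names that the model misses.
    private static final Pattern DAY = Pattern.compile(
            "\\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\\b",
            Pattern.CASE_INSENSITIVE);

    static boolean containsDay(String text) {
        return DAY.matcher(text).find();
    }
}
```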

Your UIMA-based annotation tooling sounds interesting and worth a look.

Thanks

Mark

On 18/01/2012 21:05, Jörn Kottmann wrote:
On 1/18/12 8:35 PM, mark meiklejohn wrote:
James,

I agree the correct way is to ensure upper case. But when you have no
control over the input, it makes things a little more difficult.

So, I may look at a training set. What is the recommended size of a
training set?

In an annotation project I was working on lately, our models started to work
after a couple of hundred news articles. It of course depends on your
language, domain, and the entities you want to detect.

To make training easier, I started to work on UIMA-based annotation
tooling; let me know if you would like to try it. Any feedback is very
welcome.

Jörn





