Here at Basis, we train an English uppercase model by simply uppercasing
our training data. The accuracy degrades noticeably, but the model is
still useful. If the real use case is some peculiar kind of text (such
as cables), you probably won't be happy until you tag a training corpus
of the actual data involved.
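For illustration, here is a minimal sketch of that preprocessing step. The
file names and the trainer command in the comment are placeholders, and it
assumes the standard OpenNLP name-finder training format with
<START:...> / <END> tags, which should be left untouched when uppercasing:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Uppercases an OpenNLP name-finder training file while leaving the
// <START:...> / <END> annotation tags alone, so the entity type names
// (e.g. "date") keep their original case.
public class UppercaseTrainingData {

    // Matches the OpenNLP name annotation tags so they can be skipped.
    private static final Pattern TAG = Pattern.compile("<START:[^>]+>|<END>");

    static String uppercaseOutsideTags(String line) {
        Matcher m = TAG.matcher(line);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(line.substring(last, m.start()).toUpperCase(Locale.ENGLISH));
            out.append(m.group()); // keep the tag as-is
            last = m.end();
        }
        out.append(line.substring(last).toUpperCase(Locale.ENGLISH));
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String upper = Files.readAllLines(
                        Paths.get("en-ner-date.train"), StandardCharsets.UTF_8).stream()
                .map(UppercaseTrainingData::uppercaseOutsideTags)
                .collect(Collectors.joining("\n"));
        Files.write(Paths.get("en-ner-date-upper.train"),
                upper.getBytes(StandardCharsets.UTF_8));
        // Then train as usual, e.g. with the command line trainer
        // (check the exact flags for your OpenNLP version):
        //   opennlp TokenNameFinderTrainer -lang en \
        //     -data en-ner-date-upper.train -model en-ner-date-upper.bin
    }
}

At runtime you of course also have to uppercase the incoming text before
feeding it to such a model.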

On Wed, Jan 18, 2012 at 4:14 PM, mark meiklejohn
<[email protected]> wrote:
> Hi Jörn,
>
> Thanks for your quick response.
>
> Primarily the language is English, probably more American than
> European.
>
> Domain-wise, the NER is 'date' related; otherwise, the input data is domain
> independent. The current implementation/model for NER date detection is very
> good; it is the odd edge case, such as lower-case days, that causes problems.
>
> I could probably go to the lengths of writing a regex for it, but it would
> be better to have an NLP solution, as these are already scanning the input texts.
>
> Your UIMA-based annotation tooling sounds interesting and worth a look.
>
> Thanks
>
> Mark
>
> On 18/01/2012 21:05, Jörn Kottmann wrote:
>>
>> On 1/18/12 8:35 PM, mark meiklejohn wrote:
>>>
>>> James,
>>>
>>> I agree the correct way is to ensure upper case. But when you have no
>>> control over the input, it makes things a little more difficult.
>>>
>>> So, I may look at a training set. What is the recommended size of a
>>> training set?
>>>
>>
>> In an annotation project I was doing lately, our models started to work
>> after a couple of hundred news articles. It of course depends on your
>> language, domain and the entities you want to detect.
>>
>> To make training easier, I started to work on UIMA-based annotation
>> tooling; let me know if you would like to try that. Any feedback is very
>> welcome.
>>
>> Jörn
>>
