Hi Joseph,

I don’t remember exactly which features the NER uses, but a general rule of thumb is that you want the training data to resemble the unseen data. Think of the training data as a sampling experiment: the closer the sample gets to the population (the data not yet seen), the better the classifier will work. You can certainly use the presence of a word in a dictionary as a feature, and that will probably help with the classification. If you provide a little more detail about the problem, I can expand the answer a bit.

Daniel
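To illustrate what a dictionary-presence feature could look like, here is a minimal sketch of a per-token feature extractor. This is not the NER's actual feature set; the dictionary contents and feature names are made up for the example:

```python
# Illustrative sketch of a dictionary-presence feature for NER.
# ENTITY_DICT is a hypothetical dictionary of known entity terms.
ENTITY_DICT = {"aspirin", "ibuprofen"}

def token_features(tokens, i):
    """Return a feature dict for token i; 'in_dict' flags dictionary membership."""
    word = tokens[i].lower()
    return {
        "word": word,
        "in_dict": word in ENTITY_DICT,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

tokens = "She took Ibuprofen yesterday".split()
feats = token_features(tokens, 2)
```

A classifier trained on features like these can then generalize from the dictionary hits to the surrounding context, which is what lets it pick up entity mentions the dictionary alone would miss.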
> On Jul 9, 2017, at 9:38 AM, Joseph B. Ottinger <[email protected]> wrote:
>
> I was planning on training my own model, but I wondered what kind of input
> data would give the best results; does the training data have to make
> sense, or be representative of common input? I have a dictionary of terms
> to mark as entities, and while I have a good bit of sensible data, I need
> to add entities to the model fairly often; typically I'll have the entity
> name and fairly little information to go with it, so it'd be easiest to use
> something like a Markov chain generator to generate content around the
> entity, or something. I could also generate fairly static content, but I'd
> prefer to train the system well, if possible.
