*nod* Thanks. The NER will be applied to IRC channel traffic eventually, so ideally we'd pull enough channel traffic to start identifying entities (projects, really) accurately. The Markov chain idea sounds better and better to me as an experiment: take IRC data, replace a few select tokens with a placeholder, generate lots of input from the chain, and substitute entity names for the placeholders. We'll see how well that works as I progress.
On Sun, Jul 9, 2017 at 12:36 PM, Daniel Russ <[email protected]> wrote:

> Hi Joseph,
> I don’t remember exactly what features the NER uses, but a general rule
> of thumb is that you want the training data to resemble the unseen data.
> Think of the training data as a sampling experiment: the closer the sample
> gets to the population (data not seen), the better the classifier will
> work. You certainly can use the presence of a word in a dictionary as a
> feature, and that will probably help with the classification. If you
> provide a little more about the problem, I could expand the answer a bit.
> Daniel
>
> > On Jul 9, 2017, at 9:38 AM, Joseph B. Ottinger <[email protected]> wrote:
> >
> > I was planning on training my own model, but I wondered what kind of input
> > data would give the best results; does the training data have to make
> > sense, or be representative of common input? I have a dictionary of terms
> > to mark as entities, and while I have a good bit of sensible data, I need
> > to add entities to the model fairly often; typically I'll have the entity
> > name and fairly little information to go with it, so it'd be easiest to use
> > something like a Markov chain generator to generate content around the
> > entity, or something. I could also generate fairly static content, but I'd
> > prefer to train the system well, if possible.
