Hi Joseph,

If you already have IRC channel data, I would suggest using something like the brat annotator to annotate the entities you want the classifier to find. It may take some time to accumulate enough training data, but it would be exactly the type of training data you want. I think that if you chose to use a Markov chain, you would essentially be training a classifier to learn the parameters of your Markov chain. I don’t want to discourage you from trying the Markov chain; it may work (please report back). I remember hearing somewhere (in the context of neural networks) that synthetic data is useful for training, but not as useful as real data (maybe from Hinton’s Coursera course). I think annotation is the more “standard path” people take. I mention the brat annotator because OpenNLP can already handle data in that format.

Hope it works for you…
Daniel
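
P.S. In case it helps, here is a rough, untested sketch of training a name finder on data in OpenNLP’s default <START:type> … <END> format, written against the 1.8-era API. The file names and the “project” entity type are placeholders I made up. If you keep the brat annotations directly, the opennlp.tools.formats.brat readers should also work, but the plain format is easier to show inline:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainProjectFinder {
    public static void main(String[] args) throws Exception {
        // irc-project.train: one tokenized sentence per line, entities marked like
        //   anyone seen the <START:project> jclouds <END> build break today ?
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("irc-project.train")),
                StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // train a name finder model for the "project" entity type
        TokenNameFinderModel model = NameFinderME.train(
                "en", "project", samples,
                TrainingParameters.defaultParams(),
                new TokenNameFinderFactory());

        try (OutputStream out = new FileOutputStream("en-ner-project.bin")) {
            model.serialize(out);
        }
    }
}
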
> On Jul 9, 2017, at 1:20 PM, Joseph B. Ottinger <[email protected]> wrote:
> 
> *nod* Thanks. The NER will be applied to IRC channel traffic eventually, so
> ideally we'd pull enough channel traffic to start identifying entities
> (projects, really) accurately. The Markov chain idea sounds better and
> better to me as an experiment: take IRC data, replace a few select tokens
> with a placeholder, generate lots of input from the chain, generating
> entities in place of the placeholders. We'll see how well that works as I
> progress.
> 
> On Sun, Jul 9, 2017 at 12:36 PM, Daniel Russ <[email protected]> wrote:
> 
>> Hi Joseph,
>> I don’t remember exactly what features the NER uses, but a general rule
>> of thumb is that you want the training data to resemble the unseen data.
>> Think of the training data as a sampling experiment: the closer the sample
>> gets to the population (data not seen), the better the classifier will
>> work. You certainly can use the presence of a word in a dictionary as a
>> feature, and that will probably help with the classification. If you
>> provide a little more about the problem, I could expand the answer a bit.
>> Daniel
>> 
>>> On Jul 9, 2017, at 9:38 AM, Joseph B. Ottinger <[email protected]> wrote:
>>> 
>>> I was planning on training my own model, but I wondered what kind of input
>>> data would give the best results; does the training data have to make
>>> sense, or be representative of common input? I have a dictionary of terms
>>> to mark as entities, and while I have a good bit of sensible data, I need
>>> to add entities to the model fairly often; typically I'll have the entity
>>> name and fairly little information to go with it, so it'd be easiest to use
>>> something like a Markov chain generator to generate content around the
>>> entity, or something. I could also generate fairly static content, but I'd
>>> prefer to train the system well, if possible.
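
P.P.S. On the placeholder idea quoted above: here is a toy sketch of what that generation step might look like (the class name, entity names, and IRC lines are all invented for illustration). It builds a first-order word chain from lines in which known entity names have been replaced with a placeholder, then generates new lines with the placeholder swapped for a real entity wrapped in <START:project> … <END> markers, so the output feeds straight into the trainer above:

import java.util.*;

// Toy sketch of the placeholder/Markov-chain idea:
// 1) replace known entity names in real IRC lines with a placeholder token,
// 2) train a first-order word Markov chain on those lines,
// 3) generate new lines, swapping the placeholder for an entity wrapped
//    in OpenNLP's <START:project> ... <END> markers.
public class PlaceholderChain {

    static final String PLACEHOLDER = "__ENTITY__";
    final Map<String, List<String>> transitions = new HashMap<>();
    final Random rng = new Random();

    // record transitions from one token to the next, masking known entities
    void addLine(String line, Set<String> knownEntities) {
        String prev = "<s>";
        for (String tok : line.split("\\s+")) {
            String t = knownEntities.contains(tok) ? PLACEHOLDER : tok;
            transitions.computeIfAbsent(prev, k -> new ArrayList<>()).add(t);
            prev = t;
        }
        transitions.computeIfAbsent(prev, k -> new ArrayList<>()).add("</s>");
    }

    // walk the chain, substituting a random entity wherever the placeholder appears
    String generate(List<String> entities) {
        StringBuilder out = new StringBuilder();
        String state = "<s>";
        for (int i = 0; i < 50; i++) {   // hard cap on sentence length
            List<String> next = transitions.get(state);
            if (next == null) break;
            state = next.get(rng.nextInt(next.size()));
            if (state.equals("</s>")) break;
            String word = state.equals(PLACEHOLDER)
                    ? "<START:project> " + entities.get(rng.nextInt(entities.size())) + " <END>"
                    : state;
            out.append(word).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        PlaceholderChain chain = new PlaceholderChain();
        Set<String> known = new HashSet<>(Arrays.asList("jclouds", "karaf"));
        chain.addLine("anyone looked at the jclouds build failure today", known);
        chain.addLine("the karaf release vote passes with 5 +1s", known);
        System.out.println(chain.generate(Arrays.asList("jclouds", "karaf", "openwhisk")));
    }
}

Whether lines generated this way stay close enough to real channel traffic is exactly the question your experiment would be testing.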
