Here are my thoughts. If you already know which name a product review is associated with, do you still need to run NER to extract names?

If not, here is an approach I have used for sentiment analysis with OpenNLP (I run a fairly large-scale production app that performs sentiment analysis with a model built from millions of samples). Find some words or phrases that you are confident indicate sentiment about 99% of the time, like "this sucks" or "awesome", and put them in a set. Read your data in Java (or whatever you prefer), use a regex or .contains to find the texts that hit any of those words or phrases, save those off as an initial training set, and build a Doccat model from them. Then run more data through the model, pick the top scorers, add them to the samples, and retrain. Repeat until you converge on a decent model, then use it.

If you do need to pull out names, do the same thing: start with a list of known names, find the sentences that contain them and write those out in the NER training file format, train a model on those initial sentences, then iterate. For iterative NER training, I wrote an addon called modelbuilder-addon that can help. It works like semi-supervised learning.
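To make the seed step a bit more concrete, here is a rough Java sketch of how the initial training lines could be produced. It does not call OpenNLP at all; it only writes lines in the shape I understand the trainers to read (Doccat: one document per line with the category first; name finder: whitespace-tokenized sentences with `<START:type> ... <END>` markup). The class name `SeedBootstrap`, the `SEEDS` map, and both helper methods are made up for illustration, and the seed phrases are just the two examples above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeedBootstrap {

    // Hypothetical seed phrases mapped to a category label.
    static final Map<String, String> SEEDS = new LinkedHashMap<>();
    static {
        SEEDS.put("this sucks", "negative");
        SEEDS.put("awesome", "positive");
    }

    /**
     * Returns a "category text" training line if the review hits seeds of
     * exactly one category; returns null when nothing hits or seeds conflict.
     */
    static String toDoccatSample(String review) {
        String lower = review.toLowerCase(Locale.ROOT);
        Set<String> labels = new HashSet<>();
        for (Map.Entry<String, String> seed : SEEDS.entrySet()) {
            if (lower.contains(seed.getKey())) {
                labels.add(seed.getValue());
            }
        }
        if (labels.size() != 1) {
            return null; // no signal, or mixed signal: leave it out of the seed set
        }
        // Collapse whitespace so the whole document stays on one line.
        String oneLine = review.replaceAll("\\s+", " ").trim();
        return labels.iterator().next() + " " + oneLine;
    }

    /**
     * Wraps each known name in name-finder markup. Assumes the sentence is
     * already whitespace-tokenized and that the known names do not overlap.
     */
    static String annotateNames(String sentence, List<String> knownNames, String type) {
        String out = sentence;
        for (String name : knownNames) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(name) + "\\b");
            String tagged = "<START:" + type + "> " + name + " <END>";
            out = p.matcher(out).replaceAll(Matcher.quoteReplacement(tagged));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(toDoccatSample("This sucks, it broke in a day"));
        System.out.println(annotateNames(
                "I love the new Britney Spears album .",
                Arrays.asList("Britney Spears"), "person"));
    }
}
```

Lines that come back non-null from `toDoccatSample` would go into the first Doccat training file; after training, the highest-scoring new texts get appended and the model is rebuilt. Dropping reviews that hit both a positive and a negative seed is my own choice here, to keep the seed set clean.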
On Fri, Oct 31, 2014 at 4:43 PM, Alexander Wallin <[email protected]> wrote:

> Hi!
>
> I’m writing a sentiment analysis application based on product reviews and
> am interested in using opennlp for identifying named entities and
> tokenization. The problem is that the standard models on the project
> homepage isn’t identifying nearly enough entities and training a completely
> new model based on my data is outside the scope of my project.
>
> Both training and test set texts has additional information available; is
> there any way to augment (for instance) the person model to (be more likely
> to) properly identify Britney Spears as a person in case the text is a
> product review of her CD (and it’s known beforehand that it’s ”her” CD) or
> to identify Google as a company if it’s a review of one of their products
> (under the same conditions)?
>
> Is a (pre trained) model approach incorrect? Should I use a regex based
> model instead? Other approach? Unfeasible idea and I should reconsider?
>
>
> I appreciate any answer and whatever time you spent reading me email.
>
>
> Sincerely
>
> Alexander
