Thanks for the quick reply! I was thinking about using a variation of the bag-of-words approach for the actual model, so LIBSVM/LIBLINEAR is probably a better fit for my data than OpenNLP, though I do appreciate advice on alternative approaches =)
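As a rough sketch of the bag-of-words idea: the snippet below turns a review into one line of the sparse "label index:value" text format that LIBSVM and LIBLINEAR read. The tokenizer, the vocabulary handling and all function names are illustrative assumptions, not part of either library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and underscores (a crude tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token to a 1-based feature index, in sorted token order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens, start=1)}


def to_libsvm_line(label, text, vocabulary):
    """Encode one document as 'label index:count ...' with ascending feature indices."""
    counts = Counter(tok for tok in tokenize(text) if tok in vocabulary)
    features = sorted((vocabulary[tok], n) for tok, n in counts.items())
    return " ".join([str(label)] + [f"{i}:{n}" for i, n in features])


if __name__ == "__main__":
    reviews = [(1, "great great cd"), (-1, "bad cd")]
    vocab = build_vocabulary(text for _, text in reviews)
    for label, text in reviews:
        print(to_libsvm_line(label, text, vocab))
```

Tokens that are not in the vocabulary are silently dropped, which is one simple way to handle unseen words at test time; raw counts could be swapped for binary or tf-idf values without changing the output format.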
The idea behind using NER was to make it easier to find correlations with nearby words (i.e. once a named entity is found, replace it with a predetermined token and use that token for correlations and as input to the model) rather than "just" extracting the entities. Since the product name and company are uniquely linked to a product, I would prefer some kind of NER that takes that into account, so that the information known beforehand can improve precision and recall.

Sincerely

On 31 Oct 2014, at 22:00, Mark G <[email protected]> wrote:

> Well here are my thoughts... if you know the product review is associated
> with a name, do you still need to perform NER to get names out? If not, one
> approach I have used for sentiment analysis in OpenNLP (I run a pretty
> large scale production app that performs sentiment analysis with a model
> generated from millions of samples) is to find some words or phrases that you
> are certain 99% of the time are indicators of sentiment, like "this sucks"
> or "awesome", put those words in a set, read in your data in Java (or
> whatever), use regex or .contains, and whatever gets a hit on a word or
> phrase, save that off as an initial training set and build a Doccat model
> from those. Then run more data through the model, pick the top scorers and
> add them to the samples etc... do this until you converge on a decent model
> and then use it.
> If you need to pull out names, do the same thing... start with a list of
> known names, create the NER file format by finding the known names, train
> the model with those initial sentences, then iterate. For iterative NER
> training there is an addon I wrote that can help with that, called
> modelbuilder-addon. It's like semi-supervised teaching....
>
> On Fri, Oct 31, 2014 at 4:43 PM, Alexander Wallin <[email protected]> wrote:
>
>> Hi!
>>
>> I'm writing a sentiment analysis application based on product reviews and
>> am interested in using OpenNLP for identifying named entities and
>> tokenization. The problem is that the standard models on the project
>> homepage aren't identifying nearly enough entities, and training a completely
>> new model based on my data is outside the scope of my project.
>>
>> Both training and test set texts have additional information available; is
>> there any way to augment (for instance) the person model to (be more likely
>> to) properly identify Britney Spears as a person in case the text is a
>> product review of her CD (and it's known beforehand that it's "her" CD) or
>> to identify Google as a company if it's a review of one of their products
>> (under the same conditions)?
>>
>> Is a (pre-trained) model approach incorrect? Should I use a regex-based
>> model instead? Some other approach? Or is the idea unfeasible and I should reconsider?
>>
>> I appreciate any answer and whatever time you spent reading my email.
>>
>> Sincerely
>>
>> Alexander
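
The token-replacement idea from the reply at the top (swap an entity that is already known from the review's metadata for a predetermined token before building features) could be sketched roughly as follows. This is a plain lookup under stated assumptions, not an OpenNLP feature: the function name, the placeholder tokens and the shape of the metadata dictionary are all hypothetical.

```python
import re


def replace_known_entities(text, known_entities):
    """Replace each known surface form with its placeholder token.

    known_entities maps a surface form (e.g. "Britney Spears") to a
    placeholder (e.g. "ARTIST_TOKEN"). Matching is case-insensitive and
    restricted to whole words; longer names are tried first so that
    "Google Inc" wins over "Google".
    """
    if not known_entities:
        return text
    names = sorted(known_entities, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {name.lower(): token for name, token in known_entities.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


if __name__ == "__main__":
    # Hypothetical metadata attached to a single review.
    metadata = {"Britney Spears": "ARTIST_TOKEN"}
    print(replace_known_entities("I love Britney Spears, her CD rocks.", metadata))
```

Because the names come from metadata that is specific to each review, the lookup is only as good as that metadata: misspellings, nicknames and pronouns ("her CD") are not caught, which is where a trained or iteratively bootstrapped NER model would still add recall.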
