Hi all,
I originally posted the question below on StackOverflow, but have not received
any answers yet. This list seemed like a good place to try next.
I am using OpenNLP to extract noun phrases from documents. In reviewing the
output, I discovered that the phrase chunker treats commas as part of noun
phrases, so that, for example, all the elements of a list, or two clauses of
a sentence, end up merged into a single noun phrase. As a dummy example:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TestTokenizer {
    public static void main(String[] args) throws IOException {
        String content = "dog, cat, fish, rat";
        String[] tokens =
                NLPToolsControllerOpenNLP.getInstance().getTokeniser().tokenize(content);
        String[] pos =
                NLPToolsControllerOpenNLP.getInstance().getPosTagger().tag(tokens);
        String[] phrases =
                NLPToolsControllerOpenNLP.getInstance().getPhraseChunker().chunk(tokens, pos);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println("Token: " + tokens[i] + " and chunk: " + phrases[i]);
        }
        List<String> candidates = new ArrayList<String>();
        String phrase = "";
        for (int n = 0; n < tokens.length; n++) {
            if (phrases[n].equals("B-NP")) {
                phrase = tokens[n];
                for (int m = n + 1; m < tokens.length; m++) {
                    if (phrases[m].equals("I-NP")) {
                        phrase = phrase + " " + tokens[m];
                    } else {
                        // step back one so the outer loop's n++ lands on m;
                        // otherwise a B-NP immediately after this chunk is skipped
                        n = m - 1;
                        break;
                    }
                }
                phrase = phrase.replaceAll("\\s+", " ").trim();
                candidates.add(phrase);
                System.out.println("phrase: " + phrase);
            }
        }
    }
}

outputs:

Token: dog and chunk: B-NP
Token: , and chunk: I-NP
Token: cat and chunk: I-NP
Token: , and chunk: I-NP
Token: fish and chunk: I-NP
Token: , and chunk: O
Token: rat and chunk: B-NP
phrase: dog , cat , fish
phrase: rat
Parentheses have the same issue: because the chunker tags them with I-NP, I
end up with noun phrases containing them.
The OpenNLP documentation says that "The **OpenNLP Sentence Detector** can
detect that a punctuation character marks the end of a sentence or not." As
such, I am a bit surprised that the phrase chunker cannot detect that a comma
or a parenthesis marks the beginning or end of a phrase. Is there something I
am missing here? Is there another approach that I should use? I am trying to
avoid handling these issues case by case across a large corpus.
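For what it's worth, the workaround I am experimenting with is to post-process
the chunk tags myself: rebuild the noun phrases from the tokens, but drop
punctuation tokens and start a new phrase whenever one appears inside a chunk.
This is just a heuristic sketch (the class and method names are my own, and
"no letter or digit" as a punctuation test is an assumption), not an OpenNLP
API:

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationAwareChunks {

    // Heuristic: treat a token as punctuation if it contains no letter or digit.
    public static boolean isPunctuation(String token) {
        return !token.matches(".*[\\p{L}\\p{N}].*");
    }

    // Rebuild noun phrases from tokens and B-NP/I-NP/O chunk tags, but close
    // the current phrase whenever a punctuation token appears inside a chunk.
    public static List<String> extractNounPhrases(String[] tokens, String[] chunks) {
        List<String> phrases = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            boolean inNp = chunks[i].equals("B-NP") || chunks[i].equals("I-NP");
            if (inNp && !isPunctuation(tokens[i])) {
                // A fresh B-NP ends any phrase still being built.
                if (chunks[i].equals("B-NP") && current.length() > 0) {
                    phrases.add(current.toString());
                    current.setLength(0);
                }
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens[i]);
            } else if (current.length() > 0) {
                // Punctuation or a non-NP tag closes the current phrase.
                phrases.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            phrases.add(current.toString());
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Tokens and chunk tags copied from the output above.
        String[] tokens = {"dog", ",", "cat", ",", "fish", ",", "rat"};
        String[] chunks = {"B-NP", "I-NP", "I-NP", "I-NP", "I-NP", "O", "B-NP"};
        System.out.println(extractNounPhrases(tokens, chunks));
        // prints [dog, cat, fish, rat]
    }
}
```

On the example above this yields the four single-word phrases instead of
"dog , cat , fish", but it obviously cannot recover a phrase the chunker has
wrongly merged across a clause boundary, which is why I would prefer a fix at
the chunker level.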
Thanks,
Matt