Hi all,

I originally posted the below on StackOverflow, but have not received any
answers yet. This list seemed like a good place to go next.

I am using OpenNLP to extract noun phrases from documents. In reviewing the
output, I discovered that the phrase chunker ignores commas, leading to
noun phrases that combine, for instance, multiple elements of a list into
one phrase or two clauses in a sentence into one noun phrase. As a dummy
example:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class TestTokenizer {
        public static void main(String[] args) throws IOException {
            String content = "dog, cat, fish, rat";
            String[] tokens =
                NLPToolsControllerOpenNLP.getInstance().getTokeniser().tokenize(content);
            String[] pos =
                NLPToolsControllerOpenNLP.getInstance().getPosTagger().tag(tokens);
            String[] phrases =
                NLPToolsControllerOpenNLP.getInstance().getPhraseChunker().chunk(tokens, pos);
            // Print each token with its chunk tag (labelled "POS" in the output).
            for (int i = 0; i < tokens.length; i++) {
                System.out.println("Token: " + tokens[i] + " and POS: " + phrases[i]);
            }
            // Join each B-NP token with the I-NP tokens that follow it.
            List<String> candidates = new ArrayList<String>();
            String phrase = "";
            for (int n = 0; n < tokens.length; n++) {
                if (phrases[n].equals("B-NP")) {
                    phrase = tokens[n];
                    for (int m = n + 1; m < tokens.length; m++) {
                        if (phrases[m].equals("I-NP")) {
                            phrase = phrase + " " + tokens[m];
                        } else {
                            // Step back one so the outer n++ re-examines token m,
                            // which may itself be a B-NP.
                            n = m - 1;
                            break;
                        }
                    }
                    phrase = phrase.replaceAll("\\s+", " ").trim();
                    candidates.add(phrase);
                    System.out.println("phrase: " + phrase);
                }
            }
        }
    }

outputs:

    Token: dog and POS: B-NP
    Token: , and POS: I-NP
    Token: cat and POS: I-NP
    Token: , and POS: I-NP
    Token: fish and POS: I-NP
    Token: , and POS: O
    Token: rat and POS: B-NP
    phrase: dog , cat , fish
    phrase: rat

Parentheses have the same issue: because the chunker tags them with I-NP, I
end up with noun phrases containing them.
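
To make the problem concrete, here is a minimal sketch of the kind of post-processing that would split such chunks. It assumes that any token with no letters or digits (a comma, a parenthesis) should end the current noun phrase, whatever tag the chunker gave it. The names `ChunkSplitter`, `isPunctuation` and `extractPhrases` are hypothetical, and the code works on plain token and tag arrays rather than calling any OpenNLP API.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {

    // Hypothetical helper: true if the token contains no letter or digit,
    // e.g. ",", "(" or ")".
    static boolean isPunctuation(String token) {
        return !token.isEmpty()
                && token.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    // Collects noun phrases from BIO chunk tags, but ends the current phrase
    // at any punctuation token, even one tagged I-NP. This is an assumed
    // rule, not OpenNLP behaviour.
    static List<String> extractPhrases(String[] tokens, String[] tags) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            boolean inNp = tags[i].equals("B-NP") || tags[i].equals("I-NP");
            if (isPunctuation(tokens[i]) || !inNp) {
                flush(current, phrases);
            } else {
                if (tags[i].equals("B-NP")) {
                    flush(current, phrases); // a new chunk starts here
                }
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens[i]);
            }
        }
        flush(current, phrases);
        return phrases;
    }

    // Stores the phrase built so far, if any, and resets the buffer.
    private static void flush(StringBuilder current, List<String> phrases) {
        if (current.length() > 0) {
            phrases.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Tokens and tags copied from the output above.
        String[] tokens = {"dog", ",", "cat", ",", "fish", ",", "rat"};
        String[] tags = {"B-NP", "I-NP", "I-NP", "I-NP", "I-NP", "O", "B-NP"};
        System.out.println(extractPhrases(tokens, tags));
        // prints [dog, cat, fish, rat]
    }
}
```

A blanket rule like this would also split phrases where the punctuation legitimately belongs inside the noun phrase, which is why I would prefer a fix at the chunker level.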

The OpenNLP documentation says that "The OpenNLP Sentence Detector can
detect that a punctuation character marks the end of a sentence or not." As
such, I am a bit surprised that the phrase chunker cannot detect the use
of a comma or a parenthesis to mark the beginning or end of a phrase. Is
there something I am missing here? Is there another approach that I should
use? I am trying to avoid dealing with these issues on a case-by-case basis
in a large corpus.


Thanks,
Matt
