Depends on what you mean by "valid". Some other approaches:

1) Parse the sentences (using opennlp.tools.parse.Parser) with a carefully constructed model. Parsing will separate the sentences into phrases (noun, verb, preposition). The worst offenders will be marked with INC (incomplete).

2) If you know the specific language, you could tokenize each sentence and then count the words in the sentence that occur in that language but not in English. Based on some threshold, you could mark the sentence. Global WordNet might have a list of words that would be useful.

3) You could mark a sentence as non-English based on a threshold of non-English words. Wordnet.princeton.edu has a rather complete word list.

4) Sentences could be filtered by language based on the characters that fall outside of A-Z, numbers, spaces and English punctuation.

-----Original Message-----
From: Kalle Karlsson [mailto:[email protected]]
Sent: Tuesday, May 21, 2013 12:00 PM
To: [email protected]
Subject: Only detecting valid sentences?

I'm using Apache OpenNLP and I'm wondering whether it is possible to train the sentence detector to recognise only valid (in some sense) sentences and discard everything else. For example, let us say I have a document in English, but sprinkled inside that document there are sentences in another specific language, and I'd like to be able to detect only the sentences in that other language. Is that possible with the sentence detector? Thanks.
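
Approaches 3) and 4) from the reply above can be sketched in plain Java, without any OpenNLP calls. This is only a rough illustration under stated assumptions: the class name, the method names, the tiny ENGLISH_WORDS set (a stand-in for a real word list such as one exported from WordNet), the 0.5 threshold and the sample sentences are all hypothetical and not part of the original thread. Sentences are assumed to have been split already, for example by the sentence detector.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SentenceLanguageFilter {

    // Hypothetical stand-in for a real English word list (assumption, not from the thread).
    static final Set<String> ENGLISH_WORDS = new HashSet<>(Arrays.asList(
            "the", "cat", "sat", "on", "mat", "this", "is", "a", "sentence"));

    /** Approach 3: fraction of tokens that are not in the English word list. */
    static double nonEnglishWordRatio(String sentence, Set<String> englishWords) {
        // Split on runs of anything that is not a letter (in any script) or an apostrophe.
        String[] tokens = sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}']+");
        int total = 0;
        int unknown = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // a leading delimiter produces an empty first token
            }
            total++;
            if (!englishWords.contains(token)) {
                unknown++;
            }
        }
        return total == 0 ? 0.0 : (double) unknown / total;
    }

    /** Approach 4: fraction of characters outside printable ASCII and whitespace. */
    static double nonEnglishCharRatio(String sentence) {
        int total = 0;
        int outside = 0;
        for (int i = 0; i < sentence.length(); ) {
            int cp = sentence.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            // Printable ASCII covers A-Z, a-z, digits, space and English punctuation.
            boolean english = (cp >= 0x20 && cp <= 0x7E) || Character.isWhitespace(cp);
            if (!english) {
                outside++;
            }
        }
        return total == 0 ? 0.0 : (double) outside / total;
    }

    /** Marks a sentence as non-English when the unknown-word ratio exceeds the threshold. */
    static boolean isNonEnglish(String sentence, double threshold) {
        return nonEnglishWordRatio(sentence, ENGLISH_WORDS) > threshold;
    }

    public static void main(String[] args) {
        double threshold = 0.5; // hypothetical value; tune on real data
        String english = "The cat sat on the mat.";
        String other = "Katten sitter p\u00e5 mattan."; // arbitrary non-English sample
        System.out.println(isNonEnglish(english, threshold));
        System.out.println(isNonEnglish(other, threshold));
        System.out.println(nonEnglishCharRatio(other));
    }
}
```

The character ratio alone is a weak signal for languages that mostly use the Latin alphabet, so in practice it would probably be combined with the word-list check, and the threshold chosen by inspecting a sample of the actual documents.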
