It depends on what you mean by valid.
Some other approaches:
1) Parse the sentences (using opennlp.tools.parser.Parser) with a carefully 
constructed model. Parsing will separate each sentence into phrases (noun, 
verb, prepositional).
The worst offenders will be marked with INC (incomplete). 
2) If you know the specific language, you could tokenize each sentence and then 
count the words in the sentence that occur in that language but not English. 
Based on some threshold, you could mark the sentence. Global WordNet might have 
a list of words that would be useful.
3) You could mark a sentence as non-English based on a threshold of non-English 
words. Wordnet.princeton.edu has a rather complete word list.
4) Depending on the language, sentences could be filtered based on characters 
that fall outside of A-Z, numbers, spaces and English punctuation.
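
As a rough illustration of approaches 3 and 4, here is a minimal sketch in plain Java. It does not use OpenNLP at all. The class name NonEnglishFilter, its method names, the tiny word list and any threshold you pass in are all hypothetical; in practice the word set would be loaded from a real English word list, and the tokenization here is just a whitespace split.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class NonEnglishFilter {

    // Approach 3: fraction of words in the sentence that are missing
    // from the supplied English word list (0.0 for an empty sentence).
    static double unknownWordRatio(String sentence, Set<String> englishWords) {
        int total = 0;
        int unknown = 0;
        for (String token : sentence.split("\\s+")) {
            // Trim anything that is not a letter from both ends of the token.
            String word = token.replaceAll("^[^\\p{L}]+|[^\\p{L}]+$", "")
                               .toLowerCase(Locale.ROOT);
            if (word.isEmpty()) {
                continue; // token held only punctuation or digits
            }
            total++;
            if (!englishWords.contains(word)) {
                unknown++;
            }
        }
        return total == 0 ? 0.0 : (double) unknown / total;
    }

    // Marks the sentence once the unknown-word ratio reaches the threshold.
    static boolean isLikelyNonEnglish(String sentence, Set<String> englishWords,
                                      double threshold) {
        return unknownWordRatio(sentence, englishWords) >= threshold;
    }

    // Approach 4 (crude variant): true if any letter lies outside ASCII.
    // Non-ASCII punctuation such as curly quotes is deliberately ignored.
    static boolean hasNonAsciiLetters(String sentence) {
        return sentence.codePoints()
                       .anyMatch(cp -> cp > 127 && Character.isLetter(cp));
    }

    public static void main(String[] args) {
        // Toy word list; an assumption standing in for a real English lexicon.
        Set<String> english =
            new HashSet<>(Arrays.asList("the", "cat", "sat", "on", "mat"));
        String[] sentences = {
            "The cat sat on the mat.",
            "Katten satt p\u00e5 mattan." // \u00e5 is the letter a-with-ring
        };
        for (String s : sentences) {
            System.out.println(s
                + " ratio=" + unknownWordRatio(s, english)
                + " nonAsciiLetters=" + hasNonAsciiLetters(s));
        }
    }
}
```

The character check is cheap but only helps for languages whose alphabet goes beyond A-Z, and it will also flag English text containing loanwords with accents. The word-list check is more general but depends heavily on how complete the list is, so the threshold needs tuning on your own data.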
-----Original Message-----
From: Kalle Karlsson [mailto:[email protected]] 
Sent: Tuesday, May 21, 2013 12:00 PM
To: [email protected]
Subject: Only detecting valid sentences?

I'm using Apache OpenNLP and I'm wondering if it is possible to train the 
sentence detector to only recognise valid (in some sense) sentences and 
discard everything else? 
For example, let us say I have a document in English, but sprinkled inside that 
document there are sentences in another specific language, and I'd like to 
be able to detect only the sentences in that other language. 
Is that possible with the sentence detector?
Thanks.
