What you want is an elementary discourse unit detector. Here's an example paper that does this:

aclweb.org/anthology-new/N/N03/N03-1030.pdf

You could indeed do something like a sentence detector for this -- it's just that it is less obvious where you need to make decisions.

Jason

On Fri, May 25, 2012 at 7:09 PM, Lance Norskog <[email protected]> wrote:

> Right. I was thinking the Chunker against chunked text. Meta-chunking
> :) The chunker is designed for a small vocabulary while NER is
> designed for a large vocabulary. The Chunker is really slow,
> meta-chunking I'm sure slower. Maybe the sentence parser?
>
> The application does not have to be all that correct. If the tree
> parse (A(B(C),D) where these are clauses in order, CD would be fine.
> Overlapping sub-sentences is the goal.
>
> On Wed, May 23, 2012 at 12:04 AM, Svetoslav Marinov
> <[email protected]> wrote:
> > Take the longest NP chunks? There are NP chunker models for English.
> > The results from the English NP chunker are quite granular so maybe the
> > length (about 30 words) should steer this.
> >
> > Alternatively, you can use the parser and get the longest Nps there as
> > well which are children of a VP. Maybe also start with the very basic NP
> > VP NP construction from the parse tree. This should, hopefully, give
> > meaningful clauses.
> >
> > And then, probably a weird idea is to mimic a NER system. Just use the
> > input from a POS tagger in connection with a RegEx NER finder. Your regex
> > will work on POS sequences (e.g. DT JJ* NP).
> >
> > Hope this helps.
> >
> > Best,
> >
> > Svetoslav
> >
> >
> > On 2012-05-23 05:29, "Lance Norskog" <[email protected]> wrote:
> >
> >>I would like to take a long sentence, let's say 30 words, and find
> >>clauses maybe 10-20 words long that are somewhat self-contained blocks
> >>of text; complete sentences or nearly. These clauses can be
> >>overlapping. What is a good way to use OpenNLP's tools?
> >>
> >>The application is for document summarization via LSA. This technique
> >>needs to operate on coherent statements rather than very long
> >>sentences. Some of my test data is riddled with 30-50-word sentences,
> >>and they have overlapping clauses which are coherent statements of the
> >>document themes.
> >>
> >>--
> >>Lance Norskog
> >>[email protected]
> >>
> >
> >
> >
>
> --
> Lance Norskog
> [email protected]
>

--
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
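
For reference, the "regex over POS sequences" idea quoted above could be sketched roughly as follows. This is only an illustration, not OpenNLP code: it assumes the sentence has already been tagged by some POS tagger and is available as (token, tag) pairs, and it assumes Penn Treebank style tags (DT, JJ*, NN*) in place of the "DT JJ* NP" shorthand from the thread. The names NP_PATTERN and find_pos_spans are made up for the example.

```python
import re

# Hypothetical pattern over a space-terminated tag string: a determiner,
# any number of adjectives, then one or more nouns (Penn Treebank style
# tags are an assumption; adjust to whatever tagset the tagger emits).
NP_PATTERN = r"\bDT (?:JJ\S* )*(?:NN\S* )+"


def find_pos_spans(tagged, pattern=NP_PATTERN):
    """Return the token spans whose POS tag sequence matches `pattern`.

    `tagged` is a list of (token, tag) pairs, assumed to come from a
    POS tagger that was run beforehand.
    """
    # Encode the tag sequence as one string, each tag followed by a space,
    # so the regex can be written over tags instead of words.
    tags = "".join(tag + " " for _, tag in tagged)
    spans = []
    for match in re.finditer(pattern, tags):
        # Each tag contributes exactly one space, so counting spaces
        # before an offset gives the index of the token at that offset.
        first = tags[:match.start()].count(" ")
        last = tags[:match.end()].count(" ")
        spans.append(" ".join(token for token, _ in tagged[first:last]))
    return spans


if __name__ == "__main__":
    sentence = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"),
                ("fox", "NN"), ("jumped", "VBD"), ("over", "IN"),
                ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
    print(find_pos_spans(sentence))
```

This pattern only picks out simple noun phrases. Getting clause-sized spans of 10-20 words would need a longer pattern over the tags (or the parser-based approach), so treat it as a starting point rather than a solution.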
