What you want is an elementary discourse unit detector. Here's an example
paper that does this:

aclweb.org/anthology-new/N/N03/N03-1030.pdf

You could indeed build something like a sentence detector for this -- it's
just less obvious where the candidate decision points are.
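
To make the sentence-detector analogy concrete, here is a rough sketch in
Python, not anything taken from OpenNLP or from the paper above. It lists
candidate boundary positions inside a sentence and then makes a yes/no
decision at each one. Everything named here (CUE_WORDS,
candidate_boundaries, is_boundary, segment) is made up for illustration,
and the length rule in is_boundary is only a stand-in for a trained
classifier.

```python
import re

# Hypothetical cue list: words before which a clause boundary is plausible.
CUE_WORDS = {"and", "but", "because", "although", "which", "while", "so"}
PUNCT = {",", ";", ":"}


def candidate_boundaries(tokens):
    """Return indices i where a new segment could start at tokens[i]."""
    out = []
    for i in range(1, len(tokens)):
        if tokens[i - 1] in PUNCT or tokens[i].lower() in CUE_WORDS:
            out.append(i)
    return out


def is_boundary(tokens, i, last, min_len=3):
    """Stand-in for a trained classifier: both sides must be long enough."""
    return i - last >= min_len and len(tokens) - i >= min_len


def segment(sentence):
    """Split a sentence into consecutive token lists at accepted boundaries."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    segments, last = [], 0
    for i in candidate_boundaries(tokens):
        if is_boundary(tokens, i, last):
            segments.append(tokens[last:i])
            last = i
    segments.append(tokens[last:])
    return segments


if __name__ == "__main__":
    text = ("The parser is slow, but the chunker is fast "
            "because it uses a small model.")
    for seg in segment(text):
        print(" ".join(seg))
```

Joining adjacent segments would then give overlapping spans of whatever
length is wanted. The hard part is the decision step, which in a real
system would be learned from annotated data rather than hard-coded.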

Jason

On Fri, May 25, 2012 at 7:09 PM, Lance Norskog <[email protected]> wrote:

> Right. I was thinking of running the Chunker against already-chunked
> text. Meta-chunking :) The chunker is designed for a small vocabulary
> while NER is designed for a large vocabulary. The Chunker is really
> slow, and meta-chunking would surely be slower. Maybe the sentence parser?
>
> The application does not have to be all that correct. If the tree
> parse is (A(B(C),D)), where these are clauses in order, CD would be fine.
> Overlapping sub-sentences are the goal.
>
> On Wed, May 23, 2012 at 12:04 AM, Svetoslav Marinov
> <[email protected]> wrote:
> > Take the longest NP chunks? There are NP chunker models for English.
> > The results from the English NP chunker are quite granular so maybe the
> > length (about 30 words) should steer this.
> >
> > Alternatively, you can use the parser and take the longest NPs that
> > are children of a VP. Maybe also start with the very basic NP
> > VP NP construction from the parse tree. This should, hopefully, give
> > meaningful clauses.
> >
> > And then, a probably weird idea is to mimic an NER system. Just feed
> > the output of a POS tagger to a RegEx NER finder. Your regex
> > will work on POS sequences (e.g. DT JJ* NP).
> >
> > Hope this helps.
> >
> > Best,
> >
> > Svetoslav
> >
> >
> > On 2012-05-23 05:29, "Lance Norskog" <[email protected]> wrote:
> >
> >>I would like to take a long sentence, let's say 30 words, and find
> >>clauses maybe 10-20 words long that are somewhat self-contained blocks
> >>of text; complete sentences or nearly. These clauses can be
> >>overlapping. What is a good way to use OpenNLP's tools?
> >>
> >>The application is for document summarization via LSA. This technique
> >>needs to operate on coherent statements rather than very long
> >>sentences. Some of my test data is riddled with 30-50-word sentences,
> >>and they have overlapping clauses which are coherent statements of the
> >>document themes.
> >>
> >>--
> >>Lance Norskog
> >>[email protected]
> >>
> >
> >
>
>
>
> --
> Lance Norskog
> [email protected]
>



-- 
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
