Right. I was thinking of running the Chunker against already-chunked
text: meta-chunking :) The Chunker is designed for a small vocabulary,
while NER is designed for a large one. The Chunker is really slow, and
meta-chunking would surely be slower. Maybe the sentence parser instead?

The application does not have to be all that correct. If the parse
tree is A(B(C),D), where these are clauses in order, then CD would be
fine. Overlapping sub-sentences are the goal.
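
A rough sketch of what pulling overlapping sub-sentences out of a parse
could look like. It assumes the parser output is available as a Penn
Treebank style bracketed string in which every node has a label. The
function names, the label set (S, SBAR, VP), and the length bounds are
all placeholders for illustration, not anything OpenNLP provides:

```python
def parse_brackets(text):
    """Parse a bracketed tree string into (label, children) tuples.

    Leaves are plain word strings. Assumes every node has a label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words covered by a subtree, in sentence order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1]:
        words.extend(leaves(child))
    return words


def candidate_clauses(tree, labels=("S", "SBAR", "VP"),
                      min_words=10, max_words=20):
    """Collect the text of every labelled constituent within the bounds.

    Nested constituents are all kept, so the results overlap by design.
    """
    found = []

    def walk(t):
        if isinstance(t, str):
            return
        label, children = t
        words = leaves(t)
        if label in labels and min_words <= len(words) <= max_words:
            found.append(" ".join(words))
        for child in children:
            walk(child)

    walk(tree)
    return found


example = ("(S (NP (DT The) (NN parser)) (VP (VBZ finds) (SBAR (IN that) "
           "(S (NP (NNS clauses)) (VP (VBP overlap))))))")
# Tiny bounds here only because the toy sentence is short.
print(candidate_clauses(parse_brackets(example), min_words=2, max_words=4))
```

The defaults of 10 to 20 words match the clause length I asked about;
the accepted labels would need tuning against real parser output.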

On Wed, May 23, 2012 at 12:04 AM, Svetoslav Marinov
<[email protected]> wrote:
> Take the longest NP chunks? There are NP chunker models for English.
> The results from the English NP chunker are quite granular so maybe the
> length (about 30 words) should steer this.
>
> Alternatively, you can use the parser and get the longest NPs there as
> well which are children of a VP. Maybe also start with the very basic NP
> VP NP construction from the parse tree. This should, hopefully, give
> meaningful clauses.
>
> And then, a probably weird idea: mimic an NER system. Just use the
> output of a POS tagger as input to a RegEx NER finder. Your regex
> will work on POS sequences (e.g. DT JJ* NP).
>
> Hope this helps.
>
> Best,
>
> Svetoslav
>
>
> On 2012-05-23 05:29, "Lance Norskog" <[email protected]> wrote:
>
>>I would like to take a long sentence, let's say 30 words, and find
>>clauses maybe 10-20 words long that are somewhat self-contained blocks
>>of text; complete sentences or nearly. These clauses can be
>>overlapping. What is a good way to use OpenNLP's tools?
>>
>>The application is for document summarization via LSA. This technique
>>needs to operate on coherent statements rather than very long
>>sentences. Some of my test data is riddled with 30-50-word sentences,
>>and they have overlapping clauses which are coherent statements of the
>>document themes.
>>
>>--
>>Lance Norskog
>>[email protected]
>>
>
>



-- 
Lance Norskog
[email protected]
