Hi Jörn
  I spent the last couple of weeks studying how the OpenNLP parser does chunking
and how chunking is done separately in opennlp.tools.chunker. I came to the
conclusion that running an independently trained chunker on the parser's output
gives significantly higher accuracy of the resulting parse, and therefore makes
the 'similarity' component much more accurate.
Let's look at an example (I added the stars).
An NP and a VP are extracted, but what kills the similarity component is the last
part of the latter:
****to-TO drive-NN****
Parse Tree Chunk list = [NP [Its-PRP$ classy-JJ design-NN and-CC the-DT 
Mercedes-NNP name-NN ], VP [make-VBP it-PRP a-DT very-RB cool-JJ vehicle-NN 
****to-TO drive-NN**** ]]

When I apply the chunker, which has its own problems (but, most importantly, was
trained independently), I can then apply rules to fix these cases so that they
match other sub-VPs like 'to-VB'.
I understand that it runs slower that way.
I would propose that we have two versions of similarity: one that works without
the chunker and one that uses it (and perhaps an additional 'correction'
algorithm?).
I now have both versions, but only the latter passes the current tests.
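
To illustrate the kind of correction rule meant for the 'to-TO drive-NN' case, here is a minimal Java sketch. It is not the actual similarity code. The names InfinitiveTagFixer, TaggedToken, fix and the knownVerbs set are all hypothetical, and the sketch assumes the tokens arrive as plain word/tag pairs rather than OpenNLP objects. A noun tag is changed to VB only when the word directly follows a TO token and is in a caller-supplied set of known verbs, so that something like 'to-TO school-NN' is left alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: retags "to-TO <word>-NN" as "to-TO <word>-VB"
// when <word> is a known verb. Not part of OpenNLP.
class InfinitiveTagFixer {

    /** A token paired with its POS tag, printed as word-TAG (e.g. drive-NN). */
    static final class TaggedToken {
        final String word;
        final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return word + "-" + tag;
        }
    }

    /**
     * Returns a new list in which an NN token directly preceded by a TO token
     * is retagged VB, provided its lower-cased word is in knownVerbs.
     * The input list is not modified.
     */
    static List<TaggedToken> fix(List<TaggedToken> tokens, Set<String> knownVerbs) {
        List<TaggedToken> out = new ArrayList<>(tokens.size());
        for (int i = 0; i < tokens.size(); i++) {
            TaggedToken t = tokens.get(i);
            boolean afterTo = i > 0 && "TO".equals(tokens.get(i - 1).tag);
            boolean looksLikeVerb = knownVerbs.contains(t.word.toLowerCase());
            if (afterTo && "NN".equals(t.tag) && looksLikeVerb) {
                out.add(new TaggedToken(t.word, "VB"));
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

With knownVerbs containing "drive", the pair to-TO drive-NN becomes to-TO drive-VB, while to-TO school-NN stays unchanged. Where the verb list would come from is left open here.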
Regards,
Boris



> Date: Thu, 17 Nov 2011 19:49:50 +0100
> From: [email protected]
> To: [email protected]
> Subject: Re: any hints on how to get chunking info from Parse?
> 
> On 11/17/11 7:08 PM, Boris Galitsky wrote:
> > Yes, I will try
> >   opennlp.tools.parser.ChunkSampleStream
> > and meanwhile the question is: what is wrong with using
> > opennlp.tools.chunker ?
> 
> You are doing it then twice. The chunk information is already present inside
> the parse tree. So if you have a Parse object already, you should extract the
> chunk information from it instead of running the chunker again.
> 
> It is also harder to use, because a user then needs to provide you with a Parse
> object and a chunker instance. For the same reason it is harder to test as well.
> It will be slower because chunking needs to be done twice, and I guess there are
> a couple of more reasons why this is not the preferred solution.
> 
> Let me know if you need help.
> 
> Jörn
                                          
