Hi Jörn,

I spent the last couple of weeks understanding how the OpenNLP parser does chunking and how chunking is done separately in opennlp.tools.chunker. I came to the conclusion that applying an independently trained chunker to the parser's output gives significantly higher accuracy in the resulting parse, and therefore makes the 'similarity' component much more accurate as a result.

Let's look at an example (I added the stars): two chunks, an NP and a VP, are extracted, but what kills the similarity component is the last part of the latter: ****to-TO drive-NN****

Parse Tree Chunk list = [NP [Its-PRP$ classy-JJ design-NN and-CC the-DT Mercedes-NNP name-NN ], VP [make-VBP it-PRP a-DT very-RB cool-JJ vehicle-NN ****to-TO drive-NN**** ]]
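The mistag above ("drive" inside the infinitive "to drive" labelled NN instead of VB) is the kind of case a small post-hoc rule can repair. A minimal sketch of such a rule; the class name, the method fixInfinitive, and the retag-after-TO heuristic are my own illustration, not part of any OpenNLP API:

```java
// Hypothetical correction rule for chunk output: a noun-tagged token that
// directly follows a TO token is almost certainly an infinitive verb,
// so "to-TO drive-NN" is retagged to "to-TO drive-VB".
public class ChunkTagFix {
    public static String[] fixInfinitive(String[] tokens, String[] tags) {
        String[] fixed = tags.clone();
        for (int i = 1; i < tokens.length; i++) {
            // NN, NNS etc. right after TO -> treat as base-form verb
            if ("TO".equals(fixed[i - 1]) && fixed[i].startsWith("NN")) {
                fixed[i] = "VB";
            }
        }
        return fixed;
    }
}
```

Run on the VP from the example, this leaves "vehicle-NN" alone (it follows JJ, not TO) and only corrects "drive".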
When I apply the chunker, which has its own problems (but, most importantly, was trained independently), I can then apply rules to fix these cases so they match other sub-VPs like 'to-VB'. I understand it works slower that way. I would propose we have two versions of the similarity component: one that does without the chunker, and one that uses it (and possibly an additional 'correction' algorithm?). I now have both versions, but only the latter passes the current tests.

Regards
Boris

> Date: Thu, 17 Nov 2011 19:49:50 +0100
> From: [email protected]
> To: [email protected]
> Subject: Re: any hints on how to get chunking info from Parse?
>
> On 11/17/11 7:08 PM, Boris Galitsky wrote:
> > Yes, I will try
> > opennlp.tools.parser.ChunkSampleStream
> > and meanwhile the question is: what is wrong with using
> > opennlp.tools.chunker ?
>
> You are then doing it twice. The chunk information is already present
> inside the parse tree. So if you have a Parse object already, you should
> extract the chunk information from it instead of running the chunker again.
>
> It is also harder to use, because a user then needs to provide you with
> a Parse object and a chunker instance. For the same reason it is harder
> to test as well. It will be slower because chunking needs to be done
> twice, and I guess there are a couple of more reasons why this is not
> the preferred solution.
>
> Let me know if you need help.
>
> Jörn
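Jörn's suggestion (extracting the chunk information that is already in the parse tree instead of running the chunker a second time) can be sketched with a toy constituent tree. The Node class below is a self-contained stand-in for opennlp.tools.parser.Parse (which offers getType(), getChildren() and getCoveredText()); the traversal stops at the highest NP/VP node, since everything beneath it belongs to that chunk:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy constituent node standing in for opennlp.tools.parser.Parse.
class Node {
    final String type;           // "S", "NP", "VP", or a token for leaves
    final List<Node> children;
    Node(String type, Node... children) {
        this.type = type;
        this.children = Arrays.asList(children);
    }
    boolean isLeaf() { return children.isEmpty(); }
    // Covered text: the space-joined leaves under this node.
    String text() {
        if (isLeaf()) return type;
        StringBuilder sb = new StringBuilder();
        for (Node c : children) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(c.text());
        }
        return sb.toString();
    }
}

public class ChunksFromParse {
    // Return the text of the highest NP/VP constituents in the tree,
    // i.e. the chunk information already present in the parse.
    static List<String> chunks(Node n) {
        List<String> out = new ArrayList<>();
        collect(n, out);
        return out;
    }
    private static void collect(Node n, List<String> out) {
        if (n.type.equals("NP") || n.type.equals("VP")) {
            out.add(n.type + " [" + n.text() + "]");
            return; // stop descending: lower nodes belong to this chunk
        }
        for (Node c : n.children) collect(c, out);
    }
}
```

This gives chunk spans without a second chunker pass, at the cost of inheriting whatever chunk boundaries (and tagging mistakes) the parser committed to.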
