> > Sorry I'm a bit late to this discussion. I think it is fine to have a
> > default way that similarity is assessed, but it shouldn't be completely
> > hidden from the user that there may be other choices. For example, have
> > you done standard similarity based on the standard bag-of-words model?
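[Editor's note: the bag-of-words baseline Jason asks about is commonly implemented as cosine similarity over term-frequency vectors. The sketch below is a minimal, hypothetical illustration of that baseline, not the actual OpenNLP similarity component; the class and method names are invented for this example.]

```java
import java.util.HashMap;
import java.util.Map;

// Minimal bag-of-words cosine similarity baseline (hypothetical sketch).
public class BagOfWordsSimilarity {

    // Tokenize on whitespace and count term frequencies.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Cosine similarity between the two term-frequency vectors:
    // dot(a, b) / (|a| * |b|), in [0, 1].
    public static double similarity(String text1, String text2) {
        Map<String, Integer> a = termFreqs(text1);
        Map<String, Integer> b = termFreqs(text2);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Such a surface-level measure ignores syntax entirely, which is exactly why it serves as the baseline against which chunk- and tree-based similarity can be compared.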
I've got the code; it is pretty basic, and it will also do bag-of-words. I
will add it.

> > That's at least needed as a baseline to see whether the chunks and tree
> > structures are helping. (Sorry in advance if this is not addressing the
> > questions and such -- I seem to be missing some context in the
> > discussion.)
> >
> > FWIW, it is entirely possible for a chunker to produce better local
> > structure than a full parser. This is pretty well known in the
> > dependency parsing literature (e.g. see the comparison of MaltParser
> > and MSTParser by Nivre and McDonald). Also, if you want unsupervised
> > chunks, you might check out work that Elias Ponvert, Katrin Erk, and I
> > did on using HMMs for this (and cascading them to get full parses).
> > Code and paper available here:
> >
> > http://elias.ponvert.net/upparse

Yes, will take a look.

Regards
Boris

> > Jason
> >
> > On Fri, Dec 2, 2011 at 10:12 AM, Boris Galitsky <[email protected]> wrote:
> >
> > > My philosophy for the similarity component is that an engineer
> > > without a background in linguistics can do text processing. He/she
> > > would install OpenNLP and call the assessRelevance(text1, text2)
> > > function, without any knowledge of what is happening inside. That
> > > would significantly extend the user base of OpenNLP.
> > >
> > > The problem domains I used for illustration are search (a standard
> > > domain for linguistic apps) and content generation (a
> > > state-of-the-art technology, in my opinion). Again, to incorporate
> > > these into user apps, users do not need to know anything about
> > > parsing, chunking, etc.
> > >
> > > Regards
> > > Boris
> > >
> > > > Date: Fri, 2 Dec 2011 13:10:23 +0100
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Subject: Re: any hints on how to get chunking info from Parse?
> > > > On 12/1/11 8:08 PM, Boris Galitsky wrote:
> > > > > I spent the last couple of weeks understanding how the OpenNLP
> > > > > parser does chunking and how chunking occurs separately in
> > > > > opennlp.tools.chunker, and I came to the conclusion that using an
> > > > > independently trained chunker on the results of the parser gives
> > > > > significantly higher accuracy of the resultant parsing, and
> > > > > therefore makes the 'similarity' component much more accurate as
> > > > > a result.
> > > > >
> > > > > Let's look at an example (I added stars): two NP & VP chunks are
> > > > > extracted, but what kills the similarity component is the last
> > > > > part of the latter:
> > > > >
> > > > > ****to-TO drive-NN****
> > > > >
> > > > > Parse Tree Chunk list = [NP [Its-PRP$ classy-JJ design-NN and-CC
> > > > > the-DT Mercedes-NNP name-NN ], VP [make-VBP it-PRP a-DT very-RB
> > > > > cool-JJ vehicle-NN ****to-TO drive-NN**** ]]
> > > > >
> > > > > When I apply the chunker, which has its own problems (but, most
> > > > > importantly, was trained independently), I can then apply rules
> > > > > to fix these cases for matching with other sub-VPs like 'to-VB'.
> > > > > I understand it works slower that way.
> > > > >
> > > > > I would propose we have two versions of similarity: one that does
> > > > > without the chunker, and one which uses it (and also an
> > > > > additional 'correction' algo?). I have both versions now, but
> > > > > only the latter passes the current tests.
> > > >
> > > > Ok, sounds good to me, but we should assume that the user can run
> > > > the parser and chunker themselves. Your similarity component simply
> > > > accepts a parse tree in one case, and a parse tree plus chunks in
> > > > the other case.
> > > >
> > > > What do you think?
> > > >
> > > > Jörn

--
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
