Sorry I'm a bit late to this discussion. I think it is fine to have a default way that similarity is assessed, but it shouldn't be completely hidden from the user that there may be other choices. For example, have you tried standard similarity based on the bag-of-words model? That's at least needed as a baseline to see whether the chunks and tree structures are helping. (Sorry in advance if this is not addressing the questions -- I seem to be missing some context in the discussion.)
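To make the bag-of-words baseline suggestion concrete, here is a minimal sketch. The tokenizer and the names `bow_vector` / `bow_similarity` are illustrative assumptions, not OpenNLP APIs; a real baseline would use a proper tokenizer.

```python
import math
import re
from collections import Counter

def bow_vector(text):
    """Bag-of-words term counts from a naive lowercase regex tokenizer."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def bow_similarity(text1, text2):
    """Cosine similarity between the two bag-of-words count vectors."""
    v1, v2 = bow_vector(text1), bow_vector(text2)
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Identical texts score 1.0, texts with no shared words score 0.0, which makes it an easy yardstick for checking whether chunk/tree-based similarity is adding anything.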
FWIW, it is entirely possible for a chunker to produce better local structure than a full parser. This is well known in the dependency parsing literature (e.g. see the comparison of MaltParser and MSTParser by Nivre and McDonald). Also, if you want unsupervised chunks, you might check out work that Elias Ponvert, Katrin Erk, and I did on using HMMs for this (and cascading them to get full parses). Code and paper available here: http://elias.ponvert.net/upparse

Jason

On Fri, Dec 2, 2011 at 10:12 AM, Boris Galitsky <[email protected]> wrote:

> My philosophy for the similarity component is that an engineer without a
> background in linguistics can do text processing.
> He/she would install OpenNLP and call the assessRelevance(text1, text2)
> function, without any knowledge of what is happening inside.
> That would significantly extend the user base of OpenNLP.
> The problem domains I used for illustration are search (a standard domain
> for linguistic apps) and content generation (a state-of-the-art
> technology, in my opinion). Again, to incorporate these into user apps,
> users do not need to know anything about parsing, chunking, etc.
>
> Regards,
> Boris
>
> > Date: Fri, 2 Dec 2011 13:10:23 +0100
> > From: [email protected]
> > To: [email protected]
> > Subject: Re: any hints on how to get chunking info from Parse?
> >
> > On 12/1/11 8:08 PM, Boris Galitsky wrote:
> > > I spent the last couple of weeks understanding how the OpenNLP parser
> > > does chunking and how chunking occurs separately in
> > > opennlp.tools.chunker, and I came to the conclusion that using an
> > > independently trained chunker on the results of the parser gives
> > > significantly higher accuracy of the resultant parsing, and therefore
> > > makes the 'similarity' component much more accurate as a result.
> > > Let's look at an example (I added stars):
> > > Two chunks, an NP and a VP, are extracted, but what kills the
> > > similarity component is the last part of the latter:
> > > ****to-TO drive-NN****
> > > Parse Tree Chunk list = [NP [Its-PRP$ classy-JJ design-NN and-CC
> > > the-DT Mercedes-NNP name-NN ], VP [make-VBP it-PRP a-DT very-RB
> > > cool-JJ vehicle-NN ****to-TO drive-NN**** ]]
> > >
> > > When I apply the chunker, which has its own problems (but most
> > > importantly was trained independently), I can then apply rules to fix
> > > these cases for matching with other sub-VPs like 'to-VB'.
> > > I understand it works slower that way.
> > > I would propose we have two versions of similarity: one that does
> > > without the chunker, and one which uses it (and also an additional
> > > 'correction' algo?).
> > > I have both versions now, but only the latter passes the current
> > > tests.
> >
> > Ok, sounds good to me, but we should assume that the user can run the
> > parser and chunker themselves. Your similarity component simply accepts
> > a parse tree in one case and a parse tree plus chunks in the other case.
> >
> > What do you think?
> >
> > Jörn

--
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
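The proposed interface (a parse tree in one case, a parse tree plus chunks in the other) leaves open how two chunk lists get compared. Purely as a hypothetical illustration -- this is not the similarity component's actual algorithm -- one simple scheme represents each chunk as a list of (word, tag) pairs and greedily matches chunks by shared tokens:

```python
def chunk_similarity(chunks1, chunks2):
    """Greedy best-match token overlap between two parse-tree chunk lists.

    Each chunk is a list of (word, tag) pairs. Every chunk in chunks1 is
    greedily paired with the remaining chunk in chunks2 that shares the
    most tokens, and the total overlap is normalized to [0, 1].
    Hypothetical sketch only, not an OpenNLP API.
    """
    total = sum(len(c) for c in chunks1) + sum(len(c) for c in chunks2)
    if total == 0:
        return 0.0
    matched = 0
    remaining = [set(c) for c in chunks2]
    for chunk in chunks1:
        overlaps = [len(set(chunk) & r) for r in remaining]
        if overlaps and max(overlaps) > 0:
            best = overlaps.index(max(overlaps))
            matched += 2 * max(overlaps)  # shared tokens count on both sides
            remaining.pop(best)
    return matched / total
```

Under a scheme like this it is easy to see why a mis-tagged token such as 'to-TO drive-NN' hurts: '(drive, NN)' fails to match '(drive, VB)' in an otherwise identical VP, so a correction pass over the chunker output directly raises the score.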
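On the HMM-chunking pointer: the core decoding idea can be sketched as Viterbi over B/I/O chunk states emitting POS tags. The probability tables below are hand-set toy values chosen for illustration (an assumption of this sketch; the upparse models are learned without supervision and cascaded to build full parses).

```python
import math

STATES = ["B", "I", "O"]  # begin-chunk, inside-chunk, outside any chunk

# Hand-set toy parameters: chunks tend to start on determiners and
# continue over adjectives/nouns, while verbs fall outside chunks.
START = {"B": 0.5, "I": 0.01, "O": 0.49}
TRANS = {
    "B": {"B": 0.1, "I": 0.6, "O": 0.3},
    "I": {"B": 0.1, "I": 0.5, "O": 0.4},
    "O": {"B": 0.5, "I": 0.01, "O": 0.49},
}
EMIT = {  # P(POS tag | chunk state)
    "B": {"DT": 0.5, "JJ": 0.2, "NN": 0.25, "VB": 0.05},
    "I": {"DT": 0.05, "JJ": 0.35, "NN": 0.55, "VB": 0.05},
    "O": {"DT": 0.05, "JJ": 0.05, "NN": 0.1, "VB": 0.8},
}

def viterbi_chunk(pos_tags):
    """Most probable B/I/O sequence under the toy HMM (log-space Viterbi)."""
    scores = [{s: math.log(START[s] * EMIT[s][pos_tags[0]]) for s in STATES}]
    backptrs = []
    for tag in pos_tags[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = scores[-1][prev] + math.log(TRANS[prev][s] * EMIT[s][tag])
            ptr[s] = prev
        scores.append(row)
        backptrs.append(ptr)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```

With these toy parameters, a DT-JJ-NN prefix decodes as a B-I-I chunk with the trailing VB tagged O, which is the shape of local structure a chunker recovers without committing to a full tree.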
