On 2014-04-08 14:44, Daniel Naber wrote:
> On 2014-04-08 08:43, Marcin Miłkowski wrote:
>
>>> Internally, we now have information like this: postag=VBD, pos=verb,
>>> tense=past (etc.). But the disambiguation only works on the old tag? I
>>> guess I will need to resolve VBD here so the action works on both the
>>> old and the new representation?
>>
>> I think the only thing needed is to parse the tags again, if they are
>> different.
>
> Although I'm not sure if I understood what you meant, I have now added a
> branch ("readable-pos-tags") for this, simply because the changes are
> getting so complex. It's still incomplete and buggy.
>
> Here's the basic idea of my changes in that branch: class TokenPoS is
> the new structured representation of POS tags. EnglishTagger returns one
> or more TokenPoS for a given traditional POS tag (like NNS). More than
> one will be returned in cases that are ambiguous in the new
> representation, e.g. "walk/VBP" can be person=1|2 number=singular and
> person=1|2|3 number=plural. Each AnalyzedToken has one TokenPoS.
>
> Currently the problem is this (when running the tests):
> Caused by: org.xml.sax.SAXException: English rule error. The number of
> interpretations specified with wd: 5 must be equal to the number of
> matched tokens (1)
> Line: 1525, column: 12.
>
> I roughly understand what the problem is but not yet the solution... any
> help is welcome, also any hints that what I'm doing in that branch might
> be wrong.

Hm, line 1525 is:

</rule>

So I'm not sure what the problem is. Basically, the number of <wd> elements has to correspond to the number of tokens inside the <marker> element. And that's it.

I thought that TokenPoS should be initialized after tagging and then updated any time the disambiguator makes a change to AnalyzedToken.posTag. So whenever the disambiguator changes the values of the token, you need to make sure that the TokenPoS is up to date by re-running the POS tag parser. What's the problem with this approach?

I assume that we are not (yet) talking about disambiguation rules that change only TokenPoS. That case is a bit trickier, as some tags might be more ambiguous than TokenPoS values, so we'd have to leave those ambiguous tags alone and prune or change only the TokenPoS values. Luckily, I think only the Penn tagset is so ambiguous. Structured tagsets (such as the German Morphy tagset or the Polish one) should be much easier to interpret.

Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
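
For illustration only, the re-parsing approach described in the reply above can be sketched in a few lines of Python. This is not LanguageTool's Java code: the `Token` class, the `parse_pos_tag` function, the `PENN_TO_STRUCTURED` table, and the feature names are all hypothetical stand-ins, and the sketch keeps a list of readings on one token for brevity, whereas the branch reportedly attaches one TokenPoS per AnalyzedToken. The only point is that every write to the traditional tag re-runs the tag parser, so the structured view cannot go stale.

```python
# Hypothetical, deliberately tiny mapping from Penn tags to structured readings.
# The real branch may use different feature names and values.
PENN_TO_STRUCTURED = {
    "VBD": [{"pos": "verb", "tense": "past"}],
    "VBP": [
        {"pos": "verb", "tense": "present", "person": "1|2", "number": "singular"},
        {"pos": "verb", "tense": "present", "person": "1|2|3", "number": "plural"},
    ],
    "NNS": [{"pos": "noun", "number": "plural"}],
}


def parse_pos_tag(tag):
    """Return copies of the structured readings for a tag (empty list if unknown)."""
    return [dict(reading) for reading in PENN_TO_STRUCTURED.get(tag, [])]


class Token:
    """Stand-in for a token whose structured readings always follow its tag."""

    def __init__(self, text, pos_tag):
        self.text = text
        self.pos_tag = pos_tag  # goes through the setter below

    @property
    def pos_tag(self):
        return self._pos_tag

    @pos_tag.setter
    def pos_tag(self, tag):
        # Every change to the traditional tag re-runs the tag parser,
        # which is what a disambiguation action would trigger.
        self._pos_tag = tag
        self.structured = parse_pos_tag(tag)


token = Token("walk", "VBP")
print(len(token.structured))  # prints 2: VBP is ambiguous in the structured view
token.pos_tag = "VBD"         # simulate the disambiguator replacing the tag
print(token.structured)       # prints [{'pos': 'verb', 'tense': 'past'}]
```

Routing all tag changes through a single setter is one way to guarantee the re-parse happens. It does not cover the harder case mentioned in the reply, where a rule changes only the structured values and the traditional tag has to stay ambiguous.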