On 2014-04-08 14:44, Daniel Naber wrote:
> On 2014-04-08 08:43, Marcin Miłkowski wrote:
>
>>> Internally, we now have information like this: postag=VBD, pos=verb,
>>> tense=past (etc.). But the disambiguation only works on the old tag? I
>>> guess I will need to resolve VBD here so the action works on both the
>>> old and the new representation?
>>
>> I think the only thing needed is to parse the tags again, if they are
>> different.
>
> Although I'm not sure if I understood what you meant, I have now added a
> branch ("readable-pos-tags") for this, simply because the changes are
> getting so complex. It's still incomplete and buggy.
>
> Here's the basic idea of my changes in that branch: class TokenPoS is
> the new structured representation of POS tags. EnglishTagger returns one
> or more TokenPoS for a given traditional POS tag (like NNS). More than
> one will be returned in cases that are ambiguous in the new
> representation, e.g. "walk/VBP" can be person=1|2 number=singular and
> person=1|2|3 number=plural. Each AnalyzedToken has one TokenPoS.
>
> Currently the problem is this (when running the tests):
> Caused by: org.xml.sax.SAXException: English rule error. The number of
> interpretations specified with wd: 5 must be equal to the number of
> matched tokens (1)
>    Line: 1525, column: 12.
>
> I roughly understand what the problem is but not yet the solution... any
> help is welcome, also any hints that what I'm doing in that branch might
> be wrong.

Hm, line 1525 is:

     </rule>

So I'm not sure what the problem is. Basically, the number of <wd>
elements has to equal the number of tokens inside the <marker>
element. And that's it.
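
To make that constraint concrete, here is a minimal sketch in Java. It
is not the actual rule loader code: the class and method names are
made up for illustration, and only the wording of the message follows
the exception quoted above.

```java
/** Hypothetical helper, NOT LanguageTool's real rule loader. */
final class WdCountCheck {

    /**
     * Rejects a rule whose number of wd elements differs from the
     * number of tokens inside the marker element.
     */
    static void check(int wdCount, int markedTokenCount) {
        if (wdCount != markedTokenCount) {
            throw new IllegalArgumentException(
                "The number of interpretations specified with wd: " + wdCount
                + " must be equal to the number of matched tokens ("
                + markedTokenCount + ")");
        }
    }
}
```

With the numbers from your stack trace, check(5, 1) would throw, so
the rule either has too many <wd> elements or its <marker> covers
fewer tokens than intended.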

I thought that TokenPoS should be initialized after tagging and then
updated every time the disambiguator changes AnalyzedToken.posTag. So
whenever the disambiguator changes a token's values, you need to make
sure that the TokenPoS is up to date by re-running the POS tag parser.
What's the problem with this approach? I assume that we are not (yet)
talking about disambiguation rules that change only TokenPoS. That
case is a bit more tricky: some tags may be more ambiguous than the
TokenPoS values, so we'd have to leave those ambiguous tags alone and
prune or change only the TokenPoS values. Luckily, I think only the
Penn tagset is that ambiguous. Structured tagsets (such as the German
Morphy one or the Polish one) should be much easier to interpret.
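
Here is a rough sketch in Java of what I mean by re-running the POS
tag parser. Everything in it is hypothetical: Reading and
PennTagParser are stand-ins I made up, not the real AnalyzedToken or
the TokenPoS class from the branch, and the structured values are
plain maps. The two tags covered are just the examples from this
thread.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical parser from a Penn tag to structured interpretations. */
final class PennTagParser {
    static List<Map<String, String>> parse(String tag) {
        switch (tag) {
            case "VBD":
                return List.of(Map.of("pos", "verb", "tense", "past"));
            case "VBP":
                // One Penn tag, two structured interpretations.
                return List.of(
                    Map.of("pos", "verb", "person", "1|2", "number", "singular"),
                    Map.of("pos", "verb", "person", "1|2|3", "number", "plural"));
            default:
                // Tags not covered by this sketch.
                return List.of(Map.of("pos", "unknown"));
        }
    }
}

/** Hypothetical stand-in for a token reading; not the real AnalyzedToken. */
final class Reading {
    private String posTag;
    private List<Map<String, String>> structured;

    Reading(String posTag) {
        setPosTag(posTag);
    }

    /** The only way to change the tag, so the structured view cannot go stale. */
    void setPosTag(String newTag) {
        this.posTag = newTag;
        this.structured = PennTagParser.parse(newTag);
    }

    String getPosTag() {
        return posTag;
    }

    List<Map<String, String>> getStructured() {
        return structured;
    }
}
```

If the disambiguator calls setPosTag("VBD") on a reading that was
tagged VBP, the two structured interpretations are replaced by the
single one for VBD, and nothing else has to know about it.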

Regards,
Marcin
