On 2017-07-01 11:00, Kevin Brubeck Unhammer wrote:
Francis Tyers <fty...@prompsit.com> wrote:

On 2017-07-01 09:18, Kevin Brubeck Unhammer wrote:
Jaume Ortolà i Font <jaumeort...@gmail.com> wrote:

[...]

This could be solved differently. I think these contractions should
be tokenized earlier in the pipeline as two tokens. This way we would
avoid a lot of exceptions and workarounds when dealing with them.
Is it feasible? These contractions are extremely frequent and now they
cause a lot of undesired results.

Yeah, you could split them early too. If "del" isn't ambiguous then you
don't really gain much by keeping it as one lexical unit.

How would you do that? :)

The easiest way I can think of would be just to add a pre-disambiguation
CG that does nothing but split "<del>" (ADDCOHORT/REMCOHORT).
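
To make the intended effect concrete, here is a minimal sketch in Python rather than CG. It only illustrates what "split early" would do to a token sequence; it is not the proposed CG grammar, and the CONTRACTIONS table and the split_contractions name are hypothetical, invented for this illustration.

```python
# Hypothetical contraction table, for illustration only; the real
# inventory would come from the language pair's own data.
CONTRACTIONS = {
    "del": ("de", "el"),
    "al": ("a", "el"),
}


def split_contractions(tokens):
    """Return a new list in which each known contraction is replaced
    by its two component tokens; all other tokens pass through."""
    out = []
    for tok in tokens:
        # Lookup is case-insensitive; the emitted parts are lowercase.
        parts = CONTRACTIONS.get(tok.lower())
        if parts is None:
            out.append(tok)
        else:
            out.extend(parts)
    return out


print(split_contractions(["la", "casa", "del", "poble"]))
```

Everything downstream of such a step would then see "de" and "el" as separate tokens and would need no special handling for the contracted form.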

Hmm, I'm not sure I like this so much, especially if we have e.g. "del" as part of a proper name (which we don't at the moment, but it is conceivable).

But whatever makes it easier :)

F.

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff