On 06.04.2016 at 14:55, Juan Martorell wrote:
>
> On 4 April 2016 at 19:28, Marcin Miłkowski <list-addr...@wp.pl
> <mailto:list-addr...@wp.pl>> wrote:
>
>     Hi,
>
>     On 03.04.2016 at 12:46, Juan Martorell wrote:
>     > I realized this because every rule I added introduced a new
>     > regression, sometimes greatly worsening what we had before.
>
>     I could try to help avoid this with some disambiguation tricks.
>
> Please update the wiki so everyone can benefit from them.
Heh, the trouble is that it's mostly implicit knowledge. I'll try to
write up some strategies.

>     > I therefore blame the dictionary for all this, and no good
>     > disambiguation can be done without decent tagging. I am tired of
>     > waiting for someone else to volunteer; every time someone shows
>     > up, she seems intimidated by the task and eventually loses
>     > interest.
>
>     What's the problem with the dictionary?
>
>     (1) It assigns too many POS tags, making disambiguation difficult.
>     (2) It lacks important POS tags, so disambiguation cannot help.
>
>     If (1), I can really help by writing up some methods I have found
>     for Polish and English. I can read Spanish, so this should be
>     fairly easy.
>
> Mostly (2). Freeling tends to keep the lexical roots and calculate the
> inflections, so the dictionary is rather incomplete.

That means you should expand the dictionary a lot, IMHO.

> To transform an adjective into an adverb, in English you use the
> suffix `-ly` and in Spanish you use the suffix `-mente`:
>
> Equal --> equally
> Igual --> igualmente
>
> I found 18,340 candidates for suffixation in the Spanish dictionary
> for this particular case.

I'd add them to the dictionary. Why? Because these things might be false
alarms, and removing them later by hand might be easier.

> The same goes for diminutives, augmentatives and superlatives.
> Depending on the region these may vary, but if you want to be fully
> inclusive <https://es.wikipedia.org/wiki/Diminutivo>, you have to
> include 17 diminutives, both genders; 9 augmentatives, both genders;
> and 1 superlative, both genders, excluding the irregular forms
> <https://es.wikipedia.org/wiki/Superlativo>. They apply to the same
> ~18,000 candidates. They are widely used in writing, so it is worth
> including them.

Well, I don't really tag diminutives (except for the most frequent
ones, which are already included in the dictionary), and it doesn't
really hurt.
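To make the suffixation idea concrete, here is a toy sketch in Python.
The rules are deliberately simplified (no accent shifts, no irregular
stems, only a handful of the suffixes from the Wikipedia list), so the
output should be treated as raw candidates to filter, not as
dictionary-ready forms:

```python
# Toy generator for -mente adverbs and a few diminutive/augmentative
# forms. Simplified rules; real Spanish morphology needs more care.

def adverb_from_adjective(adj):
    # -mente attaches to the feminine form: rápido -> rápida -> rápidamente
    if adj.endswith("o"):
        adj = adj[:-1] + "a"
    return adj + "mente"

def diminutive_candidates(adj):
    # A small subset of the 17 diminutive / 9 augmentative suffixes
    # mentioned above, both genders:
    stem = adj[:-1] if adj[-1] in "oa" else adj
    return [stem + s for s in ("ito", "ita", "illo", "illa", "ote", "ota")]

print(adverb_from_adjective("igual"))   # igualmente
print(adverb_from_adjective("rápido"))  # rápidamente
print(diminutive_candidates("alto"))    # ['altito', 'altita', ...]
```

Running something like this over the 18,340 candidates would produce the
raw list; the filtering step is the interesting part.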
> It is quite common to attach pronouns to the verb, thus including
> information about the direct and/or indirect object, or the
> passive/impersonal voice. The number of combinations is huge:
>
> infinitive + pronoun as DO; example, from /subir/: /subir*me*/,
> /subir*te*/, /subir*se*/, /subir*lo*/, /subir*la*/, /subir*nos*/,
> /subir*os*/, /subir*se*/, /subir*los*/, /subir*las*/.
> infinitive + pronoun as IO + pronoun as DO; example, from /subir/:
> /subír*teme*/, /subír*seme*/, /subír*mete*/, /subír*sete*/,
> /subír*senos*/, /subír*noslos*/, /subír*oslas*/, etc.
> imperative + pronoun as DO; example, from /subir/: /súbe*me*/,
> /subid*me*/, /súba*me*/, /súban*me*/; /súbe*te*/, /subí*os*/,
> /súba*se*/, /súban*se*/, etc.
> imperative + pronoun as IO + pronoun as DO; example, from /subir/:
> /súbe*melo*/, /súbe*telo*/, /súbe*teme*/, /subí*oslas*/, etc.

I'd go for Jaume's strategies with Catalan. They are probably exactly
suited to your situation. I would tokenize this internally if it doesn't
lead to any ambiguity (you don't need a space to tokenize). I don't do
this for Polish because we have lots of ambiguities: "miałem" might be
the past tense of "mieć" ("miał" + "em", with "em" being the agglutinate
for the first person singular), but it's also the instrumental of the
noun "miał", which shouldn't be tokenized. We have a stream of tokens,
and we would need to replace it with a graph (one edge for "miałem",
another for "miał" + "em"), which is not exactly the nicest thing to
play with. So I don't tokenize but keep a non-tokenized list hardcoded
in the dictionary.

> Gerunds accept the same derivation.
>
> These derivations are enough in themselves to justify some automation:
> so far, 18,000 adjectives * (1 adverb + 17 * 2 diminutives + 9 * 2
> augmentatives + 1 superlative) =~ 972,000 words to include in the
> dictionary. If you add all the pronominal derivations: 7,654 verbs *
> (10 infinitive+DO forms + 6!/(2!(6-2)!) = 15 IO+DO forms) * 3 verbal
> forms = 7,654 * (10 + 15) * 3 =~ 574,000 words to include in the
> dictionary.
>
> That makes a total of approx. 1.5 million words to include, excluding
> the American inflections.

That's a very small number overall.

> Today the dictionary holds ~660,000 words.

The Polish dictionary has around 4,500,000 words. But hey, we have
crazily inflected nouns ;)

> Other derivations imply prefixing, like /*re-*/ or /*anti-*/. These
> can be applied to verbs (all conjugations), adjectives and nouns:
> /*re*iniciar/, /*anti*personal/, /*re*calificar/, /*re*incidente/,
> /*re*forestar/. Some of these are present in the dictionary, some are
> not; I will not include them in the count, but it may be relevant.

I only include productive prefixes for spelling, not for tagging. As
Jaume mentioned, this may lead to a large number of false matches.

>     Let's discuss this in depth. I'd love to see Spanish better
>     supported.
>
> I see an opportunity in using a word database with frequencies. It
> would shortlist some potential exclusions. It won't be too complex
> using the Wikipedia dump.
>
> BTW, I see that the Wikipedia dump has some bias. It is mainly written
> in an indirect style, and its neutrality greatly reduces the use of
> augmentatives, diminutives, superlatives and imperatives. The
> subjunctive and conditional are also less likely to occur. I was
> looking for some stylistic variation and I came across Wikisource
> <https://en.wikisource.org/wiki/Wikisource:What_is_Wikisource%3F>. It
> has some bias as well, such as texts being old due to copyright, or
> old-fashioned language, but there is an opportunity here, since it has
> many kinds of documents, from legal texts to science fiction novels,
> original or translated texts, etc.

I used our indexer on a large newspaper corpus, and on a corpus of
literary works. I agree, Wikipedia is strongly biased. I also have a
large English blog corpus that I downloaded from somewhere, also indexed
on my hard disk.
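By the way, the back-of-the-envelope numbers above check out; here is a
small Python sketch that re-derives them. All the counts (18,000
adjectives, 7,654 verbs, 10 clitic forms, 3 verb forms) are taken from
the mail itself, not from the actual dictionary:

```python
from math import comb

# Enclitic pronouns that can attach to an infinitive as direct object
# (the mail counts 10 forms, /se/ serving both singular and plural):
do_clitics = ["me", "te", "se", "lo", "la", "nos", "os", "los", "las"]
print([f"subir{c}" for c in do_clitics][:4])
# ['subirme', 'subirte', 'subirse', 'subirlo']

io_do_pairs = comb(6, 2)        # 6!/(2!(6-2)!) = 15 IO+DO combinations

verb_forms = 7654 * (10 + io_do_pairs) * 3   # 3 verb forms take clitics
adj_forms = 18000 * (1 + 17 * 2 + 9 * 2 + 1) # adverb + dim. + augm. + superl.
print(verb_forms)                # 574050   (~574,000)
print(adj_forms)                 # 972000
print(verb_forms + adj_forms)    # 1546050  (~1.5 million)
```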
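As for shortlisting by frequency: something along these lines (a toy
sketch; the `corpus` string stands in for a real Wikipedia/Wikisource
dump, and `candidates` for the mechanically generated forms) would
already do as a first pass:

```python
from collections import Counter
import re

# Generate candidate forms mechanically, then keep only those that are
# actually attested in a corpus.
corpus = "Subirse al tren es igualmente rápido. Quiso subirse otra vez."
candidates = ["subirse", "subirmela", "igualmente", "rapidisimamente"]

freqs = Counter(re.findall(r"\w+", corpus.lower()))
attested = [w for w in candidates if freqs[w] > 0]
print(attested)   # ['subirse', 'igualmente']
```

With a real dump you'd also want a frequency threshold rather than mere
attestation, to keep out typos that happen to look like valid
derivations.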
> I tested the latest dump, eswikisource-20160305-pages-articles.xml
> (~175 MB compressed, ~774 MB uncompressed), and it seems to proofread
> OK with the regular wikimedia checker.
> I encourage you to check your language in the Wikimedia backup folder
> <http://dumps.wikimedia.org/backup-index.html> and test it. You will
> find an immense English corpus, and respectable sizes for German,
> Polish, Catalan and so on.

Thanks for the pointer. The more data, the better.

> So growing the dictionary may improve speed at the cost of memory.

Don't worry about that. Finite-state machines can easily accommodate a
much larger dictionary. The data is all pretty much regular, so the
automaton will be small anyway, as lots of arcs will be shared.

> Using statistics may reduce the size of the dictionary by excluding
> the rarest/illegal constructions, and the process should be automated
> to keep it repeatable and predictable. This means automating the
> exclusions as well.

Use exclusions only to improve the quality of tagging. The performance
of the BaseTagger is not a problem.

Regards,
Marcin

------------------------------------------------------------------------------
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel