On 06.04.2016 at 14:55, Juan Martorell wrote:
>
> On 4 April 2016 at 19:28, Marcin Miłkowski <list-addr...@wp.pl
> <mailto:list-addr...@wp.pl>> wrote:
>
>     Hi,
>
>     On 03.04.2016 at 12:46, Juan Martorell wrote:
>     > I realized this because every rule I added introduced a new
>     > regression, sometimes greatly worsening what we had before.
>
>     I could try to help avoid this with some disambiguation tricks.
>
> Please update the wiki so everyone can benefit from them.
Heh, the trouble is that it's mostly implicit knowledge. I'll try to
write up some strategies.

>     > I therefore blame the dictionary for all this, and no good
>     > disambiguation can be done without decent tagging. I am tired of
>     > waiting for someone else to volunteer; every time someone shows
>     > up, she seems intimidated by the task and eventually loses
>     > interest.
>
>     What's the problem with the dictionary?
>
>     (1) It assigns too many POS tags, making disambiguation difficult.
>     (2) It lacks important POS tags, so disambiguation cannot help.
>
>     If (1), I can really help by writing up some methods I have found
>     for Polish and English. I can read Spanish, so this should be
>     fairly easy.
>
> Mostly (2). Freeling tends to keep the lexical roots and calculate the
> inflections, so the dictionary is rather incomplete.

That means you should expand the dictionary a lot, IMHO.

> To transform an adjective into an adverb, in English you use the
> suffix `-ly` and in Spanish you use the suffix `-mente`:
>
> Equal --> equally
> Igual --> igualmente
>
> I found 18,340 candidates for suffixation in the Spanish dictionary
> for this particular case.

I'd add them to the dictionary. Why? Because these things might be false
alarms, and removing them later by hand might be easier.

> The same goes for diminutives, augmentatives and superlatives.
> Depending on the region these may vary, but if you want to be fully
> inclusive <https://es.wikipedia.org/wiki/Diminutivo>, you have to
> include 17 diminutives, both genders; 9 augmentatives, both genders;
> and 1 superlative, both genders, excluding the irregular forms
> <https://es.wikipedia.org/wiki/Superlativo>. They apply to the same
> ~18,000 candidates. They are widely used in writing, so it is worth
> including them.

Well, I don't really tag diminutives (except for the most frequent
ones, which are already included in the dictionary), and it doesn't
really hurt.
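To make the suffixation idea concrete, here is a toy sketch in Python.
The rules are deliberately simplified (no accent shifts, no irregular
stems, only a handful of the suffixes from the Wikipedia list), so the
output should be treated as raw candidates to filter, not as
dictionary-ready forms:

```python
# Toy generator for -mente adverbs and a few diminutive/augmentative
# forms. Simplified rules; real Spanish morphology needs more care.

def adverb_from_adjective(adj):
    # -mente attaches to the feminine form: rápido -> rápida -> rápidamente
    if adj.endswith("o"):
        adj = adj[:-1] + "a"
    return adj + "mente"

def diminutive_candidates(adj):
    # A small subset of the 17 diminutive / 9 augmentative suffixes
    # mentioned above, both genders:
    stem = adj[:-1] if adj[-1] in "oa" else adj
    return [stem + s for s in ("ito", "ita", "illo", "illa", "ote", "ota")]

print(adverb_from_adjective("igual"))   # igualmente
print(adverb_from_adjective("rápido"))  # rápidamente
print(diminutive_candidates("alto"))    # ['altito', 'altita', ...]
```

Running something like this over the 18,340 candidates would produce the
raw list; the filtering step is the interesting part.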
> It is quite common to attach pronouns to the verb, thus including
> information about the direct and/or indirect object, or the
> passive/impersonal voice. The number of combinations is huge:
>
> infinitive + pronoun as DO; example, from /subir/: /subir*me*/,
> /subir*te*/, /subir*se*/, /subir*lo*/, /subir*la*/, /subir*nos*/,
> /subir*os*/, /subir*se*/, /subir*los*/, /subir*las*/.
> infinitive + pronoun as IO + pronoun as DO; example, from /subir/:
> /subír*teme*/, /subír*seme*/, /subír*mete*/, /subír*sete*/,
> /subír*senos*/, /subír*noslos*/, /subír*oslas*/, etc.
> imperative + pronoun as DO; example, from /subir/: /súbe*me*/,
> /subid*me*/, /súba*me*/, /súban*me*/; /súbe*te*/, /subí*os*/,
> /súba*se*/, /súban*se*/, etc.
> imperative + pronoun as IO + pronoun as DO; example, from /subir/:
> /súbe*melo*/, /súbe*telo*/, /súbe*teme*/, /subí*oslas*/, etc.

I'd go for Jaume's strategies with Catalan. They are probably exactly
suited to your situation. I would tokenize this internally if it doesn't
lead to any ambiguity (you don't need a space to tokenize). I don't do
this for Polish because we have lots of ambiguities: "miałem" might be
the past tense of "mieć" ("miał" + "em", with "em" being the agglutinate
for the first person singular), but it's also the instrumental of the
noun "miał", which shouldn't be tokenized. We have a stream of tokens,
and we would need to replace it with a graph (one edge for "miałem",
another for "miał" + "em"), which is not exactly the nicest thing to
play with. So I don't tokenize but keep a non-tokenized list hardcoded
in the dictionary.

> Gerunds accept the same derivation.
>
> These derivations are enough in themselves to justify some automation:
> so far, 18,000 adjectives * (1 adverb + 17 * 2 diminutives + 9 * 2
> augmentatives + 1 superlative) =~ 972,000 words to include in the
> dictionary. If you add all the pronominal derivations: 7,654 verbs *
> (10 infinitive+DO forms + 6!/(2!(6-2)!) = 15 IO+DO forms) * 3 verbal
> forms = 7,654 * (10 + 15) * 3 =~ 574,000 words to include in the
> dictionary.
>
> That makes a total of approx. 1.5 million words to include, excluding
> the American inflections.

That's a very small number overall.

> Today the dictionary holds ~660,000 words.

The Polish dictionary has around 4,500,000 words. But hey, we have
crazily inflected nouns ;)

> Other derivations imply prefixing, like /*re-*/ or /*anti-*/. These
> can be applied to verbs (all conjugations), adjectives and nouns:
> /*re*iniciar/, /*anti*personal/, /*re*calificar/, /*re*incidente/,
> /*re*forestar/. Some of these are present in the dictionary, some are
> not; I will not include them in the count, but it may be relevant.

I only include productive prefixes for spelling, not for tagging. As
Jaume mentioned, this may lead to a large number of false matches.

>     Let's discuss this in depth. I'd love to see Spanish better
>     supported.
>
> I see an opportunity in using a word database with frequencies. It
> would shortlist some potential exclusions. It won't be too complex
> using the Wikipedia dump.
>
> BTW, I see that the Wikipedia dump has some bias. It is mainly written
> in an indirect style, and its neutrality greatly reduces the use of
> augmentatives, diminutives, superlatives and imperatives. The
> subjunctive and conditional are also less likely to occur. I was
> looking for some stylistic variation and I came across Wikisource
> <https://en.wikisource.org/wiki/Wikisource:What_is_Wikisource%3F>. It
> has some bias as well, such as texts being old due to copyright, or
> old-fashioned language, but there is an opportunity here, since it has
> many kinds of documents, from legal texts to science fiction novels,
> original or translated texts, etc.

I used our indexer on a large newspaper corpus, and on a corpus of
literary works. I agree, Wikipedia is strongly biased. I also have a
large English blog corpus that I downloaded from somewhere, also indexed
on my hard disk.
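By the way, the back-of-the-envelope numbers above check out; here is a
small Python sketch that re-derives them. All the counts (18,000
adjectives, 7,654 verbs, 10 clitic forms, 3 verb forms) are taken from
the mail itself, not from the actual dictionary:

```python
from math import comb

# Enclitic pronouns that can attach to an infinitive as direct object
# (the mail counts 10 forms, /se/ serving both singular and plural):
do_clitics = ["me", "te", "se", "lo", "la", "nos", "os", "los", "las"]
print([f"subir{c}" for c in do_clitics][:4])
# ['subirme', 'subirte', 'subirse', 'subirlo']

io_do_pairs = comb(6, 2)        # 6!/(2!(6-2)!) = 15 IO+DO combinations

verb_forms = 7654 * (10 + io_do_pairs) * 3   # 3 verb forms take clitics
adj_forms = 18000 * (1 + 17 * 2 + 9 * 2 + 1) # adverb + dim. + augm. + superl.
print(verb_forms)                # 574050   (~574,000)
print(adj_forms)                 # 972000
print(verb_forms + adj_forms)    # 1546050  (~1.5 million)
```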
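As for shortlisting by frequency: something along these lines (a toy
sketch; the `corpus` string stands in for a real Wikipedia/Wikisource
dump, and `candidates` for the mechanically generated forms) would
already do as a first pass:

```python
from collections import Counter
import re

# Generate candidate forms mechanically, then keep only those that are
# actually attested in a corpus.
corpus = "Subirse al tren es igualmente rápido. Quiso subirse otra vez."
candidates = ["subirse", "subirmela", "igualmente", "rapidisimamente"]

freqs = Counter(re.findall(r"\w+", corpus.lower()))
attested = [w for w in candidates if freqs[w] > 0]
print(attested)   # ['subirse', 'igualmente']
```

With a real dump you'd also want a frequency threshold rather than mere
attestation, to keep out typos that happen to look like valid
derivations.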
> I tested the latest dump, eswikisource-20160305-pages-articles.xml
> (~175 MB compressed, ~774 MB uncompressed), and it seems to proofread
> OK with the regular wikimedia checker.
> I encourage you to check your language in the Wikimedia backup folder
> <http://dumps.wikimedia.org/backup-index.html> and test it. You will
> find an immense English corpus, and respectable sizes for German,
> Polish, Catalan and so on.

Thanks for the pointer. The more data, the better.

> So growing the dictionary may improve speed at the cost of memory.

Don't worry about that. Finite-state machines can easily accommodate a
much larger dictionary. The data is all pretty much regular, so the
automaton will be small anyway, as lots of arcs will be shared.

> Using statistics may reduce the size of the dictionary by excluding
> the rarest/illegal constructions, and the process should be automated
> to keep it repeatable and predictable. This means automating the
> exclusions as well.

Use exclusions only to improve the quality of tagging. The performance
of the BaseTagger is not a problem.

Regards,
Marcin

------------------------------------------------------------------------------
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel