Hi,

On 03.04.2016 at 12:46, Juan Martorell wrote:
> Hi all,
>
> After a lot of thinking, and with the aim of agility, I decided to
> review the current Spanish roadmap in order to make it more pragmatic
> and sensible. I realize that disambiguation is quite important, but
> without correct tagging it is almost useless. I realized this because
> every rule I added introduced a new regression, sometimes greatly
> worsening what we had before.
I could try to help you avoid this with some disambiguation tricks.

> That is one of the reasons why you see little activity on GitHub for
> Spanish: the rule set is at a stage where a small addition may cause a
> small or a large regression. A good example is the regression
> <https://languagetool.org/regression-tests/20160328/result_es_20160328.html>
> introduced by an apparently simple rule
> <https://github.com/languagetool-org/languagetool/commit/615992a24f982787e332195ac58bc7c1d0ee8298>:
> no new matches detected, but three false positives popping up in the
> Wikipedia tests.
>
> I therefore blame the dictionary for all this, and no good
> disambiguation can be done without decent tagging. I am tired of
> waiting for someone else to volunteer; every time someone shows up,
> she seems intimidated by the task and eventually loses interest.

What's the problem with the dictionary?

(1) It assigns too many POS tags, making disambiguation difficult.
(2) It lacks important POS tags, so disambiguation cannot help.

If (1), I really can help by writing up some methods I have found for
Polish and English. I can read Spanish, so this should be fairly easy.

> I think that improving the tagging at its very first stages will
> reduce the complexity of disambiguation and rules. I will start on
> that, but first I'd like to hear your ideas on how to address it.
>
> The options I've thought of as of today are:
>
> 1.- *Customize the LT tagger* so that compound words and American
> idioms are recognized.

Do you need to tag multiword expressions or compound words (written
without a space)? These are two different things. If compounds have a
hyphen, you might want to split only some of the words. See my solution
in org.languagetool.tokenizers.pl.PolishWordTokenizer -- there is a
small number of different kinds of compounds, and the tokenizer uses
the tagger to check whether to split a word or not.
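For illustration, that tagger-assisted splitting can be sketched roughly like this (a minimal sketch, not the actual PolishWordTokenizer code; `TagLookup` and `CompoundAwareTokenizer` are hypothetical names standing in for LT's tagger and tokenizer interfaces):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for a tagger/dictionary lookup.
interface TagLookup {
    boolean isKnown(String word);
}

// Splits a hyphenated compound only when the tagger knows both parts;
// otherwise the whole compound is kept as a single token.
class CompoundAwareTokenizer {
    private final TagLookup lookup;

    CompoundAwareTokenizer(TagLookup lookup) {
        this.lookup = lookup;
    }

    List<String> tokenize(String token) {
        int hyphen = token.indexOf('-');
        if (hyphen > 0 && hyphen < token.length() - 1) {
            String left = token.substring(0, hyphen);
            String right = token.substring(hyphen + 1);
            if (lookup.isKnown(left) && lookup.isKnown(right)) {
                return Arrays.asList(left, "-", right);
            }
        }
        return Collections.singletonList(token);
    }
}
```

The decision logic in LT itself is richer, but the core idea is the same: consult the dictionary before committing to a split, so that unknown or lexicalized compounds stay in one piece.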
> 2.- *Preprocess the tag dictionary* so that all legal POS forms of
> lemmas are built by script, for both the tagging and synthesizing
> dictionaries.

This might not be feasible for some compound forms. It really depends
on how productive the compounds are. Could you give some examples?

> I think the latter is the more pragmatic option, since there is a good
> starting point with the current tagger borrowed from Freeling, and it
> is closer to their roadmap. It saves processing at the expense of
> dictionary size at proofing time, but today computing is more
> expensive than storage, so it looks like a good trade-off to me.
> Besides, it avoids forking classes and thus keeps the benefits of code
> shared with other languages.
>
> Any ideas?
>
> Such an endeavour means some extra coding; I guess awk will not be
> enough to keep it clearly documented.

Yes, awk is good only for extremely simple tasks. Go for Java (or at
least Python).

> Data processing will need some extra help, maybe some NoSQL solution
> to take advantage of big tables and querying. As a language, Java
> could be used, but Python may yield more compact and readable code, or
> perhaps some other paradigm.
>
> I see little room for this in the LT repository, so maybe I should
> start somewhere else, like in LTLab
> <https://github.com/jmartorell/LTlab>, incubate it somehow, and
> integrate it into the mainstream should it be beneficial for some
> other languages.
>
> Comments will be highly appreciated.

Let's discuss this in depth. I'd love to see Spanish better supported.

Best
Marcin

> On 6 June 2014 at 20:45, Juan Martorell <juan.martor...@gmail.com>
> wrote:
>
> Hi all,
>
> I have been away for a long time, but that's another story that must
> be proofread with the latest version of LT.
>
> I'm back again, and there is a lot of work that needs to be done
> before the Spanish module can be considered "stable".
>
> So I want to share with you my view on the roadmap for Spanish.
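On the dictionary-preprocessing option above: for Spanish verb+enclitic forms, a build-time expansion step might look roughly like this (a hypothetical sketch; the tab-separated form/lemma/tag entry layout is one common plain-text dictionary source format, and the `VMN` infinitive tag and the `:me`-style tag suffix are illustrative assumptions, not the actual Spanish tagset):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical preprocessing step: expand an infinitive entry into
// verb+enclitic forms ("dar" -> "darme", "darte", "darlo", ...) so the
// FSA tag dictionary can recognize attached pronouns directly.
class CliticExpander {
    // Illustrative subset of Spanish enclitic pronouns.
    private static final String[] CLITICS = {
        "me", "te", "se", "lo", "la", "le", "nos", "los", "las", "les"
    };

    // Input: one "form<TAB>lemma<TAB>tag" line of a dictionary source file.
    List<String> expand(String entry) {
        String[] cols = entry.split("\t");
        List<String> out = new ArrayList<>();
        out.add(entry); // always keep the original entry
        if (cols.length == 3 && cols[2].startsWith("VMN")) { // infinitive (illustrative tag)
            for (String clitic : CLITICS) {
                out.add(cols[0] + clitic + "\t" + cols[1] + "\t" + cols[2] + ":" + clitic);
            }
        }
        return out;
    }
}
```

Running every infinitive (and, in a fuller version, imperative and gerund) entry through such an expander before building the FSA keeps the runtime tagger untouched at the cost of a larger dictionary, which is exactly the trade-off described above.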
> *1st and foremost: disambiguator*
>
> Developing a disambiguator is a harder endeavour than I initially
> thought. A little change has a large impact on rule triggering,
> creating a "butterfly effect" that spreads across the language rules.
> It can boost or plummet performance.
>
> So it is critical to develop the disambiguator with high quality from
> minute one, because the accuracy and complexity of the rules in the
> grammar.xml file are very sensitive to minor disambiguator changes.
>
> Disambiguation changes the strategy of rule design, and therefore the
> rules should not grow too much until effective disambiguation is put
> into service.
>
> Thank you very much, Marcin, for the useful disambiguator logging.
>
> My current strategy for disambiguation is to start with the longer
> constructions and then work down to two-token constructions. Positive
> and negative examples should be included.
>
> *2nd stage: Dictionary*
>
> I've noticed that several rules trigger because of incorrect POS tags
> disclosed by the FSA dictionary. These issues can be solved, but there
> are others that cannot: some pronouns are attached to verbs, and they
> need to be identified to get a correct POS tag.
>
> *3rd stage: Rules*
>
> The aim for Spanish is, and always has been, creating a reduced rule
> set with meaningful rules:
>
> Pick rules for common mistakes.
> Use inexpensive regular expressions.
> Simplify general rules, starting from similar rules when possible.
> Use synthesis for suggestions when possible.
>
> A rule that is seldom matched in common texts but is expensive should
> be disabled by default.
>
> Rules are grouped by categories and by rule groups. It's important to
> put the rules where they belong so they are easy to find.
>
> *Helper tools*
>
> To ensure the quality of the rules, I developed a set of tools
> combining bash scripts and graphical diff tools with a varied corpus.
> They are basically isolated in a folder, but they need access to the
> deployed command-line version of LT.
>
> I am keen to share them, but I don't want to taint the LT code.
> Should I do that in a separate GitHub project? It's just an idea.
>
> Comments are welcome.
>
> Best regards,
> Juan Martorell

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel