Daniel Naber wrote:
> Hi,
>
> http://tatoeba.org is a free (CC-BY) collection of sentences in a lot of
> languages. It complements Wikipedia as it contains a different style of
> sentences, for example ones with the personal pronouns I, you, and we,
> which rarely occur in the Wikipedia. I have modified the data on our
> server to also use Tatoeba data. For now, it is active in the rule
> editor to protect against false alarms. I will later activate it for our
> Wikipedia check (which then isn't a pure Wikipedia check anymore...).
>
> If the rule editor says that your rule has been tested against 500,000
> articles, that's actually wrong - it's 500,000 sentences. Please update
> the 'Community Website' translations at Transifex to get this fixed.
>
> For the rule editor, Tatoeba and Wikipedia are mixed 1:1, i.e. for each
> sentence from Wikipedia, there's one sentence from Tatoeba. The whole
> process is now based on sentences: there's a new class SentenceSource
> that can be extended to feed sentences from some data source into
> LanguageTool. For now, there are classes for Wikipedia XML dumps and the
> Tatoeba CSV export. Tatoeba needs to be filtered first to only contain
> the relevant language.
>
> The fact that everything is based on sentences now has a side effect
> that we'll miss errors that span sentence boundaries, e.g. coherency
> checks.
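
For anyone following along, the two steps described in the quoted message (filtering the Tatoeba export to one language, then mixing 1:1 with Wikipedia sentences) could be sketched roughly as below. This is only an illustration and not LanguageTool's actual SentenceSource code: the class and method names are made up, and it assumes the Tatoeba export has tab-separated rows of the form id, language code, sentence text.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, NOT part of LanguageTool; names are illustrative only.
class SentenceMixSketch {

    // Keep only the sentence text of rows whose language column matches langCode.
    // Assumption: each row is "id<TAB>lang<TAB>text".
    static List<String> filterTatoeba(List<String> rows, String langCode) {
        List<String> result = new ArrayList<>();
        for (String row : rows) {
            String[] parts = row.split("\t", 3);
            if (parts.length == 3 && parts[1].equals(langCode)) {
                result.add(parts[2]);
            }
        }
        return result;
    }

    // Alternate sentences 1:1, one from Wikipedia followed by one from Tatoeba.
    // Assumption: mixing stops as soon as either source runs out.
    static List<String> mixOneToOne(Iterator<String> wikipedia, Iterator<String> tatoeba) {
        List<String> mixed = new ArrayList<>();
        while (wikipedia.hasNext() && tatoeba.hasNext()) {
            mixed.add(wikipedia.next());
            mixed.add(tatoeba.next());
        }
        return mixed;
    }
}
```

Because each sentence is handed on individually, a check that needs to look at two neighbouring sentences would see nothing to compare in a setup like this, which matches the side effect mentioned at the end of the quoted message.
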

Thanks for that!

Will sentences from Tatoeba be used in the daily diff?
(http://www.languagetool.org/regression-tests/?C=M;O=D)

Regards
Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel