Re: [Languagetool] Possible way of speeding up LanguageTool
On Monday, 16 April 2012, Jarek wrote:

Jarek,

Attached the patch and modified files (as it may be easier just to replace them instead of applying the patch).

Thanks for the patch. I'm getting a test case failure in EnglishUnpairedBracketsRuleTest with the patch applied. Can you reproduce that (by calling ant test)?

Regards
Daniel
-- http://www.danielnaber.de

___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
Re: [Languagetool] Possible way of speeding up LanguageTool
Dominique Pellé wrote:

The slowest startup times are for Chinese, German and Ukrainian. It would be interesting to find out why.

That's true. Some of my rules in the German file do checks against huge regular expressions, which is probably quite inelegant. This might be a reason.
Re: [Languagetool] Possible way of speeding up LanguageTool
On 2012-01-14 01:39, Jimmy O'Regan wrote:

2012/1/13 Dominique Pellé <dominique.pe...@gmail.com>:

I see that the Ukrainian tagger uses a MySpell dictionary (the UkrainianMyspellTagger class), which reads and parses a text file, dist/resource/uk/ukrainian.dict, of 1,841,900 bytes, so that's not fast. It is also using String.matches(regexp) to parse it, which is not fast. Using a precompiled Pattern with the Matcher class should be faster, as indicated here: http://www.regular-expressions.info/java.html Anyway, transforming the MySpell file into a binary dictionary should be faster, as described here: http://languagetool.wikidot.com/developing-a-tagger-dictionary

It was presumably done that way because there was no generally available morphological dictionary for Ukrainian. UGTag has been available since last year or the year before, so a dictionary generated from that would probably be better.

As far as I remember, the tagger based on MySpell is not only slow but also not used in any rules. UGTag is the way to go, and we had it on our GSoC list. Adding it should be fairly trivial: it is already in Java, so simply writing up an interface should be easy. But then again, we need someone who'd use the thing to write rules ;)

Regards
Marcin
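To illustrate the String.matches() vs. Matcher point above: String.matches() compiles its regular expression on every call, while a Pattern compiled once can be reused for every line of the dictionary. This is only a sketch — the DictLineParser class and the "word lemma TAG" line format below are made up for illustration; the real ukrainian.dict format may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for dictionary lines of the form "word lemma TAG".
public class DictLineParser {

    // Compiled once and shared: avoids re-compiling the regex for each
    // of the thousands of lines in a ~1.8 MB dictionary file, which is
    // what String.matches() effectively does on every call.
    private static final Pattern LINE =
            Pattern.compile("(\\S+)\\s+(\\S+)\\s+(\\S+)");

    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;  // malformed line
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

The parser could then be called in a loop over the file's lines, reusing the shared Pattern throughout.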
Re: [Languagetool] Possible way of speeding up LanguageTool
On 14 January 2012 at 16:50, Marcin Miłkowski wrote:

On 2012-01-14 00:14, Dominique Pellé wrote:

Dominique Pellé wrote:

Marco A.G.Pinto wrote:

Hello! I don't know how it was coded, but I tried to run a script Daniel told me about and it took a lot of time. My suggestion is that when starting LibreOffice, instead of reading from the XMLs, the contents would already be in files containing arrays with the words. Let me give you an example for the English XML file:

ARRAY:
Possible Typos 2
ARE_STILL_THE_SOME
IS_EVEN_WORST
Grammar 1
WANT_THAT_I
#END#

This is a very simple example of optimization. For example, it has the type of grammar error and, in the next position, the number of entries for it (convert string to number). It ends with a #END#.

Hi Marco,

I can't say that I like it. It would be messy. XML syntax can be checked, for example. Anyway, I doubt that it would help to speed things up. Before doing optimizations, it is always necessary to measure first; otherwise you might try to optimize something that takes only 1% of the time in the first place. Measuring also lets you verify objectively, with numbers, that whatever you change actually helps.

After reading your email, I was curious about the startup time (on the command line, not in LibreOffice), so I created a script to measure it for all languages. For each language, the script:
- counts the number of XML rules. I used the latest in SVN (r6239), so it can differ slightly from the numbers at http://www.languagetool.org/languages/
- measures startup time (3 times, to avoid outliers) when launching LanguageTool with an empty sentence.
Here is the script: http://dominique.pelle.free.fr/startup-time-lt.sh

Here is the result:

$ cd languagetool
$ ./startup-time-lt.sh
lang | #rules | startup time in sec (3 samples)
-----+--------+--------------------------------
 ast |     61 | 0.29 0.26 0.27
 br  |    353 | 0.41 0.39 0.40
 ca  |    214 | 0.83 0.83 0.83
 zh  |    328 | 2.34 2.30 2.28
 da  |     22 | 0.82 0.80 0.79
 nl  |    336 | 0.88 0.88 0.89
 en  |    789 | 0.96 0.95 0.96
 eo  |    262 | 0.84 0.85 0.84
 fr  |   2015 | 0.51 0.52 0.51
 gl  |    157 | 0.96 0.97 0.97
 de  |    717 | 1.74 1.91 1.76
 is  |     39 | 0.77 0.82 0.81
 it  |     94 | 0.28 0.28 0.28
 km  |     24 | 0.88 0.84 0.83
 lt  |      6 | 0.20 0.21 0.22
 ml  |     23 | 0.81 0.78 0.81
 pl  |   1029 | 1.17 1.17 1.16
 ro  |    452 | 0.99 0.98 0.94
 ru  |    149 | 0.95 0.91 0.92
 sk  |     58 | 1.00 0.95 0.93
 sl  |     86 | 0.80 0.86 0.83
 es  |     70 | 0.91 0.85 0.84
 sv  |     26 | 0.29 0.26 0.27
 tl  |     44 | 0.25 0.26 0.25
 uk  |     12 | 1.80 1.90 1.84

This was measured on a 5-year-old laptop (Linux x86, Intel(R) Core(TM) Duo CPU T2250 @ 1.73GHz).

What's interesting here is that the startup time does not strongly depend on the number of XML rules. Ukrainian (uk) has only 12 XML rules, yet it is 3.52 times slower than French (fr), which has 2015 rules! The slowest startup times are for Chinese, German and Ukrainian. It would be interesting to find out why.

Regards
-- Dominique

I spent a bit of time finding out why startup is slow for Ukrainian (uk) (~1.80 sec) even though it has only 12 XML rules:

uk | 12 | 1.80 1.90 1.84

First, I disabled the SRX tokenizer by commenting out getSentenceTokenizer() in src/java/org/languagetool/language/Ukrainian.java. This saves about half a second, bringing the startup time to:

uk | 12 | 1.31 1.31 1.29

Disabling the SRX tokenizer in Esperanto.java also saved half a second at startup for Esperanto. I find it odd that the SRX file src/resource/segment.srx contains the rules for *all* languages. Wouldn't it make more sense to have a smaller SRX file per language?
The function that loads it (SRXSentenceTokenizer in src/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java) knows the language.

But that would go against the SRX standard, which uses a hierarchy of processing. Note that for all languages we have a few common rules (paragraph splitting etc.), and others could fall back on English etc. So the single file is not actually a problem; it is a feature of SRX to have these things together. Anyway, the SRX tokenizer uses regular expressions, and if they are monsters (too many disjunctions), it could be slow. Optimizing the regexps in the SRX file will probably yield more savings than splitting it up and reading the parts separately. Note that for the other SRX languages, we don't have much overhead.

I'll try rewriting the Ukrainian section in the SRX file to remove some regular expressions.
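The kind of regex tightening suggested above can be sketched as follows. The abbreviation letters are made up for illustration (not taken from the actual segment.srx): a long alternation of single-character branches forces the regex engine to try each branch in turn, while an equivalent character class can be matched in a single step.

```java
import java.util.regex.Pattern;

// Sketch: two equivalent patterns for "an abbreviation letter followed by
// a period". The alternation form tries up to ten branches per position;
// the character-class form tests set membership once.
public class SrxRegexDemo {

    static final Pattern SLOW = Pattern.compile("(a|b|c|d|e|f|g|h|i|k)\\.");
    static final Pattern FAST = Pattern.compile("[a-ik]\\.");

    public static boolean slowMatch(String s) {
        return SLOW.matcher(s).matches();
    }

    public static boolean fastMatch(String s) {
        return FAST.matcher(s).matches();
    }
}
```

Both patterns accept exactly the same strings, so such a rewrite changes performance without changing segmentation behavior.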
Re: [Languagetool] Possible way of speeding up LanguageTool
Marco A.G.Pinto wrote:

Hello! I don't know how it was coded, but I tried to run a script Daniel told me about and it took a lot of time. My suggestion is that when starting LibreOffice, instead of reading from the XMLs, the contents would already be in files containing arrays with the words. Let me give you an example for the English XML file:

ARRAY:
Possible Typos 2
ARE_STILL_THE_SOME
IS_EVEN_WORST
Grammar 1
WANT_THAT_I
#END#

This is a very simple example of optimization. For example, it has the type of grammar error and, in the next position, the number of entries for it (convert string to number). It ends with a #END#.

Hi Marco,

I can't say that I like it. It would be messy. XML syntax can be checked, for example. Anyway, I doubt that it would help to speed things up. Before doing optimizations, it is always necessary to measure first; otherwise you might try to optimize something that takes only 1% of the time in the first place. Measuring also lets you verify objectively, with numbers, that whatever you change actually helps.

After reading your email, I was curious about the startup time (on the command line, not in LibreOffice), so I created a script to measure it for all languages. For each language, the script:
- counts the number of XML rules. I used the latest in SVN (r6239), so it can differ slightly from the numbers at http://www.languagetool.org/languages/
- measures startup time (3 times, to avoid outliers) when launching LanguageTool with an empty sentence.
Here is the script: http://dominique.pelle.free.fr/startup-time-lt.sh

Here is the result:

$ cd languagetool
$ ./startup-time-lt.sh
lang | #rules | startup time in sec (3 samples)
-----+--------+--------------------------------
 ast |     61 | 0.29 0.26 0.27
 br  |    353 | 0.41 0.39 0.40
 ca  |    214 | 0.83 0.83 0.83
 zh  |    328 | 2.34 2.30 2.28
 da  |     22 | 0.82 0.80 0.79
 nl  |    336 | 0.88 0.88 0.89
 en  |    789 | 0.96 0.95 0.96
 eo  |    262 | 0.84 0.85 0.84
 fr  |   2015 | 0.51 0.52 0.51
 gl  |    157 | 0.96 0.97 0.97
 de  |    717 | 1.74 1.91 1.76
 is  |     39 | 0.77 0.82 0.81
 it  |     94 | 0.28 0.28 0.28
 km  |     24 | 0.88 0.84 0.83
 lt  |      6 | 0.20 0.21 0.22
 ml  |     23 | 0.81 0.78 0.81
 pl  |   1029 | 1.17 1.17 1.16
 ro  |    452 | 0.99 0.98 0.94
 ru  |    149 | 0.95 0.91 0.92
 sk  |     58 | 1.00 0.95 0.93
 sl  |     86 | 0.80 0.86 0.83
 es  |     70 | 0.91 0.85 0.84
 sv  |     26 | 0.29 0.26 0.27
 tl  |     44 | 0.25 0.26 0.25
 uk  |     12 | 1.80 1.90 1.84

This was measured on a 5-year-old laptop (Linux x86, Intel(R) Core(TM) Duo CPU T2250 @ 1.73GHz).

What's interesting here is that the startup time does not strongly depend on the number of XML rules. Ukrainian (uk) has only 12 XML rules, yet it is 3.52 times slower than French (fr), which has 2015 rules! The slowest startup times are for Chinese, German and Ukrainian. It would be interesting to find out why.

Regards
-- Dominique
Re: [Languagetool] Possible way of speeding up LanguageTool
Dominique Pellé wrote:

Marco A.G.Pinto wrote:

Hello! I don't know how it was coded, but I tried to run a script Daniel told me about and it took a lot of time. My suggestion is that when starting LibreOffice, instead of reading from the XMLs, the contents would already be in files containing arrays with the words. Let me give you an example for the English XML file:

ARRAY:
Possible Typos 2
ARE_STILL_THE_SOME
IS_EVEN_WORST
Grammar 1
WANT_THAT_I
#END#

This is a very simple example of optimization. For example, it has the type of grammar error and, in the next position, the number of entries for it (convert string to number). It ends with a #END#.

Hi Marco,

I can't say that I like it. It would be messy. XML syntax can be checked, for example. Anyway, I doubt that it would help to speed things up. Before doing optimizations, it is always necessary to measure first; otherwise you might try to optimize something that takes only 1% of the time in the first place. Measuring also lets you verify objectively, with numbers, that whatever you change actually helps.

After reading your email, I was curious about the startup time (on the command line, not in LibreOffice), so I created a script to measure it for all languages. For each language, the script:
- counts the number of XML rules. I used the latest in SVN (r6239), so it can differ slightly from the numbers at http://www.languagetool.org/languages/
- measures startup time (3 times, to avoid outliers) when launching LanguageTool with an empty sentence.
Here is the script: http://dominique.pelle.free.fr/startup-time-lt.sh

Here is the result:

$ cd languagetool
$ ./startup-time-lt.sh
lang | #rules | startup time in sec (3 samples)
-----+--------+--------------------------------
 ast |     61 | 0.29 0.26 0.27
 br  |    353 | 0.41 0.39 0.40
 ca  |    214 | 0.83 0.83 0.83
 zh  |    328 | 2.34 2.30 2.28
 da  |     22 | 0.82 0.80 0.79
 nl  |    336 | 0.88 0.88 0.89
 en  |    789 | 0.96 0.95 0.96
 eo  |    262 | 0.84 0.85 0.84
 fr  |   2015 | 0.51 0.52 0.51
 gl  |    157 | 0.96 0.97 0.97
 de  |    717 | 1.74 1.91 1.76
 is  |     39 | 0.77 0.82 0.81
 it  |     94 | 0.28 0.28 0.28
 km  |     24 | 0.88 0.84 0.83
 lt  |      6 | 0.20 0.21 0.22
 ml  |     23 | 0.81 0.78 0.81
 pl  |   1029 | 1.17 1.17 1.16
 ro  |    452 | 0.99 0.98 0.94
 ru  |    149 | 0.95 0.91 0.92
 sk  |     58 | 1.00 0.95 0.93
 sl  |     86 | 0.80 0.86 0.83
 es  |     70 | 0.91 0.85 0.84
 sv  |     26 | 0.29 0.26 0.27
 tl  |     44 | 0.25 0.26 0.25
 uk  |     12 | 1.80 1.90 1.84

This was measured on a 5-year-old laptop (Linux x86, Intel(R) Core(TM) Duo CPU T2250 @ 1.73GHz).

What's interesting here is that the startup time does not strongly depend on the number of XML rules. Ukrainian (uk) has only 12 XML rules, yet it is 3.52 times slower than French (fr), which has 2015 rules! The slowest startup times are for Chinese, German and Ukrainian. It would be interesting to find out why.

Regards
-- Dominique

I spent a bit of time finding out why startup is slow for Ukrainian (uk) (~1.80 sec) even though it has only 12 XML rules:

uk | 12 | 1.80 1.90 1.84

First, I disabled the SRX tokenizer by commenting out getSentenceTokenizer() in src/java/org/languagetool/language/Ukrainian.java. This saves about half a second, bringing the startup time to:

uk | 12 | 1.31 1.31 1.29

Disabling the SRX tokenizer in Esperanto.java also saved half a second at startup for Esperanto. I find it odd that the SRX file src/resource/segment.srx contains the rules for *all* languages. Wouldn't it make more sense to have a smaller SRX file per language?
The function that loads it (SRXSentenceTokenizer in src/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java) knows the language.

Then I disabled the Ukrainian tagger (commenting out getTagger() in src/java/org/languagetool/language/Ukrainian.java). This saved more than a second, bringing the startup time to:

uk | 12 | 0.21 0.22 0.21

I see that the Ukrainian tagger uses a MySpell dictionary (the UkrainianMyspellTagger class), which reads and parses a text file, dist/resource/uk/ukrainian.dict, of 1,841,900 bytes, so that's not fast. It is also using String.matches(regexp) to parse it, which is not fast. Using a precompiled Pattern with the Matcher class should be faster, as indicated here: http://www.regular-expressions.info/java.html Anyway, transforming the MySpell file into a binary dictionary should be faster, as described here: http://languagetool.wikidot.com/developing-a-tagger-dictionary

Regards
-- Dominique
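On the binary-dictionary point: LanguageTool's real tagger dictionaries are finite-state automata built with the tools described on the wiki page above. The general idea, though — pay the parsing cost once offline, then load a compact binary form at startup instead of regex-parsing a large text file — can be sketched with plain java.io. Everything below (class name, entry format) is illustrative only, not LanguageTool's actual format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative word -> POS-tag dictionary serialized to a flat binary form.
public class BinaryDictDemo {

    // Build the binary form once (offline, at release time).
    public static byte[] build(Map<String, String> entries) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeInt(entries.size());
            for (Map.Entry<String, String> e : entries.entrySet()) {
                out.writeUTF(e.getKey());   // word form
                out.writeUTF(e.getValue()); // POS tag
            }
        }
        return bos.toByteArray();
    }

    // Load at startup: a straight read, no regex work per line.
    public static Map<String, String> load(byte[] data) throws IOException {
        Map<String, String> map = new HashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                map.put(in.readUTF(), in.readUTF());
            }
        }
        return map;
    }
}
```

A real FSA-based dictionary is far more compact than this flat encoding, but even this sketch shows where the startup saving comes from: the expensive parsing happens once, before shipping.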