Re: [Languagetool] Possible way of speeding up LanguageTool

2012-04-16 Thread Daniel Naber
On Monday, 16 April 2012, Jarek wrote:

Jarek,

 Attached are the patch and the modified files (as it may be easier to just
 replace them instead of applying the patch).

Thanks for the patch. I'm getting a test case failure in 
EnglishUnpairedBracketsRuleTest with the patch applied. Can you reproduce 
that (by calling ant test)?

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: [Languagetool] Possible way of speeding up LanguageTool

2012-01-14 Thread Jan Schreiber
Dominique Pellé wrote:
 The slowest startup times are for Chinese, German
 and Ukrainian. It would be interesting to find out why.

That's true.

Some of my rules in the German file run checks against huge regular
expressions, which is probably quite inelegant. That might be one reason.
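
For what it's worth, the cost is easy to see in a standalone micro-benchmark; the
word list and pattern below are made up for illustration and are not taken from
the German rule file:

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative micro-benchmark (not LanguageTool code): compare matching a
// token against a huge alternation regex with a plain HashSet lookup.
public class HugeRegexDemo {

  public static void main(String[] args) {
    // Build an artificial alternation with many branches, standing in for a
    // long word list embedded in a rule's regular expression.
    StringBuilder alternation = new StringBuilder();
    Set<String> wordSet = new HashSet<String>();
    for (int i = 0; i < 20000; i++) {
      if (i > 0) {
        alternation.append('|');
      }
      alternation.append("wort").append(i);
      wordSet.add("wort" + i);
    }

    long t0 = System.nanoTime();
    Pattern huge = Pattern.compile(alternation.toString());
    long t1 = System.nanoTime();
    System.out.printf("compiling the alternation: %.1f ms%n", (t1 - t0) / 1e6);

    String token = "wort19999";  // worst case: last branch of the alternation
    long t2 = System.nanoTime();
    boolean viaRegex = huge.matcher(token).matches();
    long t3 = System.nanoTime();
    boolean viaSet = wordSet.contains(token);
    long t4 = System.nanoTime();
    System.out.printf("regex match: %b in %.3f ms, set lookup: %b in %.3f ms%n",
        viaRegex, (t3 - t2) / 1e6, viaSet, (t4 - t3) / 1e6);
  }
}

A set lookup stays cheap no matter how many words are listed, while the
alternation is tried branch by branch, and compiling it is not free either.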



Re: [Languagetool] Possible way of speeding up LanguageTool

2012-01-14 Thread Marcin Miłkowski
On 2012-01-14 01:39, Jimmy O'Regan wrote:
 2012/1/13 Dominique Pellé <dominique.pe...@gmail.com>:
 I see that the Ukrainian tagger uses a MySpell file
 (via the UkrainianMyspellTagger class), which reads and
 parses a text file dist/resource/uk/ukrainian.dict
 of 1,841,900 bytes, so that's not fast. It also uses
 String.matches(regexp) to parse it, which is slow.
 Using the Matcher class should be faster, as indicated here:
   http://www.regular-expressions.info/java.html
 Anyway, transforming the MySpell file into a binary
 dictionary should be faster, as described here:
 http://languagetool.wikidot.com/developing-a-tagger-dictionary

 It was presumably done that way because there was no generally
 available morphological dictionary for Ukrainian. UGTag has been
 available since last year or the year before, so a dictionary
 generated from that would probably be better.

As far as I remember, the tagger based on MySpell is not only slow but 
is also not used in any rules. UGTag is the way to go, and we had it on 
our GSoC list. Adding it should be fairly trivial: it is already in Java, 
so simply writing an interface should be easy. But then again, we 
need someone who'd use the thing to write rules ;)
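
Just to make the "writing an interface" point concrete, here is a rough and
entirely hypothetical sketch of what a thin wrapper could look like; the
UgTagAnalyzer type and its analyze() method are placeholders, not the real
UGTag API:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of wrapping a Java morphological analyzer such as UGTag
// behind a minimal tagging method. The UgTagAnalyzer type and its analyze()
// call are placeholders -- the real UGTag API may look quite different.
public class UkrainianUgTagWrapper {

  private final UgTagAnalyzer analyzer = new UgTagAnalyzer();  // placeholder

  /** Returns, for each token, the readings reported by the analyzer. */
  public List<List<String>> tag(List<String> tokens) {
    List<List<String>> readings = new ArrayList<List<String>>();
    for (String token : tokens) {
      readings.add(analyzer.analyze(token));  // placeholder call
    }
    return readings;
  }

  // Stub standing in for the external analyzer so the sketch compiles.
  static class UgTagAnalyzer {
    List<String> analyze(String token) {
      List<String> result = new ArrayList<String>();
      result.add("UNKNOWN");  // a real analyzer would return morphological tags
      return result;
    }
  }
}

The real work would be mapping the analyzer's output onto LanguageTool's tag set
and tagger interface.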

Regards
Marcin





Re: [Languagetool] Possible way of speeding up LanguageTool

2012-01-14 Thread Yakov Reztsov

14 January 2012, 16:50, from Marcin Miłkowski:
 On 2012-01-14 00:14, Dominique Pellé wrote:
  Dominique Pellé wrote:
 
  Marco A.G.Pinto wrote:
 
  Hello!
 
  I don't know how it was coded, but I tried to run a script Daniel told me
  about, and it took a lot of time.
 
  My suggestion is that, when starting LibreOffice, instead of reading from
  the XMLs, the contents would already be in files containing arrays with the
  words.
 
  Let me give you an example for the English XML file:
  ARRAY:
  Possible Typos
  2
  ARE_STILL_THE_SOME
  IS_EVEN_WORST
  Grammar
  1
  WANT_THAT_I
  #END#
 
  This is a very simple example of an optimization. For example, it has the
  type of grammar error and, in the next position, the number of entries of
  that type (converting the string to a number). It ends with #END#.
 
 
  Hi Marco
 
  I can't say that I like it. It would be messy, and XML syntax can be
  validated, for example. Anyway, I doubt that it would help to speed things up.
 
  Before doing optimizations, it is always necessary to measure first;
  otherwise you might try to optimize something that takes only 1%
  of the time in the first place. Measuring also lets you verify objectively,
  with numbers, that whatever you change actually helps to speed things up.
 
  After reading your email, I was curious about the startup time (on the
  command line, not in LibreOffice), so I created a script to measure
  it for all languages:
 
  For each language, the script:
 
   - counts the number of XML rules. I used the latest in SVN, r6239,
  so it can differ slightly from the numbers at
  http://www.languagetool.org/languages/
  - measures startup time (3 times to avoid outliers) when
 launching LanguageTool with an empty sentence.
 
  Here is the script: http://dominique.pelle.free.fr/startup-time-lt.sh
 
  Here is the result:
 
  $ cd languagetool
  $ ./startup-time-lt.sh
 
  lang | #rules | startup time in sec (3 samples)
  -----+--------+--------------------------------
   ast |     61 |  0.29 0.26 0.27
    br |    353 |  0.41 0.39 0.40
    ca |    214 |  0.83 0.83 0.83
    zh |    328 |  2.34 2.30 2.28
    da |     22 |  0.82 0.80 0.79
    nl |    336 |  0.88 0.88 0.89
    en |    789 |  0.96 0.95 0.96
    eo |    262 |  0.84 0.85 0.84
    fr |   2015 |  0.51 0.52 0.51
    gl |    157 |  0.96 0.97 0.97
    de |    717 |  1.74 1.91 1.76
    is |     39 |  0.77 0.82 0.81
    it |     94 |  0.28 0.28 0.28
    km |     24 |  0.88 0.84 0.83
    lt |      6 |  0.20 0.21 0.22
    ml |     23 |  0.81 0.78 0.81
    pl |   1029 |  1.17 1.17 1.16
    ro |    452 |  0.99 0.98 0.94
    ru |    149 |  0.95 0.91 0.92
    sk |     58 |  1.00 0.95 0.93
    sl |     86 |  0.80 0.86 0.83
    es |     70 |  0.91 0.85 0.84
    sv |     26 |  0.29 0.26 0.27
    tl |     44 |  0.25 0.26 0.25
    uk |     12 |  1.80 1.90 1.84
 
  This was measured on a 5-year-old laptop
  (Linux x86, Intel(R) Core(TM) Duo CPU T2250 @ 1.73 GHz).
 
  What's interesting here is that the startup time
  does not depend strongly on the number of XML
  rules. Ukrainian (uk) has only 12 XML rules, yet it
  is 3.52 times slower than French (fr), which has
  2015 rules!
 
  The slowest startup times are for Chinese, German
  and Ukrainian. It would be interesting to find out why.
 
  Regards
  -- Dominique
 
 
  I spent a bit of time finding out why the
  startup time is slow for Ukrainian (uk)
  (~1.80 sec) even though it has only 12 XML rules:
 
    uk |     12 |  1.80 1.90 1.84
 
  First, I disabled the SRX tokenizer by
  commenting out getSentenceTokenizer()
  in src/java/org/languagetool/language/Ukrainian.java.
  This saves about half a second, bringing
  the startup time to:
 
    uk |     12 |  1.31 1.31 1.29
 
  Disabling the SRX tokenizer in Esperanto.java
  also saved half a second at startup for Esperanto.
 
  I find it odd that the SRX file src/resource/segment.srx
  contains the rules for *all* languages. Wouldn't
  it make more sense to have a smaller SRX file
  per language? The class that loads it
  (SRXSentenceTokenizer in
  src/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java)
  knows the language.
 
 But that would go against the SRX standard, which uses a hierarchy of
 processing. Note that for all languages we have a few common rules
 (paragraph splitting etc.), and others could fall back on English, etc. So
 splitting is not actually the problem; it is a feature of SRX to keep
 these things together.
 
 Anyway, the SRX tokenizer uses regular expressions, and if they are
 monsters (too many disjunctions), it can be slow. Optimizing
 the regexps in SRX will probably save more than splitting the file and
 reading the parts separately. Note that for the other SRX languages, we
 don't see much overhead.

I will try to rewrite the Ukrainian section in the SRX file to remove some of the regular expressions.
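
In case it helps with that rewrite, here is a small self-contained illustration of
why a disjunction-heavy pattern gets expensive; the patterns and the text are
invented, not taken from segment.srx:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the patterns and text below are made up, not the real
// segment.srx rules. The point is that matching time grows with the number
// of alternation branches tried at every position of the text.
public class SrxRegexCost {

  static double millisToFindAll(Pattern p, String text) {
    long t0 = System.nanoTime();
    Matcher m = p.matcher(text);
    while (m.find()) {
      // just scan; we only care about the time taken
    }
    return (System.nanoTime() - t0) / 1e6;
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 20000; i++) {
      sb.append("This is sentence number ").append(i).append(". ");
    }
    String text = sb.toString();

    // A handful of abbreviation exceptions, as a small alternation.
    Pattern small = Pattern.compile("(etc|e\\.g|i\\.e|vs|cf)\\. ");

    // A "monster" alternation with hundreds of branches.
    StringBuilder big = new StringBuilder("(");
    for (int i = 0; i < 500; i++) {
      if (i > 0) {
        big.append('|');
      }
      big.append("abbr").append(i);
    }
    big.append(")\\. ");
    Pattern monster = Pattern.compile(big.toString());

    System.out.printf("small: %.1f ms, monster: %.1f ms%n",
        millisToFindAll(small, text), millisToFindAll(monster, text));
  }
}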



Re: [Languagetool] Possible way of speeding up LanguageTool

2012-01-13 Thread Dominique Pellé
Marco A.G.Pinto wrote:

 Hello!

 I don't know how it was coded, but I tried to run a script Daniel told me
 about, and it took a lot of time.
 
 My suggestion is that, when starting LibreOffice, instead of reading from
 the XMLs, the contents would already be in files containing arrays with the
 words.

 Let me give you an example for the English XML file:
 ARRAY:
 Possible Typos
 2
 ARE_STILL_THE_SOME
 IS_EVEN_WORST
 Grammar
 1
 WANT_THAT_I
 #END#

 This is a very simple example of an optimization. For example, it has the
 type of grammar error and, in the next position, the number of entries of
 that type (converting the string to a number). It ends with #END#.


Hi Marco

I can't say that I like it. It would be messy, and XML syntax can be
validated, for example. Anyway, I doubt that it would help to speed things up.

Before doing optimizations, it is always necessary to measure first;
otherwise you might try to optimize something that takes only 1%
of the time in the first place. Measuring also lets you verify objectively,
with numbers, that whatever you change actually helps to speed things up.

After reading your email, I was curious about the startup time (on the
command line, not in LibreOffice), so I created a script to measure
it for all languages:

For each language, the script:

- counts the number of XML rules. I used the latest in SVN, r6239,
  so it can differ slightly from the numbers at
  http://www.languagetool.org/languages/
- measures startup time (3 times to avoid outliers) when
  launching LanguageTool with an empty sentence.

Here is the script: http://dominique.pelle.free.fr/startup-time-lt.sh
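
The script is essentially a timing loop around the LanguageTool command line;
roughly the same idea in Java would look like the sketch below, where the exact
jar invocation is an assumption and should be taken from the script itself:

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

// Rough Java equivalent of the shell script's timing loop. The exact jar
// invocation below is an assumption -- check the script above for the real
// command line; only the timing logic is the point here.
public class StartupTimer {

  static double secondsToRun(List<String> command) throws IOException, InterruptedException {
    long t0 = System.nanoTime();
    Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
    InputStream in = p.getInputStream();
    byte[] buf = new byte[4096];
    while (in.read(buf) != -1) {
      // drain the output so the process can terminate
    }
    p.waitFor();
    return (System.nanoTime() - t0) / 1e9;
  }

  public static void main(String[] args) throws Exception {
    for (String lang : new String[] {"uk", "fr", "de", "zh"}) {
      // Hypothetical invocation; "empty.txt" stands in for a file containing
      // an empty sentence, as the script uses.
      List<String> cmd = Arrays.asList(
          "java", "-jar", "LanguageTool.jar", "-l", lang, "empty.txt");
      for (int run = 0; run < 3; run++) {  // three samples, as in the script
        System.out.printf("%s: %.2f s%n", lang, secondsToRun(cmd));
      }
    }
  }
}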

Here is the result:

$ cd languagetool
$ ./startup-time-lt.sh

lang | #rules | startup time in sec (3 samples)
-----+--------+--------------------------------
 ast |     61 |  0.29 0.26 0.27
  br |    353 |  0.41 0.39 0.40
  ca |    214 |  0.83 0.83 0.83
  zh |    328 |  2.34 2.30 2.28
  da |     22 |  0.82 0.80 0.79
  nl |    336 |  0.88 0.88 0.89
  en |    789 |  0.96 0.95 0.96
  eo |    262 |  0.84 0.85 0.84
  fr |   2015 |  0.51 0.52 0.51
  gl |    157 |  0.96 0.97 0.97
  de |    717 |  1.74 1.91 1.76
  is |     39 |  0.77 0.82 0.81
  it |     94 |  0.28 0.28 0.28
  km |     24 |  0.88 0.84 0.83
  lt |      6 |  0.20 0.21 0.22
  ml |     23 |  0.81 0.78 0.81
  pl |   1029 |  1.17 1.17 1.16
  ro |    452 |  0.99 0.98 0.94
  ru |    149 |  0.95 0.91 0.92
  sk |     58 |  1.00 0.95 0.93
  sl |     86 |  0.80 0.86 0.83
  es |     70 |  0.91 0.85 0.84
  sv |     26 |  0.29 0.26 0.27
  tl |     44 |  0.25 0.26 0.25
  uk |     12 |  1.80 1.90 1.84

This was measured on a 5-year-old laptop
(Linux x86, Intel(R) Core(TM) Duo CPU T2250 @ 1.73 GHz).

What's interesting here is that the startup time
does not depend strongly on the number of XML
rules. Ukrainian (uk) has only 12 XML rules, yet it
is 3.52 times slower than French (fr), which has
2015 rules!

The slowest startup times are for Chinese, German
and Ukrainian. It would be interesting to find out why.

Regards
-- Dominique


Re: [Languagetool] Possible way of speeding up LanguageTool

2012-01-13 Thread Dominique Pellé
Dominique Pellé wrote:

 Marco A.G.Pinto wrote:

 Hello!

 I don't know how it was coded, but I tried to run a script Daniel told me
 about, and it took a lot of time.

 My suggestion is that, when starting LibreOffice, instead of reading from
 the XMLs, the contents would already be in files containing arrays with the
 words.

 Let me give you an example for the English XML file:
 ARRAY:
 Possible Typos
 2
 ARE_STILL_THE_SOME
 IS_EVEN_WORST
 Grammar
 1
 WANT_THAT_I
 #END#

 This is a very simple example of an optimization. For example, it has the
 type of grammar error and, in the next position, the number of entries of
 that type (converting the string to a number). It ends with #END#.


 Hi Marco

 I can't say that I like it. It would be messy, and XML syntax can be
 validated, for example. Anyway, I doubt that it would help to speed things up.

 Before doing optimizations, it is always necessary to measure first;
 otherwise you might try to optimize something that takes only 1%
 of the time in the first place. Measuring also lets you verify objectively,
 with numbers, that whatever you change actually helps to speed things up.

 After reading your email, I was curious about the startup time (on the
 command line, not in LibreOffice), so I created a script to measure
 it for all languages:

 For each language, the script:

 - counts the number of XML rules. I used the latest in SVN, r6239,
   so it can differ slightly from the numbers at
   http://www.languagetool.org/languages/
 - measures startup time (3 times to avoid outliers) when
   launching LanguageTool with an empty sentence.

 Here is the script: http://dominique.pelle.free.fr/startup-time-lt.sh

 Here is the result:

 $ cd languagetool
 $ ./startup-time-lt.sh

 lang | #rules | startup time in sec (3 samples)
 -----+--------+--------------------------------
  ast |     61 |  0.29 0.26 0.27
   br |    353 |  0.41 0.39 0.40
   ca |    214 |  0.83 0.83 0.83
   zh |    328 |  2.34 2.30 2.28
   da |     22 |  0.82 0.80 0.79
   nl |    336 |  0.88 0.88 0.89
   en |    789 |  0.96 0.95 0.96
   eo |    262 |  0.84 0.85 0.84
   fr |   2015 |  0.51 0.52 0.51
   gl |    157 |  0.96 0.97 0.97
   de |    717 |  1.74 1.91 1.76
   is |     39 |  0.77 0.82 0.81
   it |     94 |  0.28 0.28 0.28
   km |     24 |  0.88 0.84 0.83
   lt |      6 |  0.20 0.21 0.22
   ml |     23 |  0.81 0.78 0.81
   pl |   1029 |  1.17 1.17 1.16
   ro |    452 |  0.99 0.98 0.94
   ru |    149 |  0.95 0.91 0.92
   sk |     58 |  1.00 0.95 0.93
   sl |     86 |  0.80 0.86 0.83
   es |     70 |  0.91 0.85 0.84
   sv |     26 |  0.29 0.26 0.27
   tl |     44 |  0.25 0.26 0.25
   uk |     12 |  1.80 1.90 1.84

 This was measured on a 5-year-old laptop
 (Linux x86, Intel(R) Core(TM) Duo CPU T2250 @ 1.73 GHz).

 What's interesting here is that the startup time
 does not depend strongly on the number of XML
 rules. Ukrainian (uk) has only 12 XML rules, yet it
 is 3.52 times slower than French (fr), which has
 2015 rules!

 The slowest startup times are for Chinese, German
 and Ukrainian. It would be interesting to find out why.

 Regards
 -- Dominique


I spent a bit of time finding out why the
startup time is slow for Ukrainian (uk)
(~1.80 sec) even though it has only 12 XML rules:

   uk |     12 |  1.80 1.90 1.84

First, I disabled the SRX tokenizer by
commenting out getSentenceTokenizer()
in src/java/org/languagetool/language/Ukrainian.java.
This saves about half a second, bringing
the startup time to:

   uk |     12 |  1.31 1.31 1.29

Disabling the SRX tokenizer in Esperanto.java
also saved half a second at startup for Esperanto.

I find it odd that the SRX file src/resource/segment.srx
contains the rules for *all* languages. Wouldn't
it make more sense to have a smaller SRX file
per language? The class that loads it
(SRXSentenceTokenizer in
src/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java)
knows the language.
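
Another option that would keep the SRX feature but take it off the startup path
might be lazy initialization, i.e. only parsing segment.srx the first time a
sentence actually needs splitting. A generic sketch of the idea (not
LanguageTool code; the splitting regex is just a stand-in):

import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Generic sketch (not LanguageTool code): defer an expensive resource load,
// such as parsing a large SRX file, until the first time it is needed, so it
// no longer counts against startup time.
public class LazyTokenizer {

  interface SentenceSplitter {
    List<String> split(String text);
  }

  private final Supplier<SentenceSplitter> loader;
  private SentenceSplitter splitter;  // created on first use

  LazyTokenizer(Supplier<SentenceSplitter> loader) {
    this.loader = loader;
  }

  List<String> split(String text) {
    if (splitter == null) {
      splitter = loader.get();  // expensive load happens here, not at startup
    }
    return splitter.split(text);
  }

  public static void main(String[] args) {
    LazyTokenizer tokenizer = new LazyTokenizer(() -> {
      System.out.println("loading segmentation rules (simulated)...");
      return text -> Arrays.asList(text.split("(?<=[.!?])\\s+"));
    });
    // Constructing the tokenizer above was instant; the load runs only now:
    System.out.println(tokenizer.split("One sentence. Another one."));
  }
}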

Then I disabled the Ukrainian tagger (commenting
out getTagger() in src/java/org/languagetool/language/Ukrainian.java).
This saved more than a second, bringing the
startup time to:

   uk |     12 |  0.21 0.22 0.21

I see that the Ukrainian tagger uses a MySpell file
(via the UkrainianMyspellTagger class), which reads and
parses a text file dist/resource/uk/ukrainian.dict
of 1,841,900 bytes, so that's not fast. It also uses
String.matches(regexp) to parse it, which is slow.
Using the Matcher class should be faster, as indicated here:
  http://www.regular-expressions.info/java.html
Anyway, transforming the MySpell file into a binary
dictionary should be faster, as described here:
http://languagetool.wikidot.com/developing-a-tagger-dictionary
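
A self-contained sketch of the first suggestion, compiling the pattern once and
reusing a Matcher instead of calling String.matches() on every line; the entry
format below is made up, not the actual ukrainian.dict layout:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the suggestion above (not the actual UkrainianMyspellTagger code):
// when parsing many lines, compile the pattern once and reuse a Matcher
// instead of calling String.matches(), which recompiles the regex every call.
public class DictParseDemo {

  private static final Pattern ENTRY =
      Pattern.compile("(\\S+)\\s+(\\S+)\\s+(\\S+)");  // made-up entry format

  public static void main(String[] args) {
    String[] lines = new String[200000];
    for (int i = 0; i < lines.length; i++) {
      lines[i] = "слово" + i + " лема" + i + " tag" + i;
    }

    long t0 = System.nanoTime();
    int slowHits = 0;
    for (String line : lines) {
      if (line.matches("(\\S+)\\s+(\\S+)\\s+(\\S+)")) {  // recompiles every time
        slowHits++;
      }
    }
    long t1 = System.nanoTime();

    int fastHits = 0;
    Matcher m = ENTRY.matcher("");
    for (String line : lines) {
      if (m.reset(line).matches()) {  // reuses the precompiled pattern
        fastHits++;
      }
    }
    long t2 = System.nanoTime();

    System.out.printf("String.matches: %d hits in %.0f ms, reused Matcher: %d hits in %.0f ms%n",
        slowHits, (t1 - t0) / 1e6, fastHits, (t2 - t1) / 1e6);
  }
}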

Regards
-- Dominique
