2012/1/14 Marcin Miłkowski <list-addr...@wp.pl>:

>> Disabling the SRX tokenizer in Esperanto.java
>> also saved half a second at startup for Esperanto.
>>
>> I find it odd that SRX file src/resource/segment.srx
>> contains the rules for *all* languages. Wouldn't
>> it make more sense to have a smaller SRX file
>> per language?  The function that loads it
>> (SRXSentenceTokenizer in
>> src/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java)
>> knows the language.
>
> But it would be against the SRX standard, which uses a hierarchy of
> processing. Note that for all languages, we have a few common rules
> (paragraph split etc.) and others could fall back on English etc. So
> splitting is not actually a problem; it is a feature of SRX to have
> these things together.
>
> Anyway, the SRX tokenizer uses regular expressions, and if they are
> monsters (too many disjunctions), it could be slow. Probably optimizing
> the regexps in SRX will result in more saving than splitting and reading
> separately. Note that for other SRX languages, we don't have too much of
> an overhead.

I did an experiment and created a specialized resources/eo/segments.srx
for Esperanto, with the rules for all languages other than Esperanto
removed. But it does not speed anything up, so we can leave it as it is.
The 0.5 sec startup overhead of SRX intrigues me though.

Regarding the MySpell tagger, I replaced the String.matches() calls
with precompiled patterns and Pattern.matcher(...).matches(), and it does
save a bit of time (about 0.25 sec at startup for the Ukrainian language):

Before change:

uk |     12 |  1.80 1.90 1.84

After change:

uk |     12 |  1.55 1.56 1.55

Here is the patch. I'll check it in later:

Index: /home/pel/sb/languagetool/src/java/org/languagetool/tagging/uk/UkrainianMyspellTagger.java
===================================================================
--- /home/pel/sb/languagetool/src/java/org/languagetool/tagging/uk/UkrainianMyspellTagger.java  (revision 6241)
+++ /home/pel/sb/languagetool/src/java/org/languagetool/tagging/uk/UkrainianMyspellTagger.java  (working copy)
@@ -26,6 +26,8 @@
 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;

 import org.languagetool.AnalyzedToken;
 import org.languagetool.AnalyzedTokenReadings;
@@ -68,6 +70,11 @@
       final BufferedReader input = new BufferedReader(new InputStreamReader(
           resourceFile, Charset.forName("UTF-8")));

+      final Pattern pattern1 = Pattern.compile("[abcdefghijklmnop]+");
+      final Pattern pattern2 = Pattern.compile("[ABCDEFGHIJKLMN]+");
+      final Pattern pattern3 = Pattern.compile("[BDFHJLN]+");
+      final Pattern pattern4 = Pattern.compile("[UV]+");
+
       String line;
       while ((line = input.readLine()) != null) {
         line = line.trim();
@@ -80,17 +87,17 @@
           final String flags = wrd[1];
           final List<String> posTags = new ArrayList<String>();

-          if (flags.matches("[abcdefghijklmnop]+")) {
+          if (pattern1.matcher(flags).matches()) {
             posTags.add(IPOSTag.TAG_NOUN);
             if (flags.equals("b")) {
               posTags.add(IPOSTag.TAG_PLURAL);
             }
-          } else if (flags.matches("[ABCDEFGHIJKLMN]+")) {
+          } else if (pattern2.matcher(flags).matches()) {
             posTags.add(IPOSTag.TAG_VERB);
-            if (flags.matches("^[BDFHJLN]+")) {
+            if (pattern3.matcher(flags).matches()) {
               posTags.add(IPOSTag.TAG_REFL);
             }
-          } else if (flags.matches("[UV]+")) {
+          } else if (pattern4.matcher(flags).matches()) {
             posTags.add(IPOSTag.TAG_ADJ);
           }
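
For anyone who wants to try the idea outside the tagger, here is a minimal
standalone sketch of the same change. String.matches(regex) compiles the
regex on every call, so hoisting the compilation out of the loop avoids
repeating that work for each dictionary line. The class name FlagClassifier
and the plain-string tags are hypothetical stand-ins for illustration only
(the real code uses the IPOSTag constants); the four patterns are the ones
from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of LanguageTool.
class FlagClassifier {
  // Compiled once, instead of once per call as String.matches() would do.
  private static final Pattern NOUN_FLAGS = Pattern.compile("[abcdefghijklmnop]+");
  private static final Pattern VERB_FLAGS = Pattern.compile("[ABCDEFGHIJKLMN]+");
  private static final Pattern REFL_FLAGS = Pattern.compile("[BDFHJLN]+");
  private static final Pattern ADJ_FLAGS = Pattern.compile("[UV]+");

  // Returns placeholder tag names (assumed; stand-ins for IPOSTag.TAG_*).
  static List<String> classify(final String flags) {
    final List<String> posTags = new ArrayList<String>();
    // Matcher.matches() requires the whole string to match, like String.matches().
    if (NOUN_FLAGS.matcher(flags).matches()) {
      posTags.add("noun");
      if (flags.equals("b")) {
        posTags.add("plural");
      }
    } else if (VERB_FLAGS.matcher(flags).matches()) {
      posTags.add("verb");
      if (REFL_FLAGS.matcher(flags).matches()) {
        posTags.add("refl");
      }
    } else if (ADJ_FLAGS.matcher(flags).matches()) {
      posTags.add("adj");
    }
    return posTags;
  }

  public static void main(final String[] args) {
    System.out.println(classify("b"));
    System.out.println(classify("BD"));
    System.out.println(classify("UV"));
  }
}
```

Since matches() always tests the whole input, the leading "^" in the old
"^[BDFHJLN]+" check was redundant, which is why the patch can drop it.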

-- Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
