[ 
https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636768#comment-16636768
 ] 

ASF GitHub Bot commented on OPENNLP-1214:
-----------------------------------------

autayeu commented on issue #329: OPENNLP-1214: use hash to avoid linear search 
in DefaultEndOfSentence…
URL: https://github.com/apache/opennlp/pull/329#issuecomment-426588520
 
 
   Given that the goal of this improvement is a speedup, do you think the test 
below is realistic? Do you think the results would hold across other JVMs?
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;
   import opennlp.tools.sentdetect.lang.Factory;
   
   class Scratch {
   
     private static final int ITERATIONS = 100_000_000;
   
     private static Set<Character> eosCharacters;
   
     public static void main(String[] args) {
       eosCharacters = new HashSet<>();
       for (char eosChar: Factory.defaultEosCharacters) {
         eosCharacters.add(eosChar);
       }
   
       
       char[] cbuf = new char[20];
   
       System.out.println("defaultEosCharacters");
       for (char eos : Factory.defaultEosCharacters) {
         Arrays.fill(cbuf, eos);
         testBuffer(cbuf);
       }
   
       System.out.println("ptEosCharacters");
       for (char eos : Factory.ptEosCharacters) {
         Arrays.fill(cbuf, eos);
         testBuffer(cbuf);
       }
   
       System.out.println("jpnEosCharacters");
       for (char eos : Factory.jpnEosCharacters) {
         Arrays.fill(cbuf, eos);
         testBuffer(cbuf);
       }
     }
   
     private static void testBuffer(char[] cbuf) {
       System.out.println("Testing with: " + new String(cbuf));
       {
         long start = System.currentTimeMillis();
         for (int n = 0; n < ITERATIONS; n++) {
           getPositionsArray(cbuf);
         }
         long duration = System.currentTimeMillis() - start;
         System.out.println("Duration array (ms): " + duration);
       }
   
       {
         long start = System.currentTimeMillis();
         for (int n = 0; n < ITERATIONS; n++) {
           getPositionsHashset(cbuf);
         }
         long duration = System.currentTimeMillis() - start;
         System.out.println("Duration set (ms): " + duration);
       }
     }
   
     public static List<Integer> getPositionsArray(char[] cbuf) {
       List<Integer> l = new ArrayList<>();
       char[] eosCharacters = Factory.defaultEosCharacters;
       for (int i = 0; i < cbuf.length; i++) {
         for (char eosCharacter : eosCharacters) {
           if (cbuf[i] == eosCharacter) {
             l.add(i);
             break;
           }
         }
       }
       return l;
     }
   
     public static List<Integer> getPositionsHashset(char[] cbuf) {
       List<Integer> l = new ArrayList<>();
       for (int i = 0; i < cbuf.length; i++) {
         if (eosCharacters.contains(cbuf[i])) {
           l.add(i);
         }
       }
       return l;
     }
     
   }
   ```
   
   ```text
   "C:\Program Files\Java\jdk1.8.0_162\bin\java.exe" ....
   defaultEosCharacters
   Testing with: ....................
   Duration array (ms): 16424
   Duration set (ms): 25844
   Testing with: !!!!!!!!!!!!!!!!!!!!
   Duration array (ms): 17498
   Duration set (ms): 26696
   Testing with: ????????????????????
   Duration array (ms): 17948
   Duration set (ms): 25391
   ptEosCharacters
   Testing with: ....................
   Duration array (ms): 16975
   Duration set (ms): 25442
   Testing with: ????????????????????
   Duration array (ms): 18012
   Duration set (ms): 25529
   Testing with: !!!!!!!!!!!!!!!!!!!!
   Duration array (ms): 17562
   Duration set (ms): 25579
   Testing with: ;;;;;;;;;;;;;;;;;;;;
   Duration array (ms): 4040
   Duration set (ms): 6223
   Testing with: ::::::::::::::::::::
   Duration array (ms): 3991
   Duration set (ms): 6276
   Testing with: ((((((((((((((((((((
   Duration array (ms): 3980
   Duration set (ms): 6185
   Testing with: ))))))))))))))))))))
   Duration array (ms): 4043
   Duration set (ms): 6199
   Testing with: ««««««««««««««««««««
   Duration array (ms): 3971
   Duration set (ms): 8503
   Testing with: »»»»»»»»»»»»»»»»»»»»
   Duration array (ms): 3960
   Duration set (ms): 8587
   Testing with: ''''''''''''''''''''
   Duration array (ms): 3920
   Duration set (ms): 5450
   Testing with: """"""""""""""""""""
   Duration array (ms): 3931
   Duration set (ms): 5396
   jpnEosCharacters
   Testing with: 。。。。。。。。。。。。。。。。。。。。
   Duration array (ms): 3974
   Duration set (ms): 8616
   Testing with: !!!!!!!!!!!!!!!!!!!!
   Duration array (ms): 3908
   Duration set (ms): 9276
   Testing with: ????????????????????
   Duration array (ms): 3953
   Duration set (ms): 9278
   
   Process finished with exit code 0
   
   ```
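
One plausible contributor to the slower set timings above is boxing: `Set<Character>.contains` takes an `Object`, so every `char` is boxed through `Character.valueOf`, which is only guaranteed to cache values from `\u0000` to `\u007F`. Characters such as `«` or `。` fall outside that range. The following is a minimal sketch, not the code from the pull request, of a lookup that stays on primitives by using `java.util.BitSet`. The class name `EosLookupSketch`, its methods, and the three sample characters are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: membership test on primitive chars, no Character boxing.
class EosLookupSketch {

  // Builds a bit set with one bit per end-of-sentence char value.
  static BitSet toBitSet(char[] eosChars) {
    BitSet bits = new BitSet();
    for (char c : eosChars) {
      bits.set(c); // char widens to int, used as the bit index
    }
    return bits;
  }

  // Returns the indices in cbuf whose char is marked in the bit set.
  static List<Integer> getPositions(char[] cbuf, BitSet eos) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      if (eos.get(cbuf[i])) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    // Sample characters chosen for illustration, not read from Factory.
    BitSet eos = toBitSet(new char[] {'.', '!', '?'});
    System.out.println(getPositions("Hi! Ok.".toCharArray(), eos)); // prints [2, 6]
  }
}
```

Whether this is actually faster than the array or the `HashSet` would have to be measured in the same harness; the sketch only shows that the lookup can avoid `Character` objects.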



> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --------------------------------------------------------------
>
>                 Key: OPENNLP-1214
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1214
>             Project: OpenNLP
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.9.1
>
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses linear search to 
> check whether each character in the sentence is one of the eos characters. I 
> think we'd better use a HashSet to hold eosCharacters instead of a char[].
> In accordance with this replacement, I'd like to deprecate 
> getEndOfSentenceCharacters() because it returns char[] and nothing 
> in OpenNLP calls it at present, and I'd like to add an equivalent method 
> that returns a Set<Character> of eos chars. The set cannot preserve the order 
> of the eos chars, but I don't think that will be a problem.
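
The change described in the issue can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual OpenNLP patch: the class `SetBackedScannerSketch` and the accessor name `getEosCharacterSet` are hypothetical, and only `getEndOfSentenceCharacters()` is a name taken from the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a scanner that keeps its eos chars in a HashSet.
class SetBackedScannerSketch {

  private final Set<Character> eosCharacters;

  SetBackedScannerSketch(char[] eosChars) {
    Set<Character> set = new HashSet<>();
    for (char c : eosChars) {
      set.add(c);
    }
    this.eosCharacters = Collections.unmodifiableSet(set);
  }

  // Kept for compatibility; the order of the returned chars is unspecified.
  @Deprecated
  char[] getEndOfSentenceCharacters() {
    char[] out = new char[eosCharacters.size()];
    int i = 0;
    for (char c : eosCharacters) {
      out[i++] = c;
    }
    return out;
  }

  // Hypothetical name for the new Set-returning accessor.
  Set<Character> getEosCharacterSet() {
    return eosCharacters;
  }

  // One hash lookup per character instead of a scan over a char[].
  List<Integer> getPositions(CharSequence s) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < s.length(); i++) {
      if (eosCharacters.contains(s.charAt(i))) {
        positions.add(i);
      }
    }
    return positions;
  }
}
```

If the original order of the eos chars ever mattered, a `LinkedHashSet` would preserve insertion order at a small memory cost.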


