[ https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636768#comment-16636768 ]
ASF GitHub Bot commented on OPENNLP-1214:
-----------------------------------------

autayeu commented on issue #329: OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence…
URL: https://github.com/apache/opennlp/pull/329#issuecomment-426588520

Given that the goal of this improvement is a speed-up, do you think the test below is realistic? Do you think it applies across other JVMs?

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import opennlp.tools.sentdetect.lang.Factory;

class Scratch {

    private static final int ITERATIONS = 100_000_000;

    // Set-based lookup table, filled from defaultEosCharacters only.
    private static Set<Character> eosCharacters;

    public static void main(String[] args) {
        eosCharacters = new HashSet<>();
        for (char eosChar : Factory.defaultEosCharacters) {
            eosCharacters.add(eosChar);
        }

        // Each run scans a 20-char buffer filled with one character.
        char[] cbuf = new char[20];

        System.out.println("defaultEosCharacters");
        for (char eos : Factory.defaultEosCharacters) {
            Arrays.fill(cbuf, eos);
            testBuffer(cbuf);
        }

        System.out.println("ptEosCharacters");
        for (char eos : Factory.ptEosCharacters) {
            Arrays.fill(cbuf, eos);
            testBuffer(cbuf);
        }

        System.out.println("jpnEosCharacters");
        for (char eos : Factory.jpnEosCharacters) {
            Arrays.fill(cbuf, eos);
            testBuffer(cbuf);
        }
    }

    // Times the array scan and the set lookup on the same buffer.
    private static void testBuffer(char[] cbuf) {
        System.out.println("Testing with: " + new String(cbuf));
        {
            long start = System.currentTimeMillis();
            for (int n = 0; n < ITERATIONS; n++) {
                getPositionsArray(cbuf);
            }
            long duration = System.currentTimeMillis() - start;
            System.out.println("Duration array (ms): " + duration);
        }
        {
            long start = System.currentTimeMillis();
            for (int n = 0; n < ITERATIONS; n++) {
                getPositionsHashset(cbuf);
            }
            long duration = System.currentTimeMillis() - start;
            System.out.println("Duration set (ms): " + duration);
        }
    }

    // Linear search over the char[] of eos characters.
    public static List<Integer> getPositionsArray(char[] cbuf) {
        List<Integer> l = new ArrayList<>();
        char[] eosCharacters = Factory.defaultEosCharacters;
        for (int i = 0; i < cbuf.length; i++) {
            for (char eosCharacter : eosCharacters) {
                if (cbuf[i] == eosCharacter) {
                    l.add(i);
                    break;
                }
            }
        }
        return l;
    }

    // Hash lookup; each contains() call autoboxes the char.
    public static List<Integer> getPositionsHashset(char[] cbuf) {
        List<Integer> l = new ArrayList<>();
        for (int i = 0; i < cbuf.length; i++) {
            if (eosCharacters.contains(cbuf[i])) {
                l.add(i);
            }
        }
        return l;
    }
}
```

```bash
"C:\Program Files\Java\jdk1.8.0_162\bin\java.exe" ....
defaultEosCharacters
Testing with: ....................
Duration array (ms): 16424
Duration set (ms): 25844
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 17498
Duration set (ms): 26696
Testing with: ????????????????????
Duration array (ms): 17948
Duration set (ms): 25391
ptEosCharacters
Testing with: ....................
Duration array (ms): 16975
Duration set (ms): 25442
Testing with: ????????????????????
Duration array (ms): 18012
Duration set (ms): 25529
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 17562
Duration set (ms): 25579
Testing with: ;;;;;;;;;;;;;;;;;;;;
Duration array (ms): 4040
Duration set (ms): 6223
Testing with: ::::::::::::::::::::
Duration array (ms): 3991
Duration set (ms): 6276
Testing with: ((((((((((((((((((((
Duration array (ms): 3980
Duration set (ms): 6185
Testing with: ))))))))))))))))))))
Duration array (ms): 4043
Duration set (ms): 6199
Testing with: ««««««««««««««««««««
Duration array (ms): 3971
Duration set (ms): 8503
Testing with: »»»»»»»»»»»»»»»»»»»»
Duration array (ms): 3960
Duration set (ms): 8587
Testing with: ''''''''''''''''''''
Duration array (ms): 3920
Duration set (ms): 5450
Testing with: """"""""""""""""""""
Duration array (ms): 3931
Duration set (ms): 5396
jpnEosCharacters
Testing with: 。。。。。。。。。。。。。。。。。。。。
Duration array (ms): 3974
Duration set (ms): 8616
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 3908
Duration set (ms): 9276
Testing with: ????????????????????
Duration array (ms): 3953
Duration set (ms): 9278

Process finished with exit code 0
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --------------------------------------------------------------
>
>                 Key: OPENNLP-1214
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1214
>             Project: OpenNLP
>          Issue Type: Improvement
>   Affects Versions: 1.9.0
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.9.1
>
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses linear search to
> check whether each character in the sentence is one of the eos characters.
> I think we'd better use a HashSet to keep eosCharacters instead of a char[].
> In accordance with this replacement, I'd like to deprecate
> getEndOfSentenceCharacters(), because it returns char[] and nobody in
> OpenNLP calls it at present, and I'd like to add an equivalent method that
> returns a Set<Character> of eos chars. The set cannot keep the order of the
> eos chars, but I don't think that is a problem.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
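
For readers comparing the two approaches discussed in this thread, here is a minimal, self-contained sketch of both lookups side by side. It is not the OpenNLP implementation: the class name `EosLookupSketch`, its method names, and the hard-coded `EOS_ARRAY` are hypothetical stand-ins, with the three characters taken from the `defaultEosCharacters` runs in the output above. One thing the sketch makes visible is that `Set<Character>.contains(char)` autoboxes its argument on every call; `Character.valueOf` only caches values up to `\u007F`, which may be part of why the set timings are higher for characters such as `«` and `。`, though that would need profiling to confirm.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EosLookupSketch {

    // Hypothetical stand-in for Factory.defaultEosCharacters.
    static final char[] EOS_ARRAY = {'.', '!', '?'};

    // The same characters held in a hash set.
    static final Set<Character> EOS_SET = new HashSet<>();
    static {
        for (char c : EOS_ARRAY) {
            EOS_SET.add(c);
        }
    }

    // Linear search: compares each input char against every eos char.
    static List<Integer> positionsLinear(CharSequence s) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            for (char eos : EOS_ARRAY) {
                if (s.charAt(i) == eos) {
                    positions.add(i);
                    break;
                }
            }
        }
        return positions;
    }

    // Hash lookup: one contains() call per input char (boxes the char).
    static List<Integer> positionsHashed(CharSequence s) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (EOS_SET.contains(s.charAt(i))) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Hi! How are you? Fine.";
        System.out.println(positionsLinear(text));
        System.out.println(positionsHashed(text));
    }
}
```

Both methods return the same positions for any input; they differ only in per-character cost, which is what the timings above are trying to measure.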