[GitHub] [opennlp] jzonthemtn commented on a change in pull request #355: OPENNLP-1266 -- Limit regexes in UrlCharSequenceNormalizer

GitBox Tue, 02 Jun 2020 07:49:15 -0700


jzonthemtn commented on a change in pull request #355:
URL: https://github.com/apache/opennlp/pull/355#discussion_r433935594




##########
File path: opennlp-tools/src/test/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizerTest.java
##########
@@ -44,4 +44,15 @@ public void normalizeEmail() throws Exception {
         "asdf   2nnfdf  ", normalizer.normalize("asdf [email protected]" +
             " 2nnfdf [email protected]"));
   }
+

Review comment:
I think this is a good change, but I do worry about limiting the length
of the URL in the regex. What if we added an argument to the
`UrlCharSequenceNormalizer` constructor so the limit becomes an option for the
user? That way the user can choose the trade-off between speed and potentially
missing URLs, and there is no risk of changing the expected behavior of
OpenNLP language detector applications already out in the wild. Thoughts?
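
To make the suggestion concrete, here is a minimal sketch of what an opt-in limit could look like. It is not the actual `UrlCharSequenceNormalizer` code: the class name `ConfigurableUrlNormalizer`, the `maxUrlLength` parameter, and the simplified URL character class are all assumptions made for illustration. The no-argument constructor keeps the unbounded regex, so existing callers would see no change.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical sketch only; NOT the real OpenNLP UrlCharSequenceNormalizer.
 * Shows a constructor argument that lets the caller opt in to a bounded regex.
 */
final class ConfigurableUrlNormalizer {

  // Illustrative set of characters allowed after the scheme (an assumption,
  // not the character class used by OpenNLP).
  private static final String URL_CHARS = "[-_.?&~;+=/#0-9A-Za-z]";

  private final Pattern urlPattern;

  /** Default: unbounded matching, i.e. the behavior before any limit is applied. */
  ConfigurableUrlNormalizer() {
    this(0);
  }

  /**
   * @param maxUrlLength maximum number of characters matched after the scheme;
   *                     0 means "no limit" (hypothetical convention)
   */
  ConfigurableUrlNormalizer(int maxUrlLength) {
    if (maxUrlLength < 0) {
      throw new IllegalArgumentException("maxUrlLength must be >= 0");
    }
    // A bounded quantifier caps how far the regex engine scans per match.
    String quantifier = maxUrlLength == 0 ? "+" : "{1," + maxUrlLength + "}";
    this.urlPattern = Pattern.compile("https?://" + URL_CHARS + quantifier);
  }

  /** Replaces every matched URL with a single space. */
  CharSequence normalize(CharSequence text) {
    return urlPattern.matcher(text).replaceAll(" ");
  }
}
```

With a limit in place, a URL longer than the limit is only partially removed, which is the "potentially missing URLs" side of the trade-off: `new ConfigurableUrlNormalizer(5).normalize("see http://example.com/a now")` strips `http://examp` and leaves `le.com/a` in the text, whereas the no-argument constructor strips the whole URL.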




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
