kinow commented on a change in pull request #399:
URL: https://github.com/apache/opennlp/pull/399#discussion_r776977532



##########
File path: opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
##########
@@ -26,7 +26,7 @@
   private static final Pattern URL_REGEX =
       Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]+");
   private static final Pattern MAIL_REGEX =
-      Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
+      Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

Review comment:
I'd need more time to confirm what the lookbehind is doing, but at first glance I like that the new regular expression understands emails like "[email protected]".

I think the previous regex would capture only "[email protected]", missing the "someone+" part. That can lead to really strange results, depending on how that value is used in models/systems.
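
To make the difference concrete, here is a minimal sketch that runs both patterns from the diff over the same input. The class name `MailRegexDemo`, the helper `firstMatch`, and the address `someone+tag@example.com` are made up for illustration; they are not part of the PR. As far as the regex syntax goes, the `(?<!...)` lookbehind only asserts that the character immediately before the match start is not itself one of the local-part characters, so a match cannot begin in the middle of such a run.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the PR.
class MailRegexDemo {
  // Pattern from the "-" line of the diff: '+' is not in the local-part class.
  static final Pattern OLD_MAIL_REGEX =
      Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");

  // Pattern from the "+" line of the diff: '+' is allowed, and a negative
  // lookbehind stops a match from starting inside a run of local-part chars.
  static final Pattern NEW_MAIL_REGEX =
      Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

  // Returns the first substring matched by the pattern, or null if none.
  static String firstMatch(Pattern pattern, String text) {
    Matcher m = pattern.matcher(text);
    return m.find() ? m.group() : null;
  }

  public static void main(String[] args) {
    // Illustrative input; the address is an assumption, not taken from the PR.
    String text = "contact someone+tag@example.com now";
    System.out.println(firstMatch(OLD_MAIL_REGEX, text)); // tag@example.com
    System.out.println(firstMatch(NEW_MAIL_REGEX, text)); // someone+tag@example.com
  }
}
```

With the old pattern the match starts after the `+`, so the "someone+" prefix is left behind in the text; the new pattern captures the whole address.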





