kinow commented on code in PR #516:
URL: https://github.com/apache/opennlp/pull/516#discussion_r1125635587


##########
opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java:
##########
@@ -30,10 +30,27 @@ public class Factory {
   private static final Pattern PORTUGUESE = Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");
   private static final Pattern FRENCH = Pattern.compile("^[a-zA-Z0-9àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ]+$");
 
-  // For reference: https://www.sttmedia.com/characterfrequency-dutch
+  // From: https://www.sttmedia.com/characterfrequency-dutch
   private static final Pattern DUTCH = Pattern.compile("^[A-Za-z0-9äöüëèéïijÄÖÜËÉÈÏIJ]+$");
-  private static final Pattern GERMAN = Pattern.compile("^[A-Za-z0-9äöüÄÖÜß]+$");
 
+  // Note: The extra é and É are included to cover German "Lehnwörter" such as "Café"
+  private static final Pattern GERMAN = Pattern.compile("^[A-Za-z0-9äéöüÄÉÖÜß]+$");
+
+  // From: https://en.wikipedia.org/wiki/Polish_alphabet
+  //       https://pl.wikipedia.org/wiki/Alfabet_polski
+  private static final Pattern POLISH = Pattern.compile("^[A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ]+$");

Review Comment:
   Just my OCD here, @mawiesne, but could we keep the same order for the lower-case and upper-case letters? :grimacing:

   s/A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ/A-Za-z0-9żźćąśęłóńŻŹĆĄŚĘŁÓŃ/ (I was reading the upper case as "alphanum and Z Z Caselon", which seemed an easy way to memorize it, so I used the same order for the lower-case chars too. We can change it if another order makes more sense.)
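
   The order of letters inside a character class should not affect what the pattern matches, so the suggested reordering is meant to be purely cosmetic. A minimal sketch to check that, assuming a hypothetical `PolishPatternOrderCheck` class that is not part of this PR; the two patterns are copied from the diff and from the suggestion above:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR: compares the two Polish patterns.
public class PolishPatternOrderCheck {

  // Character class as it appears in the PR diff.
  static final Pattern ORIGINAL =
      Pattern.compile("^[A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ]+$");

  // Same letters, with the lower-case ones reordered to mirror the upper-case ones.
  static final Pattern REORDERED =
      Pattern.compile("^[A-Za-z0-9żźćąśęłóńŻŹĆĄŚĘŁÓŃ]+$");

  // True if both patterns agree on whether the token matches.
  static boolean sameResult(String token) {
    return ORIGINAL.matcher(token).matches() == REORDERED.matcher(token).matches();
  }

  public static void main(String[] args) {
    // Illustrative sample tokens, chosen for this sketch only.
    String[] samples = {"żółć", "ŁÓDŹ", "Gdańsk2023", "Café", "straße"};
    for (String s : samples) {
      System.out.println(s + " -> same result: " + sameResult(s));
    }
  }
}
```

   Tokens with letters outside the Polish set (for example `é` or `ß`) are rejected by both variants, so either ordering can be used without touching the tests.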


