[ https://issues.apache.org/jira/browse/OPENNLP-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martin Wiesner updated OPENNLP-1528:
------------------------------------
    Issue Type: Task  (was: Bug)

> Review Catalan regexp for the ela germinada
> -------------------------------------------
>
>                 Key: OPENNLP-1528
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1528
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Bruno P. Kinoshita
>            Assignee: Bruno P. Kinoshita
>            Priority: Minor
>         Attachments: image-2023-12-11-15-20-31-518.png
>
>
> I posted on Twitter about the issue with the word "ós" found in our tokenizer
> tests, and Joan Montané (unjoanqualsevol on Twitter) replied, pointing out that
> our regexp for Catalan didn't seem right.
> I created this issue so we can test & fix it.
> > Regexp is not fully correct. Catalan written language uses middle dot /
> > interpunct (U+00B7) as inner word character: cel·la, goril·la, instal·lar,
> > cancel·lar,...
> !image-2023-12-11-15-20-31-518.png|width=365,height=429!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
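
To make the point about the middle dot concrete, here is a minimal sketch of a word pattern that keeps U+00B7 inside a word when it sits between two "l" letters. It is an illustration only: the pattern and the names CATALAN_WORD and catalan_words are hypothetical, and this is not the regexp used in OpenNLP nor the fix proposed for this issue.

```python
import re

# Hypothetical pattern, NOT the one used in OpenNLP.
# [^\W\d_] matches a single Unicode letter (a word character that is
# neither a digit nor an underscore). A word is a run of letters that
# may continue across a middle dot (U+00B7) only when the dot has an
# "l" on both sides, as in the ela geminada.
CATALAN_WORD = re.compile(
    r"[^\W\d_]+(?:(?<=[lL])\u00B7(?=[lL])[^\W\d_]+)*"
)


def catalan_words(text):
    """Return the words found in text, keeping l·l sequences intact."""
    return CATALAN_WORD.findall(text)


if __name__ == "__main__":
    print(catalan_words("Cal instal·lar el goril·la a la cel·la."))
```

Restricting the dot to the l·l context is a design assumption of this sketch; a looser pattern could accept U+00B7 between any two letters, at the cost of joining tokens that only happen to be separated by a middle dot.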