[ 
https://issues.apache.org/jira/browse/OPENNLP-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1528:
------------------------------------
    Issue Type: Task  (was: Bug)

> Review Catalan regexp for the ela germinada
> -------------------------------------------
>
>                 Key: OPENNLP-1528
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1528
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Bruno P. Kinoshita
>            Assignee: Bruno P. Kinoshita
>            Priority: Minor
>         Attachments: image-2023-12-11-15-20-31-518.png
>
>
> I posted on Twitter about the issue with the word "ós" found in our tokenizer
> tests, and Joan Montané (unjoanqualsevol on Twitter) replied, pointing out that
> our regexp for Catalan didn't seem right.
> I created this issue so we can test and fix it.
> > Regexp is not fully correct. Catalan written language uses middle dot /
> > interpunct (U+00B7) as inner word character: cel·la, goril·la, instal·lar,
> > cancel·lar,...
> !image-2023-12-11-15-20-31-518.png|width=365,height=429!
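
To make the point about the middle dot concrete, here is a minimal sketch of a
word pattern that keeps U+00B7 inside a word. It is not the regexp currently
used by OpenNLP: the class name CatalanWordSketch, the WORD pattern, and the
words() helper are all hypothetical. It also assumes the dot should only count
as word-internal when it sits between two l's, which is what the examples in
the description (cel·la, goril·la, instal·lar, cancel·lar) suggest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
class CatalanWordSketch {

    // A word is a run of letters, optionally continued by a middle dot
    // (U+00B7) that has an 'l' on both sides, followed by more letters.
    // Assumption: the dot is word-internal only in the l·l digraph.
    private static final Pattern WORD = Pattern.compile(
            "\\p{L}+(?:(?<=[lL])\\u00B7(?=[lL])\\p{L}+)*");

    // Returns every substring of the input that matches WORD, in order.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }
}
```

With this pattern, "cel·la" comes back as one word, while a middle dot that is
not surrounded by l's still splits the text. \p{L} matches precomposed accented
letters such as the "ó" in "ós", but text in decomposed form (base letter plus
combining mark) would need \p{M} added to the pattern or normalization to NFC
first.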



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
