[ https://issues.apache.org/jira/browse/TIKA-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603483#comment-17603483 ]

Nick Burch commented on TIKA-3850:
----------------------------------

The kind of statistical language model used in Tika struggles with very short text. What happens if you feed in a longer block of Spanish-language text?

> Spanish text is incorrectly detected as Galician
> ------------------------------------------------
>
>                 Key: TIKA-3850
>                 URL: https://issues.apache.org/jira/browse/TIKA-3850
>             Project: Tika
>          Issue Type: Bug
>          Components: languageidentifier
>    Affects Versions: 2.4.1
>         Environment: org.apache.tika:tika-langdetect-optimaize:2.4.1
> org.apache.tika:tika-core:2.4.1
>            Reporter: Lenne Hendrickx
>            Priority: Minor
>
> The following Spanish text is incorrectly detected as Galician.
> {noformat}
> Hola! Donde puedo contactar para una garantía?{noformat}
> The es and gl models are loaded into the language detector.
> Language result:
> {noformat}
> language: gl
> score: 0.999995{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)