Detection confidence on non-sensical text changes with different line separators

Max Rumford Fri, 12 May 2023 08:05:14 -0700

Hello there!

I ran into an issue with language detection on Windows versus UNIX systems. Text that contains no actual language receives vastly different confidence scores depending on its line separators: separated by carriage return and line feed (cr/lf), it scores very differently from the same text separated by line feed (lf) only. In the example below, the lf-only variant receives a lower score and is also detected as a different language than the cr/lf variant.

cr/lf detected "en" with a confidence score of 0.428568

lf detected "tl" with a confidence score of 0.285737

The text in question goes as follows:

67 IE 6 / IL TH 0 :XVd +» 68 ZI / ILZR O0 :NOA
N3SDNITNS ZEZLZ-A » LI ISTIXAIH

Mm) u

NA YNNHNN3U IH, VAN Any MAN
! +
Annas wval INMMD!LUEAY HUN Ay

u

NAYHBN yon wasuy n? 132 2799047
ana um IAMSIN3Y 2.P ° maland?ny ZHP!N
ar 34217 1.24 0839

u Sant An aalamy) M

gay po he ng 2007 mau WM

| 97m?

139274 Wasun my Ha-T1I3!CHRS THANLF
„oem WM MAHNON 370% yo 20 hin 14N0494L

4104 439 WAS IN SYR I CAaNAZNINFAL
na ?4N oToAlugy yrd 2.449339 TAI

hsldbh vn SOH9WN
4 / 0 9 7 ! Se WOA IYPLUDEN] Dumas “uayDt>Z ur

WOA JUDWUYDEN 3IUJ (Uay27Z asy]

Yynyudn 07004
01670) Ada]

H MALEHH2AN2YN
1m 01097. PU

4y asrarh

QHYIMANVT
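For anyone who wants to reproduce the comparison, here is a minimal Python sketch of how the two variants can be generated from the same text, and how normalizing to lf before detection might serve as a workaround. The helper names `to_lf` and `to_crlf` are hypothetical, the sample is just a few lines of the text above, and the detector call is deliberately left out since it depends on the library and binding in use.

```python
def to_lf(text: str) -> str:
    """Convert CR/LF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def to_crlf(text: str) -> str:
    """Convert every line ending to CR/LF.

    Normalizing to LF first avoids turning an existing CR/LF into CR/CR/LF.
    """
    return to_lf(text).replace("\n", "\r\n")


# A few lines taken from the text above, used only as an illustration.
sample = "Mm) u\n\nNA YNNHNN3U IH, VAN Any MAN\n! +\n"

lf_variant = to_lf(sample)      # what a UNIX system would feed the detector
crlf_variant = to_crlf(sample)  # what a Windows system would feed the detector

# Both variants carry the same content once the separators are normalized.
assert to_lf(crlf_variant) == lf_variant
```

Feeding `lf_variant` and `crlf_variant` to the detector separately should show whether the scores diverge. Passing all input through `to_lf` first would at least make the result independent of the platform.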

I'm currently unable to create a Jira account and thus cannot file a bug ticket. I would gladly have my account request, which was automatically declined earlier, accepted.
Best regards
Max
