TL;DR: We are replacing Gecko's segmenter code with the ICU4X [*1] segmenter, which is compatible with UAX#14 [*2] and UAX#29 [*3].
Gecko's line/word segmenter was designed before 2000 and is some of the oldest code in Gecko. After it was written, the Unicode Consortium published "UAX#14 - Unicode Line Breaking Algorithm" and "UAX#29 - Unicode Text Segmentation", standards that define segmentation rules covering many languages. Unfortunately, Gecko's segmentation isn't compatible with these standards. Other browser engines (WebKit and Blink) use ICU4C, whose segmentation rules are compatible with them, so this is a web compatibility issue.

Amazon, Google and Mozilla are now working on ICU4X, a set of Rust crates for I18N. Specifically, Ting-Yu Lin and I are working on a new segmenter crate in ICU4X. We have decided to use ICU4X for the new segmenter implementation in Gecko. This will be the first integration of the ICU4X project into Gecko.

Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1719535
Specification: https://www.unicode.org/reports/tr14/ and https://www.unicode.org/reports/tr29/
Standards Body: The Unicode Consortium
Platform coverage: All
Preference: intl.icu4x.segmenter.enabled
DevTools bug: N/A
Other Browsers: shipped
web-platform-tests: https://wpt.fyi/results/css/css-text/line-breaking, https://wpt.fyi/results/css/css-text/i18n

--
Makoto Kato / :m_kato

*1 https://github.com/unicode-org/icu4x/
*2 https://www.unicode.org/reports/tr14/
*3 https://www.unicode.org/reports/tr29/
