Status: Started
Owner: [email protected]
CC: [email protected], [email protected], [email protected]
Labels: Type-Bug Pri-2 OS-All Area-BrowserBackend Size-Medium i18n
New issue 21142 by [email protected]: Revise ICU 4.2.1's word breaking rules to handle segmentation of domain names per our expectation/requirement
http://code.google.com/p/chromium/issues/detail?id=21142

With ICU 4.2.1 in place of ICU 3.8, the TextDatabaseManagerTest.InsertPartial test fails. LayoutTests/css1/text_properties/text_transform.html also fails (text-transform: capitalize).

It turned out that the word breaking rules specified in UTR 29 (http://unicode.org/reports/tr29/) changed, and ICU 4.2.1 implements the new rules (#6 and #7):

Do not break letters across certain punctuation.
WB6. ALetter × (MidLetter | MidNumLet) ALetter
WB7. ALetter (MidLetter | MidNumLet) × ALetter

With these rules, 'e.g.' and 'i.e.' are each treated as a single word, which is why the layout test fails. The test expects capitalize("e.g.") to be "E.G.", but with the new rules the result is "E.g.". I was about to rebaseline this test, but the InsertPartial failure is a different story.

InsertPartial fails because 'www.google.com' is treated as a single word. A search for 'google' does not match 'www.google.com' because the whole string 'www.google.com' is stored in SQLite's full-text (ft) index, while none of 'www', 'google', and 'com' is.

I took a look at the usage patterns of word break iterators in Chrome (and SQLite's ICU-based tokenizer) and concluded that it's better to preserve the ICU 3.8 behavior. We could do that at run time, but then we'd have duplicate rule strings in Chrome and SQLite. Even if the rule string were kept in only one place, it would increase the code size and slow things down, however slightly. Because we have full control over our copy of ICU, I decided to modify word.txt in ICU 4.2.1 to restore the ICU 3.8 behavior.
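
To make the effect of WB6 and WB7 on the two failing tests concrete, here is a minimal Python sketch. It is not ICU and does not use ICU's rule syntax or word.txt; it is a simplified model that treats any alphabetic character as ALetter and uses a small, assumed set of characters ("." and "'") as a stand-in for MidLetter | MidNumLet. The names word_spans, capitalize and MID_CHARS are hypothetical and exist only for this illustration.

```python
# Simplified model of word segmentation with and without WB6/WB7.
# Assumption: "." and "'" stand in for the MidLetter | MidNumLet classes.
MID_CHARS = {".", "'"}


def word_spans(text, apply_wb6_wb7):
    """Return (start, end) spans of words made of letters.

    With apply_wb6_wb7=True, a character from MID_CHARS that sits
    between two letters does not end the word (the WB6/WB7 behavior).
    With False, every non-letter ends the word (roughly the older
    behavior the report wants to preserve).
    """
    spans = []
    start = None
    for i, ch in enumerate(text):
        if ch.isalpha():
            if start is None:
                start = i
            continue
        joins = (
            apply_wb6_wb7
            and start is not None
            and ch in MID_CHARS
            and i + 1 < len(text)
            and text[i + 1].isalpha()
        )
        if not joins and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(text)))
    return spans


def capitalize(text, apply_wb6_wb7):
    """Uppercase the first letter of every word found by word_spans."""
    chars = list(text)
    for start, _ in word_spans(text, apply_wb6_wb7):
        chars[start] = chars[start].upper()
    return "".join(chars)


def tokens(text, apply_wb6_wb7):
    """Set of words that a tokenizer would hand to a full-text index."""
    return {text[s:e] for s, e in word_spans(text, apply_wb6_wb7)}


if __name__ == "__main__":
    for flag in (False, True):
        print(flag, capitalize("e.g.", flag), sorted(tokens("www.google.com", flag)))
```

Under this model, with the rules off the domain name is indexed as three separate tokens, so a lookup for 'google' finds it; with the rules on, the only indexed token is the full host name, which mirrors the InsertPartial failure described in the report.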
