Status: Started
Owner: [email protected]
CC: [email protected],  [email protected],  [email protected]
Labels: Type-Bug Pri-2 OS-All Area-BrowserBackend Size-Medium i18n

New issue 21142 by [email protected]: Revise ICU 4.2.1's word breaking  
rules to handle segmentation of domain names per our expectation/requirement
http://code.google.com/p/chromium/issues/detail?id=21142

With ICU 4.2.1 in place of ICU 3.8, the TextDatabaseManagerTest.InsertPartial
test fails. LayoutTests/css1/text_properties/text_transform.html also fails
(text-transform: capitalize).

It turned out that the word breaking rules specified in UAX #29
(http://unicode.org/reports/tr29/) changed, and ICU 4.2.1 implements the
new versions of rules WB6 and WB7:

Do not break letters across certain punctuation.

WB6.    ALetter ×       (MidLetter | MidNumLet) ALetter
WB7.    ALetter (MidLetter | MidNumLet) ×       ALetter
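
To make the effect of these two rules concrete, here is a rough Python
sketch. It is a toy model and not ICU's rule engine: str.isalpha() stands
in for ALetter, only the period is modeled as a MidNumLet character, and
the names split_words, MID_CHARS and join_across_mid are made up for
illustration.

```python
# Toy model of rules WB6/WB7. This is NOT ICU's rule engine.
# Assumptions: str.isalpha() approximates ALetter, and MID_CHARS holds
# only the period, the one MidNumLet character this report is about.
MID_CHARS = {"."}


def split_words(text, join_across_mid):
    """Split text into letter runs and single punctuation marks.

    Whitespace is dropped to keep the output short; a real break
    iterator reports it as segments too.
    """
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha():
            current += ch
        elif join_across_mid and ch in MID_CHARS and current and next_is_letter:
            # WB6/WB7: no break on either side of a mid character
            # that sits between two letters.
            current += ch
        else:
            if current:
                tokens.append(current)
                current = ""
            if not ch.isspace():
                tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


print(split_words("www.google.com", join_across_mid=False))
print(split_words("www.google.com", join_across_mid=True))
print(split_words("e.g.", join_across_mid=True))
```

Note that the trailing period in "e.g." is still split off, because no
letter follows it.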


With them, there is no longer a break at the inner period of 'e.g.' and
'i.e.', so 'e.g' and 'i.e' are each treated as a single word, which is why
the layout test fails. The test expects capitalize("e.g.") to be "E.G." but
with the new rules it's "E.g.".  I was about to rebaseline this test, but
the InsertPartial failure is a different story.
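
The capitalize difference can be reproduced with a small Python sketch.
This is only an approximation under stated assumptions: the regex class
[^\W\d_] stands in for ALetter, only the period is treated as a joining
character, and capitalize here is a hypothetical helper, not WebKit's
implementation of text-transform.

```python
import re

# Assumption: [^\W\d_] approximates ALetter; only "." is modeled as a
# character that can join two letters. Both are simplifications.
LETTERS = r"[^\W\d_]+"
BREAK_AT_PERIOD = re.compile(LETTERS)
JOIN_ACROSS_PERIOD = re.compile(LETTERS + r"(?:\." + LETTERS + r")*")


def capitalize(text, word_pattern):
    """Uppercase the first letter of each word found by word_pattern."""
    return word_pattern.sub(
        lambda m: m.group(0)[0].upper() + m.group(0)[1:], text
    )


print(capitalize("e.g.", BREAK_AT_PERIOD))     # what the layout test expects
print(capitalize("e.g.", JOIN_ACROSS_PERIOD))  # what the new rules produce
```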

InsertPartial fails because 'www.google.com' is now treated as a single
word. A query for 'google' does not match it, because only the whole token
'www.google.com' is stored in SQLite's full-text index; none of 'www',
'google', and 'com' is.
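
As a rough illustration of why the lookup misses, here is a Python sketch
that uses a plain dict in place of SQLite's full-text index. The
tokenizers are regex approximations (only the period is modeled as a
joining character), and build_index and the sample document are
hypothetical.

```python
import re

# Assumption: [^\W\d_] approximates ALetter; only "." joins letters.
LETTERS = r"[^\W\d_]+"
BREAK_AT_PERIOD = re.compile(LETTERS)
JOIN_ACROSS_PERIOD = re.compile(LETTERS + r"(?:\." + LETTERS + r")*")


def build_index(documents, word_pattern):
    """Map each lowercased token to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for token in word_pattern.findall(text.lower()):
            index.setdefault(token, set()).add(doc_id)
    return index


docs = {1: "Visit www.google.com today"}  # made-up sample document

joined = build_index(docs, JOIN_ACROSS_PERIOD)
split = build_index(docs, BREAK_AT_PERIOD)

print("google" in joined)  # exact-token lookup against the joined index
print("google" in split)   # same lookup when the domain is split at periods
```

With the joined tokenizer the only entry derived from the domain name is
the whole string, so an exact-token lookup for 'google' finds nothing.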


I took a look at the usage patterns of word break iterators in Chrome (and
SQLite's ICU-based tokenizer) and concluded that it's better to preserve
the ICU 3.8 behavior.  We could do that at run time, but then we'd have
duplicate rule strings in Chrome and SQLite. Even if they were kept in only
one place, it would increase the code size and slow things down (however
slightly). Because we have full control over our copy of ICU, I decided to
modify word.txt in ICU 4.2.1 to restore the ICU 3.8 behavior.




