RE: internationalization assumption

Well, the main issue for internationalization of software is not the character sets with which it was tested. It is in fact trivial today to make an application compliant with Unicode text encoding.

What is more complicated is to make sure that the text will be properly displayed. The main issues that cause most of the problems fall into the following areas:

- dialogs and GUI interfaces need to be resized according to text lengths

- a GUI may have been built with a limited set of fonts, all of them with the same line height for the same point size; if you have to display Thai characters, you'll need a larger line height for the same point size.

- some scripts are not readable at small point sizes, notably Han sinograms or Arabic

- the GUI layout should preferably be mirrored for RTL languages.

- you need to be aware of the BiDi algorithm, and you'll have to manage mixed directions each time you include portions of text from an LTR script within an RTL interface (notably for Hebrew or Arabic): if you ignore this, your application will not insert the BiDi controls needed to properly order the rendered text, notably around mirrored characters such as parentheses. For some variable inclusions in an RTL resource string, you may need to insert a surrounding RLE/PDF pair so that the embedded Latin items display correctly (see the first sketch after this list).

- GUI controls such as input boxes should be properly aligned so that input is performed from the correct side.

- Tabular data may have to be presented with distinct alignments, notably if items are truncated in narrow but extensible columns (traditionally, tabular text items are aligned on the left and truncated on the right, but for Hebrew or Arabic, they should be aligned and truncated in the opposite direction)

- You have to be aware of the variety of scripts that may be used even in a pure RTL interface: a user may need to enter sections of text in another script, most often Latin. You have to consider how these foreign text items will be handled.

- In editable parts of the GUI, mouse selection will be more complex than you think, notably with mixed RTL/LTR scripts.

- You can't assume that all text will be readable with a fixed-width font. Some scripts require using variable-width letters.

- You have to worry about grapheme clusters, notably in Hebrew, Arabic, and nearly all Indic scripts. This is also more complex than you think for Latin, Greek, Cyrillic, Han, Hiragana or Katakana texts. Even with the Latin script, you can't assume that all grapheme clusters will be made of only one character. For various reasons, common texts will be entered using combining characters, without the possibility of making precomposed clusters (this is especially true for modern Vietnamese, which uses multiple diacritics on the same letter); see the counting sketch after this list.

- Text-handling routines that change the presentation of text (such as capitalisation) will not work properly, or will not be reversible: even in the Latin script, some characters exist in only one case. Titlecasing is another issue. Such automated presentation effects should be avoided unless you are aware of the problem.

- Plain-text searches often need to be case-insensitive. This issue is closely related to collation order, which is sensitive to local linguistic conventions and not only to the script used. For example, plain-text search in Hebrew will often need to match text with or without vowel marks (which are combining characters), simply because they are optional in the language; see the matching sketch after this list. When this is used to search and match identifiers such as usernames or filenames, you will have to decide which of these matching options to expose. In addition, there is a lot of legacy text that is not coded with the most accurate Unicode character, simply because it was entered with more restricted input methods or keyboards, or was coded with more restricted legacy charsets (the 'oe' ligature in French is typical: it is absent from ISO-8859-1 and from standard French keyboards, although it is a mandatory character for the language; however, it is present in Windows codepage 1252, and may appear in texts coded with it, because it will be entered through "assisted" editors or word processors that can perform autocorrection of ligatures on the fly).

- GUI keyboard accelerators may not be workable with some scripts: you can't assume that the displayed menu items will contain a matching ASCII letter, so you'll need some way to allow keyboard navigation of the interface. This issue is related to accessibility guidelines: you need to offer a way for users to see which keyboard accelerators they can use to navigate your interface easily. Don't assume that the accelerators chosen for one language will work as well in another language.

- toolbar buttons should avoid icons that embed text elements, unless those text elements are also internationalizable.

- color coding used to add special semantics to text, or even to icons, should be avoided, such as the too common European meanings of Red/Orange/Green.

- Sometimes it will be hard to summarize in a short button label the action it performs. Help tooltips (also internationalizable) will provide a better experience for users when these buttons need to display abbreviations.
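
To illustrate the BiDi point above, here is a minimal Python sketch; the Hebrew resource string and file name are invented for the example, and it uses the directional isolate characters FSI/PDI (U+2068/U+2069), which play the same role as the RLE/PDF embedding pair mentioned above:

    # Wrap a value substituted into an RTL resource string with directional
    # isolates so the embedded Latin item does not disturb the surrounding order.
    FSI = "\u2068"  # FIRST STRONG ISOLATE: direction taken from the wrapped text
    PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes the isolate

    def isolate(value):
        return FSI + value + PDI

    # Hypothetical Hebrew template meaning roughly "file {0} was not found":
    template = "\u05D4\u05E7\u05D5\u05D1\u05E5 {0} \u05DC\u05D0 \u05E0\u05DE\u05E6\u05D0"
    print(template.format(isolate("report.txt")))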
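
For the grapheme cluster issue, a small sketch showing that code-point counts and user-perceived character counts differ; it assumes the third-party Python "regex" module, whose \X pattern matches one extended grapheme cluster:

    import unicodedata
    import regex  # third-party module, not the stdlib "re"

    # Vietnamese syllable entered with combining diacritics (e + circumflex + acute)
    decomposed = "tie\u0302\u0301ng"
    composed = unicodedata.normalize("NFC", decomposed)

    print(len(decomposed))                        # 7 code points
    print(len(composed))                          # 5 code points after NFC
    print(len(regex.findall(r"\X", decomposed)))  # 5 user-perceived characters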
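
For loose (case- and mark-insensitive) matching, a deliberately simplistic sketch: it case-folds and strips combining marks before comparison, which covers accent-insensitive Latin matching and Hebrew with or without vowel points; a real application should use a locale-aware collator instead:

    import unicodedata

    def loose_key(text):
        # Case-fold, decompose, then drop combining marks (category Mn).
        folded = unicodedata.normalize("NFD", text.casefold())
        return "".join(c for c in folded if unicodedata.category(c) != "Mn")

    print(loose_key("Œuvre") == loose_key("œuvre"))                # True
    print(loose_key("re\u0301sume\u0301") == loose_key("résumé"))  # True
    # Hebrew "shalom" with and without vowel points (niqqud are combining marks):
    print(loose_key("\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD")
          == loose_key("\u05E9\u05DC\u05D5\u05DD"))                # True
    print("ß".upper())  # 'SS': case mapping is not reversible, hence folding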

The other internationalization issues are much simpler: date and number formats, and common words like Yes/No/OK/Cancel/Retry/Abort, are easily handled with text resources and common i18n libraries, such as the basic common set of CLDR resources.
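
For instance, a minimal sketch with the Python Babel library, which bundles CLDR data (the locale identifiers here are just examples):

    from datetime import date
    from babel.dates import format_date
    from babel.numbers import format_decimal

    d = date(2003, 7, 14)
    print(format_date(d, locale="en_US"))               # e.g. "Jul 14, 2003"
    print(format_date(d, locale="fr_FR"))               # e.g. "14 juil. 2003"
    print(format_decimal(1234567.89, locale="en_US"))   # "1,234,567.89"
    print(format_decimal(1234567.89, locale="de_DE"))   # "1.234.567,89"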

----- Original Message ----- From: Mike Ayers
For Unicode applications, Latin-1 testing is insufficient, even for internationalization testing. Internationalization tests should verify, at minimum, that characters > U+1000 and <= U+FFFF (basically, all of the BMP) can be used. It is also good to verify support for >= U+10000, or at least determine whether or not it exists for your application. I usually test English and Japanese for BMP conformance. For beyond the BMP, while all the applications I've tested so far have specifically excluded this range, I still have a simple strategy based upon snipping the Deseret text from James Kass' script links page (http://home.att.net/~jameskass/scriptlinks.htm) and using that (thanks, James!).
Note that none of the above refers to localization testing, which still must be done for every supported language-charset combination (this is where Unicode can really pay off by reducing things to one charset per language). Internationalization testing should only determine the ability of your application to handle other languages; it is localization testing that determines whether it actually handles a given language, and would include such things as text entry and display, text conversion, coexistence, etc., as applicable.
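
As a concrete illustration of the ranges described in the quoted message, here is a small sketch (the sample strings are arbitrary) that round-trips a BMP sample above U+1000 and a supplementary-plane Deseret sample through UTF-8:

    bmp_sample = "\u65E5\u672C\u8A9E\u30C6\u30B9\u30C8"  # Japanese, all code points above U+1000
    deseret_sample = "".join(chr(cp) for cp in range(0x10400, 0x10410))  # beyond the BMP

    for label, s in (("BMP", bmp_sample), ("beyond BMP", deseret_sample)):
        encoded = s.encode("utf-8")
        assert encoded.decode("utf-8") == s  # round-trip through the storage encoding
        print(label, len(s), "code points,", len(encoded), "UTF-8 bytes")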
