I took a little break from the line break discussion, but now I will try to collect and extend my main points from the various bug comments. My starting point is the approach suggested by Jukka Korpela in his critique of Unicode Standard Annex (UAX) #14: http://www.cs.tut.fi/~jkorpela/unicode/linebr.html

Basically, the generic (language-independent) line breaking rules should be as simple as possible while still respecting the conventions of natural languages. Thus, each character should default to the line breaking behavior that is most likely expected of it in its natural context.

UAX 14 names three principal styles for determining line break opportunities in different scripts:

- Western: spaces and hyphens are used to determine breaks
- East Asian: lines can break anywhere, unless prohibited
- South East Asian: line breaks require morphological analysis

<http://www.unicode.org/reports/tr14/tr14-20.html#BreakOpportunities>

According to UAX 14, the Western and East Asian styles can be unified into a single set of specifications, whereas the South East Asian style requires more complicated, language-dependent hyphenation algorithms. Although I suppose the unified specification alone is not enough to fully cater for the needs of any language, it should be good enough for most cases in Western and East Asian languages. The default behavior of each character could be redefined and refined at the language-dependent level when necessary, but this should be treated as a separate issue, since the language of a document is not always easy to identify.

I'll concentrate on discussing the properties of the Western and especially Latin scripts, since the Asian scripts are beyond my area of expertise. I recognize that some compromises may be necessary in order to make the line breaking system adequate for both Western and East Asian users, but I think we have to start by considering the basis of each tradition independently.

CONVENTIONAL LINE BREAKS IN LATIN SCRIPTS

In Latin scripts, line break opportunities are basically marked with spaces. Additional break opportunities may be marked with hyphens or dashes. Breaking in any other place would generally be unconventional and potentially confusing.
Technically, a break may usually occur only _after_ a character. In some languages, a break may be allowed even before an em-dash, but since this would be unexpected in other language contexts, it should be defined as a language-dependent exception.

There are some special cases where a line break is not desirable even after a space, a hyphen or a dash. However, in most everyday cases the exceptions should be reasonably simple to specify:

A line break is allowed after a space, a hyphen or a dash, unless

(a) the space or hyphen is of the non-breaking type (reasoning: the very idea of a non-breaking character is to prohibit a line break)

(b) the hyphen or dash is adjacent to a space (reasoning: the basic function of a space is to separate two words from each other, so it seems apparent that a hyphen or dash _preceded_ by a space -- as in the expression "suffix -ed" -- is supposed to be a fixed part of the word it is directly connected to)

(c) the hyphen or dash is adjacent to any punctuation (reasoning: combining a hyphen with other punctuation may imply many different kinds of ordinary or exceptional usage -- such as ASCII art -- where it is not desirable to break; however, since two or three adjacent hyphens are often used as a substitute for a single dash, a double hyphen might be considered equivalent to an en-dash and a triple hyphen equivalent to an em-dash, generally allowing a line break after the last hyphen)

(d) there is no more than one alphabetic or symbol character on either side of the hyphen or dash (reasoning: this would improve the typographical appearance by preventing widowed and orphaned characters at the start or end of a line; one might even consider preventing line breaks if there are no more than _two_ characters on either side, or allowing the user to define the best setting in the browser preferences).
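
To make rules (a)-(d) a bit more concrete, here is a rough Python sketch of how they might be checked. It is only an illustration under my own assumptions, not a proposal for the actual layout code: the function names, the character sets and the `min_side` parameter are all hypothetical, "alphabetic or symbol character" is approximated as any letter, digit or symbol according to its Unicode general category, and numeric contexts get no special treatment.

```python
import unicodedata

# Assumption: only the ordinary space offers a break. U+00A0 (no-break
# space) is deliberately left out, which implements rule (a) for spaces.
SPACES = {" "}
# Assumption: hyphen-minus, hyphen, en dash and em dash. U+2011
# (non-breaking hyphen) is left out, which implements rule (a) for hyphens.
DASHES = {"-", "\u2010", "\u2013", "\u2014"}

def is_punct(ch):
    """True if the Unicode general category is some kind of punctuation."""
    return unicodedata.category(ch).startswith("P")

def is_word_char(ch):
    """Approximation: letters, digits and symbols count for rule (d)."""
    return unicodedata.category(ch)[0] in "LNS"

def run_length(text, start, step):
    """Count consecutive word characters from `start`, moving by `step`."""
    n, i = 0, start
    while 0 <= i < len(text) and is_word_char(text[i]):
        n += 1
        i += step
    return n

def break_allowed_after(text, i, min_side=2):
    """Hypothetical check: may a line break follow text[i]?"""
    ch = text[i]
    if ch in SPACES:
        return True
    if ch not in DASHES:
        return False
    # Locate the whole run of dash characters around position i.
    start, end = i, i
    while start > 0 and text[start - 1] in DASHES:
        start -= 1
    while end + 1 < len(text) and text[end + 1] in DASHES:
        end += 1
    if i != end:
        return False  # only the last hyphen of "--" or "---" may break
    run = text[start:end + 1]
    if len(run) > 1 and (len(run) > 3 or set(run) != {"-"}):
        return False  # rule (c): other dash combinations never break
    before = text[start - 1] if start > 0 else ""
    after = text[end + 1] if end + 1 < len(text) else ""
    for neighbour in (before, after):
        # Rules (b) and (c): no break next to a space, punctuation,
        # or the edge of the text.
        if not neighbour or neighbour.isspace() or is_punct(neighbour):
            return False
    # Rule (d): require at least `min_side` characters on both sides.
    return (run_length(text, start - 1, -1) >= min_side
            and run_length(text, end + 1, 1) >= min_side)

def break_opportunities(text):
    """Offsets at which a new line could start, per the rules above."""
    return [i + 1 for i in range(len(text) - 1) if break_allowed_after(text, i)]
```

With this sketch, "well-known fact" offers a break after the hyphen and after the space, "suffix -ed" only after the space, and "e-mail" nowhere. Raising `min_side` to 3 would give the stricter two-character variant mentioned under (d).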
These minimal line breaking rules should cover the most important cases at least for Latin scripts (although I have probably overlooked something, so please feel free to add to the list). A somewhat more detailed set of rules may be needed for numerical contexts, where a hyphen (or sometimes perhaps a dash) is often used as a minus sign. Note that disallowing line breaks altogether adjacent to a numeric character would not produce the desired effect in, for example, long chemical names such as "2-bromo-4,4-dichlorophenol". Further exceptions could be specified at the language-dependent level, or by special "emergency break" rules for very long strings.

Language-dependent additions

Although language-dependent rules go beyond the scope of this discussion, it might be illustrative to consider briefly how the generic rules could be extended. As long as the document defines the language(s) used, it should be fairly easy to apply additional language-dependent rules, for example:

- in English, a line break is allowed both before and after an em-dash, irrespective of how many alphabetic or numeric characters there are on either side

- in French, a line break is not allowed after a space if it is followed by an exclamation mark, a question mark, a colon, a semicolon or a closing guillemet, nor if it is preceded by an opening guillemet (as it is conventional to separate these characters with a space in French typography)

- in Finnish, a line break is not allowed after a space if it is preceded by a hyphen (as there may occur cases such as "koulu- ja kirjastorakennus" -- referring to a combined school and library building -- and the combination of the words "koulu-" and "ja" should not be confused with the plural partitive form "kouluja" -- schools -- which could be hyphenated as "koulu-ja").

Of course one can come up with many more language-dependent rules, but they can be added little by little, as native speakers start to point out deficiencies.
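
As a rough illustration of how such additions might sit on top of the generic rules, here is a small Python sketch covering only the French and Finnish examples (the English em-dash rule is not modelled). The function name, the language codes and the character sets are my own assumptions for the sake of the example.

```python
# Hypothetical character sets for the French rule described above.
FRENCH_NO_BREAK_BEFORE = set("!?:;\u00bb")  # includes the closing guillemet
FRENCH_NO_BREAK_AFTER = {"\u00ab"}          # the opening guillemet

def space_break_allowed(text, i, lang):
    """May a line break follow the ordinary space at text[i] in `lang`?"""
    if text[i] != " ":
        raise ValueError("text[i] must be an ordinary space")
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""
    if lang == "fr":
        # No break between a word and the punctuation mark that French
        # typography sets off with a space.
        return nxt not in FRENCH_NO_BREAK_BEFORE and prev not in FRENCH_NO_BREAK_AFTER
    if lang == "fi":
        # No break after "koulu- " in "koulu- ja kirjastorakennus".
        return prev != "-"
    return True  # generic rule: an ordinary space always allows a break
```

For example, `space_break_allowed("Quoi ?", 4, "fr")` is false, while the same call with `"en"` falls through to the generic rule and is true.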
However, one should consider very carefully the positive and negative effects and the necessity of each additional exception. For example, in the French and Finnish examples above, the undesired breaks can usually be prevented with a no-break space, so basically no special rules should be needed. On the other hand, writing a Unicode character or an HTML entity is often clumsy, and the result can be unpredictable (for example, just a couple of days ago I tried to use some HTML entities when commenting on a blog, but the entity codes ended up showing as regular text), so a plain space may be a safer choice after all.

Perhaps one day the rules may be extended to include even language-specific hyphenation algorithms, but for now, I suppose that's something we can only dream of.

Non-natural languages

Non-natural languages may require special consideration, but basically, they should follow the conventions of natural languages. In a technical notation, such as a URL or a sequence of programming language code, an unconventional line break may actually be even more confusing than in a natural language sentence. Natural languages usually contain a lot of redundancy, in order to make sure that occasional errors or distractions will not distort the whole message. Non-natural languages, however, usually strive for efficiency and depend on the data being interpreted exactly as it is written. Thus, it may be crucial to know whether there is a space between two characters or not, but an unconventional line break would hide this essential detail.

Misunderstanding UAX 14

Unfortunately, UAX 14 tends to obscure the basic line breaking principles for Latin scripts by describing the behavior of various characters in a very complicated way. It is easy to misunderstand UAX 14.
For example, I was stunned when I read (in the third section of Table 1)* that closing punctuation -- such as ')' -- prohibits line breaks before, and that opening punctuation -- such as '(' -- prohibits line breaks after.

*<http://www.unicode.org/reports/tr14/tr14-20.html#Table1>

Since line breaks are not prohibited _after_ a closing parenthesis and _before_ an opening parenthesis, this seemed to imply that they should be allowed. However, it would be absurd to break as in the following examples:

  colo(u)ring

  colo(u)
  ring

  colo
  (u)ring

After some reasoning, and with the help of the explanations found in the (rather long) Section 5.1, I realized that the idea is merely to overrule the default behavior of the nearest enclosed character (which in my examples is "u"), in the case that _it_ allows a line break before or after. These rules say nothing about how to break _outside_ the parentheses, but only how not to break _inside_ them.

Perhaps it is exactly the confusing description in UAX 14 that has tricked even the IE designers into allowing line breaks before and after parentheses (as well as in many other strange situations), regardless of whether there are spaces involved or not. This is definitely not correct in a Latin context (whereas in an East Asian context it may actually be preferable).

LINE BREAKING AT A SLASH

According to the conventional principles of Latin scripts, a slash would not be considered to offer a line break opportunity. Actually, a slash is rather rare in natural language contexts, but there are special expressions that depend on the presupposition that a word cannot be broken at a slash (for example, the abbreviations "c/o" and "s/he" would become more difficult to perceive if they were broken).

The typographical line breaking conventions have been developed over a period of centuries, long before there were computers and URLs to worry about.
Nor, it seems, were file-paths and URLs designed to take into account the typographical issue of how they should be presented in a horizontally limited space. Thus, as computers and the Web have become an important means of communication in our everyday life, it seems that some modifications to the conventional line breaking rules are needed.

When analyzing the structure of a file-path, the most logical line break opportunity seems to be either immediately after or immediately before a slash. However, allowing line breaks indiscriminately at any slash would produce new problems. Thus, break opportunities should be limited to the special cases where they are considered really necessary, i.e., long file-paths and URLs.

Perhaps the most straightforward way to identify breakable file-paths would be to count how many slashes there are in each string, since in natural language expressions there is rarely more than one slash. Even if there are two slashes in a file-path, the string as a whole is often so short that breaking it does not offer any significant typographical improvement. For example, it would be pointless to break a file-path such as "/etc/apt". Therefore, it might be considered reasonable to disallow breaks unless there are at least three slashes in a string.

Even when there are three or more slashes and the string is broken, the reader should be given a hint that something exceptional has happened and that the broken string is actually supposed to be interpreted as a single, continuous entity. Therefore, a break should not be allowed after the first slash. Seeing that there is no space after the first slash should give the reader a hint that perhaps there are no spaces after the other slashes either (although this would be misleading in file-paths and URLs that _end_ with a slash).
Furthermore, if the last part of the string is also a regular word in the context language (as "apt" is a word in English), it may not always be clear whether the part separated by a line break belongs to the string or to the context. Therefore, a break should not be allowed after the last slash (nor after the first), but any other slash might be considered to offer a break opportunity:

  /etc/
  foobar/apt

This way, the presence of slashes on both lines would give the reader a hint that the parts perhaps belong to the same string even though they are separated by an unconventional line break.

However, even this solution leaves room for potential confusion. Sometimes a word is wrapped in slashes as if in parentheses or quotes -- like /this/ -- in order to simulate the appearance of italics. Furthermore, according to the International Phonetic Alphabet, slashes may be used in a similar fashion in order to describe the actual pronunciation of a word. Thus, there may occur cases such as:

  (1) /foobar/ and/or

Now, consider the following file-path:

  (2) /foobar/and/or

If broken before "and", both examples will look exactly the same:

  /foobar/
  and/or

In the first example, a line break after the space that precedes "and" would be perfectly conventional. In the second example, a line break after the second slash would be unconventional and a potential cause for confusion.

Therefore, a possibly better solution (as suggested above by David E. Ross) might be to allow breaks only _before_ a slash. In that case, the second example could be broken in two ways:

  /foobar
  /and/or

  /foobar/and
  /or

This should prevent anybody from confusing a file-path with the special usage of simulating italics or marking pronunciation with slashes. Also, seeing a line beginning with a slash would warn the reader that there is something exceptional in the string, and if there are slashes on the previous line as well, it shouldn't be too hard to conclude that the strings are somehow linked to each other.
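
The break-before-a-slash idea, combined with the three-slash threshold, could be sketched roughly as follows in Python. This is just an illustration of the reasoning above: the function names and the `min_slashes` parameter are hypothetical, the input is assumed to be a single space-free string, and URLs containing "//" would obviously need more care than this.

```python
def slash_break_positions(token, min_slashes=3):
    """Offsets in `token` where a new line could start, before a slash.

    Strings with fewer than `min_slashes` slashes are never broken, and
    no break is offered before the first slash.
    """
    slashes = [i for i, ch in enumerate(token) if ch == "/"]
    if len(slashes) < min_slashes:
        return []
    return slashes[1:]

def candidate_splits(token, min_slashes=3):
    """All two-line renderings of `token` permitted by the rule above."""
    return [(token[:p], token[p:])
            for p in slash_break_positions(token, min_slashes)]
```

Here `candidate_splits("/foobar/and/or")` gives the two renderings shown above, whereas "/etc/apt" and "c/o" yield no break opportunities at all.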

CONCLUSIONS

The default behavior for most Latin characters is to not allow line breaks either before or after, whereas it seems that for most East Asian characters the default behavior is to allow line breaks both before and after. Obviously, if a Latin character is put adjacent to an East Asian character, their default behaviors conflict. There should be a consistent rule on how the conflict is resolved.

Since restricting line breaks appears to be a significant problem in East Asian languages, perhaps it would be reasonable to allow an East Asian character to overrule the default breaking behavior of a Latin character when the two are adjacent. However, when put into a Latin context, even a non-Latin character should rather be treated as a symbol character inherent in Latin scripts (and thus, line breaking would not be allowed), but this can be specified at the language-dependent level.

This approach would not solve the problem that East Asian users expect even words written with Latin characters to break at any punctuation, but I'm afraid that this issue cannot be helped without violating the fundamental logic of Latin scripts.

I have tried to illustrate the conditions of line breaking in Latin scripts and the potential problems caused by overlooking the conventional rules or adding exceptions to them. Each exception should be considered very carefully, because a relative improvement in typographical appearance can hardly be justified if the required adjustments can distort the actual message.

The basic function of the art of typography is to make it easier for the reader to absorb information. If a typographical solution makes the contents more difficult to understand, it is bad typography.

_______________________________________________
dev-tech-layout mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-tech-layout

