Re: Line Breaking

lisika Sat, 04 Aug 2007 06:30:05 -0700

I took a little break from the line break discussion, but now I will
try to collect and extend my main points from the various bug
comments. My starting point is the approach suggested by Jukka Korpela
in his critique of Unicode Standard Annex (UAX) #14:
http://www.cs.tut.fi/~jkorpela/unicode/linebr.html


Basically, the generic (language-independent) line breaking rules
should be as simple as possible while at the same time trying to
respect the conventions of natural languages. Thus, each character
should default to the kind of line breaking that is most likely
expected of it in its natural context.

UAX 14 names three principal styles to determine line break
opportunities in different scripts:

- Western: spaces and hyphens are used to determine breaks
- East Asian: lines can break anywhere, unless prohibited
- South East Asian: line breaks require morphological analysis
<http://www.unicode.org/reports/tr14/tr14-20.html#BreakOpportunities>

According to UAX 14, the Western and East Asian styles can be unified
into a single set of specifications, whereas the South East Asian
style requires more complicated, language-dependent hyphenation
algorithms. Although, I suppose, the unified specification alone is
not enough to fully cater for the needs of any language, it should be
good enough for most cases in Western and East Asian languages. The
default behavior of each character could be redefined and refined at
the language-dependent level when necessary, but this should be
treated as a separate issue, since the language of a document is not
always easy to identify.

I'll concentrate on discussing the properties of the Western and
especially Latin scripts, since the Asian scripts are beyond my area
of expertise. I recognize that some compromises may be necessary in
order to make the line breaking system adequate for both the Western
and East Asian users, but I think we have to start by considering the
basis of each tradition independently.


CONVENTIONAL LINE BREAKS IN LATIN SCRIPTS

In Latin scripts, line break opportunities are basically marked with
spaces. Additional break opportunities may be marked with hyphens or
dashes. Breaking in any other place would generally be unconventional
and potentially confusing.

Technically, a break may usually occur only _after_ a character. In
some languages, a break may be allowed even before an em-dash, but
since this would be unexpected in other language contexts, it should
be defined as a language-dependent exception.

There are some special cases where a line break is not desirable even
after a space, a hyphen or a dash. However, in most everyday cases the
exceptions should be reasonably simple to specify:

A line break is allowed after a space, a hyphen or a dash, unless

(a) the space or hyphen is of the non-breaking type (reasoning: the
very idea of a non-breaking character is to prohibit line break)

(b) the hyphen or dash is adjacent to a space (reasoning: the basic
function of a space is to separate two words from each other, so it
seems apparent that a hyphen or dash _preceded_ by a space -- as in
the expression "suffix -ed" -- is supposed to be a fixed part of the
word it is directly connected to)

(c) the hyphen or dash is adjacent to any punctuation (reasoning:
combining a hyphen with other punctuation may imply many different
kinds of ordinary or exceptional usage -- such as ASCII art -- where
it is not desirable to break; however, since two or three adjacent
hyphens were often used as a substitute for a single dash, a double
hyphen might be considered equivalent to an en-dash and a triple
hyphen equivalent to an em-dash, generally allowing a line break after
the last hyphen)

(d) there is no more than one alphabetic or symbol character on either
side of the hyphen or dash (this would improve the typographical
appearance by preventing widowed and orphaned characters at the start
or end of a line; one might even consider preventing line breaks if
there are no more than _two_ characters on either side, or allowing
the user to define the best setting in the browser preferences).

These minimal line breaking rules should cover the most important
cases, at least for Latin scripts (although I have probably overlooked
something; please feel free to add to the list).
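
Rules (a) to (d) above can be sketched in a few lines of Python. This
is only an illustration of the idea, not a proposal for the actual
implementation: the function names, the set of characters treated as
breakable hyphens and dashes, and the min_run parameter (the minimum
number of characters required on each side, rule (d)) are my own
assumptions. Runs of more than three hyphens are treated as ASCII art
and never broken.

```python
import unicodedata

# Breakable hyphens and dashes (an assumed set). U+2011 NON-BREAKING
# HYPHEN is deliberately left out, which takes care of rule (a).
HYPHENS_DASHES = {"-", "\u2010", "\u2013", "\u2014"}


def _is_punctuation(ch: str) -> bool:
    return unicodedata.category(ch).startswith("P")


def _alnum_run(text: str, start: int, step: int) -> int:
    """Count consecutive alphanumeric characters from start, moving by step."""
    count, i = 0, start
    while 0 <= i < len(text) and text[i].isalnum():
        count += 1
        i += step
    return count


def _break_after(text: str, i: int, min_run: int) -> bool:
    ch = text[i]
    if ch == " ":
        return True  # rule (a): U+00A0 is a different character
    if ch not in HYPHENS_DASHES:
        return False
    first = last = i
    if ch == "-":
        # Treat "--" and "---" as a single dash; break only after the last one.
        while first > 0 and text[first - 1] == "-":
            first -= 1
        while last + 1 < len(text) and text[last + 1] == "-":
            last += 1
        if last - first + 1 > 3 or i != last:
            return False
    if first == 0 or last == len(text) - 1:
        return False
    left, right = text[first - 1], text[last + 1]
    if left.isspace() or right.isspace():
        return False  # rule (b)
    if _is_punctuation(left) or _is_punctuation(right):
        return False  # rule (c)
    # rule (d): enough characters on both sides
    return (_alnum_run(text, first - 1, -1) >= min_run
            and _alnum_run(text, last + 1, 1) >= min_run)


def break_opportunities(text: str, min_run: int = 2) -> list[int]:
    """Return the offsets at which a new line may start."""
    return [i + 1 for i in range(len(text) - 1)
            if _break_after(text, i, min_run)]


print(break_opportunities("well-known fact"))  # [5, 11]
print(break_opportunities("suffix -ed"))       # [7]
print(break_opportunities("e-mail"))           # []
```

Note that with min_run set to 2 this sketch refuses to break next to a
single digit, so numerical contexts would still need separate rules.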

A somewhat more detailed set of rules may be needed for numerical
contexts, where a hyphen (or sometimes perhaps a dash) is often used
as a minus sign. Note that disallowing line breaks altogether adjacent
to a numeric character would not produce a desired effect for example
in long chemical names, such as "2-bromo-4,4-dichlorophenol".

Further exceptions could be specified at the language-dependent level,
or by special "emergency break" rules for very long strings.


Language-dependent additions

Although language-dependent rules go beyond the scope of this
discussion, it might be illustrative to consider briefly how the
generic rules could be extended. As long as the document declares the
language(s) used, it should be fairly easy to apply language-dependent
additional rules, for example:

- in English, a line break is allowed both before and after an
em-dash, irrespective of how many alphabetic or numeric characters
there are on either side

- in French, a line break is not allowed after a space if it is
followed by an exclamation mark, a question mark, a colon, a semicolon
or a closing guillemet, nor if it is preceded by an opening guillemet
(as it is conventional to separate these characters with a space in
French typography)

- in Finnish, a line break is not allowed after a space if it is
preceded by a hyphen (as there may occur cases such as "koulu- ja
kirjastorakennus" -- referring to a combined school and library
building -- and the combination of the words "koulu-" and "ja" should
not be confused with the plural partitive form "kouluja" -- schools --
which could be hyphenated as "koulu-ja").
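
As a rough sketch of how such additions could sit on top of the
generic rules, the French and Finnish examples might be expressed as
vetoes on the break after a plain space. The function name and the
use of the language codes "fr" and "fi" as a parameter are assumptions
made for illustration only; the English em-dash rule is not covered
here.

```python
# Characters that French typography separates from the word with a space.
FRENCH_NO_BREAK_BEFORE = set("!?:;\u00bb")   # includes the closing guillemet
FRENCH_NO_BREAK_AFTER = {"\u00ab"}           # opening guillemet


def space_break_allowed(text: str, i: int, lang: str) -> bool:
    """Return True if a line break is allowed after the space at text[i]."""
    if text[i] != " ":
        raise ValueError("text[i] must be a plain space")
    prev_ch = text[i - 1] if i > 0 else ""
    next_ch = text[i + 1] if i + 1 < len(text) else ""
    if lang == "fr":
        return (next_ch not in FRENCH_NO_BREAK_BEFORE
                and prev_ch not in FRENCH_NO_BREAK_AFTER)
    if lang == "fi":
        return prev_ch != "-"   # keep "koulu-" and "ja" on the same line
    return True                 # generic rule: a space allows a break


print(space_break_allowed("Quoi ?", 4, "fr"))                      # False
print(space_break_allowed("koulu- ja kirjastorakennus", 6, "fi"))  # False
print(space_break_allowed("koulu- ja kirjastorakennus", 9, "fi"))  # True
```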

Of course one can come up with many more language-dependent rules, but
they can be added little by little, as native speakers start to point
out deficiencies. However, one should consider very carefully the positive
and negative effects and the necessity of each additional exception.
For example, in the French and Finnish examples above, the undesired
breaks can usually be prevented with a no-break space, so basically no
special rules should be needed. On the other hand, writing a Unicode
character or an HTML entity is often clumsy, and the result can be
unpredictable (for example, just a couple of days ago I tried to use
some HTML entities when commenting on a blog, but the entity codes
ended up showing as regular text), so a plain space may be a safer
choice after all.

Perhaps one day the rules may be extended to include even
language-specific hyphenation algorithms, but for now, I suppose
that's something we can only dream of.


Non-natural languages

Non-natural languages may require special consideration, but
basically, they should follow the conventions of natural languages. In
a technical notation, such as a URL or a sequence of programming
language code, an unconventional line break may actually be even more
confusing than in a natural language sentence. Natural languages
usually contain a lot of redundancy, in order to make sure that
occasional errors or distractions will not distort the whole message.
Non-natural languages, however, usually strive for efficiency and
depend on the data to be interpreted exactly as it is written. Thus,
it may be crucial to know whether there is a space between two
characters or not, but an unconventional line break would hide this
essential detail.


Misunderstanding UAX 14

Unfortunately, UAX 14 tends to obscure the basic line breaking
principles for Latin scripts by describing the behavior of various
characters in a very complicated way. It is easy to misunderstand UAX
14. For example, I was stunned when I read (in the third section of
Table 1)* that closing punctuation -- such as ')' -- prohibits line
breaks before, and that opening punctuation -- such as '(' --
prohibits line breaks after.
*<http://www.unicode.org/reports/tr14/tr14-20.html#Table1>

Since line breaks were not prohibited _after_ a closing parenthesis
and _before_ an opening parenthesis, this seemed to imply that they
should be allowed. However, it would be absurd to break as in the
following examples:

colo(u)ring

colo(u)
ring

colo
(u)ring

After some reasoning, and with the help of the explanations found in
the (rather long) Chapter 5.1, I realized that the idea is merely to
overrule the default behavior of the nearest enclosed character (which
in my examples is "u"), in the case that _it_ allows a line break
before or after. These rules say nothing about how to break
_outside_ the parentheses, only how not to break _inside_ them.

Perhaps it is exactly the confusing description in UAX 14 that has
tricked even the IE designers into allowing line breaks before and
after parentheses (as well as in many other strange situations),
regardless of whether there are spaces involved or not. This is
definitely not
correct in a Latin context (whereas in an East Asian context it may
actually be preferable).


LINE BREAKING AT A SLASH

According to the conventional principles of Latin scripts, a slash
would not be considered to offer a line break opportunity. Actually, a
slash is rather rare in natural language contexts, but there are
special expressions that depend on the presupposition that a word
cannot be broken at a slash (for example, the abbreviations "c/o" and
"s/he" would become more difficult to perceive if they were broken).

The typographical line breaking conventions have been developed over a
period of centuries, long before there were computers and URLs to
worry about. Neither, it seems, were file-paths and URLs designed to
take into account the typographical issue of how they should be
presented in a horizontally limited space. Thus, as computers and the
Web have become an important means of communication in our everyday
life, it seems that some modifications to the conventional line
breaking rules are needed.

When analyzing the structure of a file-path, the most logical line
break opportunity seems to be either immediately after or immediately
before a slash. However, allowing line breaks indiscriminately at any
slash would produce new problems. Thus, break opportunities should be
limited to the special cases where they are really necessary, i.e.,
long file-paths and URLs.

Perhaps the most straightforward way to identify breakable file-paths
would be to count how many slashes there are in each string, since in
natural language expressions there is rarely more than one slash.
Even if there are two slashes in a file-path, the string as a whole is
often so short that breaking it does not offer any significant
typographical improvement. For example, it would be pointless to break
a file-path such as "/etc/apt". Therefore, it might be considered
reasonable to disallow breaks unless there are at least three slashes
in a string.

Even when there are three or more slashes and the string is broken,
the reader should be given a hint that something exceptional has
happened and that the broken string is actually supposed to be
interpreted as a single, continuous entity. Therefore, a break should
not be allowed after the first slash. Seeing that there is no space
after the first slash should give the reader a hint that perhaps there
are no spaces after the other slashes either (although this would be
deceiving in file-paths and URLs that _end_ with a slash).

Furthermore, if the last part of the string is also a regular word in
the context language (as "apt" is a word in English), it may not
always be clear whether the part separated by a line break belongs to
the string or to the context. Therefore, a break should not be allowed
after the last slash (nor after the first), but any other slash might
be considered to offer a break opportunity:

/etc/
foobar/apt

This way, the presence of slashes on both lines would give the reader
a hint that the parts probably belong to the same string even
though they are separated by an unconventional line break.
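
The "break after a slash" variant, with the three-slash threshold and
the exclusion of the first and last slash, can be sketched as follows.
The function name and the min_slashes parameter are made up for the
purpose of illustration, and the function expects a single
whitespace-free string such as a file-path.

```python
def breaks_after_slash(s: str, min_slashes: int = 3) -> list[int]:
    """Return the offsets at which a new line may start, after a slash."""
    positions = [i for i, ch in enumerate(s) if ch == "/"]
    if len(positions) < min_slashes:
        return []  # too few slashes: "c/o", "s/he", "/etc/apt"
    # Skip the first and the last slash; break after any slash in between.
    return [p + 1 for p in positions[1:-1]]


path = "/etc/foobar/apt"
for offset in breaks_after_slash(path):
    print(path[:offset])
    print(path[offset:])
# /etc/
# foobar/apt
```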

However, even this solution leaves room for potential confusion.
Sometimes a word is wrapped in slashes as if in parentheses or quotes
-- like /this/ -- in order to simulate the appearance of italics.
Furthermore, according to the International Phonetic Alphabet, slashes
may be used in a similar fashion in order to describe the actual
pronunciation of a word. Thus, there may occur cases such as:

(1) /foobar/ and/or

Now, consider the following file-path:

(2) /foobar/and/or

If broken before "and", both examples will look exactly the same:

/foobar/
and/or

In the first example, a line break after the space that precedes "and"
would be perfectly conventional. In the second example, a line-break
after the second slash would be unconventional and a potential cause
for confusion.

Therefore, a possibly better solution (as suggested above by David E.
Ross) might be to allow breaks only _before_ a slash. In that case,
the second example could be broken in two ways:

/foobar
/and/or

/foobar/and
/or

This should prevent anybody from confusing a file-path with the
special usage of simulating italics or marking pronunciation with
slashes. Also, seeing a line beginning with a slash would warn the
reader that there is something exceptional in the string, and if there
are slashes on the previous line as well, it shouldn't be too hard to
conclude that the strings are somehow linked to each other.
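
The difference between the two variants can be checked with a small
sketch. The helper names are hypothetical, and I assume here that the
three-slash threshold is kept for the "break before a slash" variant
as well. The first assertion shows that examples (1) and (2) become
indistinguishable when the file-path is broken after a slash; the
loop shows the two breaks that remain when breaking only before one.

```python
def wrap_lines(text: str, offset: int) -> list[str]:
    """Break text at offset, dropping a space left at the end of line one."""
    return [text[:offset].rstrip(" "), text[offset:]]


def breaks_before_slash(s: str, min_slashes: int = 3) -> list[int]:
    """Return the offsets of slashes that may start a new line."""
    positions = [i for i, ch in enumerate(s) if ch == "/"]
    if len(positions) < min_slashes:
        return []
    return [p for p in positions if p > 0]  # never break before the string


prose = "/foobar/ and/or"   # example (1): emphasis followed by "and/or"
path = "/foobar/and/or"     # example (2): a file-path

# Breaking (1) after its space and (2) after its second slash looks the same.
assert wrap_lines(prose, prose.index(" ") + 1) == wrap_lines(path, 8)

for offset in breaks_before_slash(path):
    print(wrap_lines(path, offset))
# ['/foobar', '/and/or']
# ['/foobar/and', '/or']
```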


CONCLUSIONS

The default behavior for most Latin characters is to not allow line
breaks either before or after, whereas it seems that for most East
Asian characters the default behavior is to allow line breaks both
before and after. Obviously, if a Latin character is put adjacent to
an East Asian character, their default behaviors conflict. There
should be a consistent rule on how the conflict is resolved.

Since restricting the line breaks appears to be a significant problem
in East Asian languages, perhaps it would be reasonable to allow an
East Asian character to overrule the default breaking behavior of a
Latin character if put adjacent to each other. However, if put into a
Latin context, even a non-Latin character should rather be treated as
a symbol character inherent in Latin scripts (and thus, line breaking
would not be allowed), but this can be specified at the language-
dependent level.
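
A very rough sketch of this conflict rule, for two adjacent non-space
characters: the East Asian character wins unless the surrounding
context is declared to be Latin. Using the East Asian Width property
("W" or "F") to decide what counts as an East Asian character is my
own simplification, and the function names and the latin_context flag
are assumptions. The sketch also ignores the prohibitions that the
East Asian style itself imposes (for example before closing
punctuation).

```python
import unicodedata


def is_east_asian(ch: str) -> bool:
    """Crude test: wide or fullwidth characters count as East Asian."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def break_between(left: str, right: str, latin_context: bool = False) -> bool:
    """Return True if a break is allowed between two adjacent characters."""
    if latin_context:
        # Treated like a symbol inherent in Latin scripts: no break.
        return False
    # Otherwise the East Asian default overrules the Latin default.
    return is_east_asian(left) or is_east_asian(right)


print(break_between("a", "\u6f22"))                      # True
print(break_between("a", "b"))                           # False
print(break_between("a", "\u6f22", latin_context=True))  # False
```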

This approach would not solve the problem that East Asian users
expect even words written with Latin characters to break at any
punctuation, but I'm afraid that this issue cannot be helped without
violating the fundamental logic of Latin scripts.

I have tried to illustrate the conditions of line breaking in Latin
scripts and the potential problems caused by overlooking and adding
exceptions to the conventional rules. Each exception should be
considered very carefully because a relative improvement in
typographical appearance can hardly be justified if the required
adjustments can distort the actual message. The basic function of the
art of typography is to make it easier for the reader to absorb
information. If typographical solutions make the contents more
difficult to understand, it is bad typography.

_______________________________________________
dev-tech-layout mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-tech-layout
