2011/10/10 Eli Zaretskii <e...@gnu.org>:
>> what's the meaning of 'appropriate Newline Functions' and 'higher-level
>> protocol paragraph determination'?
>
> Newline Function (NLF) is described in Section 5.8 of Unicode.
> Higher-level protocols are described in section 4.3 of UAX#9. In a
> nutshell, your application can have its own ideas of what begins and
> what ends a paragraph, and you are allowed to use those rules instead
> of what P3 says.
I interpret that sentence as covering all the other non-plain-text mechanisms available in various file formats or interchange protocols, such as HTML. But even with HTML (and XML as well), you also have to consider how whitespace behaves: all those newlines and other whitespace characters are collapsed by default, unless there is an XML attribute (in XHTML) saying the opposite, or a default style associated with some HTML elements (for example "pre" elements, where whitespace collapsing is not the default). Add to this the additional protocol implied by CSS (which allows changing the whitespace behavior through a stylesheet), and the classification of whitespace cannot be resolved at the level of the encoded document, but only after it has been parsed and even contextually styled (and this behavior can even be changed dynamically).

In other words: the rich-text protocol applies its own interpretation first, and then exposes the document through its internal temporary state, in which paragraphs (or "blocks") are separated from "inline" elements and plain-text elements. The Unicode algorithms then apply only to the many small plain-text fragments, each of which is only a part of the document. In many cases, in those formats, you will never see any newline or paragraph separator in those plain-text elements or plain-text attribute values. Instead, you will have to work with the other out-of-band information exposed by the dynamic DOM, for which the Unicode standard cannot set a standard, only some guidelines.
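As a rough illustration of the point above, here is a minimal sketch in Python in which the markup, not the newlines in the source, decides where paragraphs begin and end. It relies only on the standard library's html.parser. The BLOCK_TAGS set, the BlockTextExtractor class and the paragraphs_from_html helper are names made up for this example, and the hard-coded list of block elements is an assumption: a real user agent derives block versus inline from the computed style, which this sketch ignores entirely.

```python
import re
from html.parser import HTMLParser

# Assumption: a tiny, hard-coded subset of block-level elements. A real
# user agent decides this from the computed CSS 'display' value instead.
BLOCK_TAGS = {"p", "div", "pre", "li", "h1", "h2", "blockquote"}


class BlockTextExtractor(HTMLParser):
    """Collect the plain-text content of each block as one 'paragraph'."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished plain-text fragments
        self._buffer = []      # text runs of the block currently being read
        self._pre_depth = 0    # greater than 0 while inside a <pre> element

    def _flush(self):
        text = "".join(self._buffer)
        self._buffer = []
        if self._pre_depth == 0:
            # Default HTML behavior: runs of whitespace, newlines included,
            # collapse to a single space.
            text = re.sub(r"\s+", " ", text).strip()
        if text:
            self.paragraphs.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()          # a new block ends the previous fragment
        if tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()          # flush while still in the block's mode
        if tag == "pre" and self._pre_depth > 0:
            self._pre_depth -= 1

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def paragraphs_from_html(source):
    """Return the list of plain-text fragments, one per block (hypothetical helper)."""
    parser = BlockTextExtractor()
    parser.feed(source)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    sample = "<p>first\nline,\n   same paragraph</p>\n<pre>kept\n  as is</pre>\n<p>second</p>"
    for fragment in paragraphs_from_html(sample):
        print(repr(fragment))
```

The newlines inside the first "p" element never reach the extracted fragment, while those inside "pre" survive. Only those fragments are what an algorithm such as the bidi algorithm would get to see, with the paragraph boundaries supplied out of band by the element structure.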
There are other particularities that cannot be represented in plain text. For example, the "<br/>" element does not convert exactly to any newline or paragraph separator, because the rich-text document has a more complex structure in which blocks can be embedded in other, larger blocks, and plain text cannot clearly indicate how a newline or paragraph separator restructures the document; the block embedding level, for instance, is lost completely in a conversion from rich text to plain text. With the richer set of HTML5 "semantic elements", this is even more evident: the interpretation can be fully specified, but it does not fix any presentation (which remains fully stylable, so that all elements may be visually reordered and repositioned on the rendered page, contextually hidden or reorganized according to user preferences, or selectively displayed and used when reimporting parts of the document into another one).

My opinion is that the Unicode standard should avoid adding constraints on those rich-text formats. It should focus only on the content of plain-text elements, when they are exposed by a mechanism like the DOM in XML and HTML and those standards do not define any other specific behavior. The TUS should only be there to define the default interpretation and nothing else (if it says something more, it should be merely informative, to help maintain some limited level of interoperability, but not normative, as there will be lots of reasonable exceptions).
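To make the remark about the lost block embedding level concrete, here is a small sketch, again in Python. The Block class and the to_plain_text function are hypothetical and model nothing more than a tree of nested blocks; the choice of a single newline as the separator is also an assumption, and any other separator would show the same effect.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical rich-text block: optional text plus nested child blocks."""
    text: str = ""
    children: list = field(default_factory=list)


def flatten(block):
    """Return the text of a block and of all its descendants, in document order."""
    lines = [block.text] if block.text else []
    for child in block.children:
        lines.extend(flatten(child))
    return lines


def to_plain_text(block):
    """Convert a block tree to plain text, one line per non-empty block."""
    return "\n".join(flatten(block))


# Two different structures: "b" and "c" are nested one level deeper in the first.
nested = Block(children=[Block("a"), Block(children=[Block("b"), Block("c")])])
flat = Block(children=[Block("a"), Block("b"), Block("c")])

if __name__ == "__main__":
    print(nested == flat)                                # the trees differ
    print(to_plain_text(nested) == to_plain_text(flat))  # the plain text does not
```

Both trees produce the same three lines of plain text, so nothing in the output tells a reader, or a Unicode algorithm, which structure it came from.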