On 02/20/2015 04:56 PM, Philippe Verdy wrote:
2015-02-20 6:14 GMT+01:00 Richard Wordingham <[email protected]>:

    TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
    One thing that is missing is mention of the convention that a single
    newline character (or CRLF pair) is a line break whereas a doubled
    newline character denotes a paragraph break.


In that case CR or LF characters alone are not "paragraph separators" by
themselves unless they are grouped together. Like NEL, they should just
be considered line separators, and the terminology used in UAX 29 rule
SB4 is effectively incorrect if what matters here is just the linebreak
property. Also, in that case, rule SB4 should effectively include
NEL (from the C1 subset).

But as SB4 is only related to sentence breaking, it would be a problem
because simple linebreaks are used extremely frequently in the middle of
sentences.

What the sentence break algorithm should say is that there should first
be a preprocessing step separating line breaks from paragraph breaks
(creating custom entities, similar to collation elements but encoded
internally with a code point outside the standard space), which rule
SB4 would use instead of "Sep | CR | LF". That custom entity should be
"Sep", but without a rule defining it, as there are various ways to
represent paragraph breaks.
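
A minimal sketch of such a preprocessing step, under two assumptions that
are not in the proposal above: a single newline function (CR, LF, CRLF or
NEL) is a line break, while a run of two or more is a paragraph break; and
the results are marked with U+2028 and U+2029 rather than with internal
code points outside the standard space. The name classify_breaks is
hypothetical.

```python
import re

# One newline function (NLF). CRLF is listed first so the pair counts once.
NLF = r"(?:\r\n|\r|\n|\u0085)"
NLF_RUN = re.compile(NLF + "+")

LS = "\u2028"  # LINE SEPARATOR, stand-in for the "line break" entity
PS = "\u2029"  # PARAGRAPH SEPARATOR, stand-in for the "paragraph break" entity


def classify_breaks(text):
    """Replace each run of NLFs with LS (single NLF) or PS (two or more)."""
    def replace_run(match):
        count = len(re.findall(NLF, match.group(0)))
        return LS if count == 1 else PS
    return NLF_RUN.sub(replace_run, text)
```

A sentence breaker run on the result could then treat only PS as a
mandatory break. Other paragraph conventions (indented first lines, for
example) would need their own detection in the same step.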


But isn't SB4 contradictory to this from TUS Section 5.8?

R2c In parsing, choose the safest interpretation.

For example, in recommendation R2c an implementer dealing with sentence
break heuristics would reason in the following way that it is safer to
interpret any NLF as LS:

• Suppose an NLF were interpreted as LS, when it was meant to be PS.
  Because most paragraphs are terminated with punctuation anyway, this
  would cause misidentification of sentence boundaries in only a few
  cases.
• Suppose an NLF were interpreted as PS, when it was meant to be LS. In
  this case, line breaks would cause sentence breaks, which would result
  in significant problems with the sentence break heuristics.
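
The two cases can be illustrated with a toy splitter. This is only a
sketch and not the UAX 29 algorithm: it assumes a sentence ends at ".",
"!" or "?" followed by whitespace, that an NLF read as LS behaves like a
space, and that an NLF read as PS always ends a sentence. The name
toy_sentences is hypothetical.

```python
import re

NLF = r"\r\n|\r|\n|\u0085"


def toy_sentences(text, nlf_as):
    """Split text into sentences, reading every NLF as 'LS' or as 'PS'."""
    if nlf_as == "PS":
        # Every NLF is a paragraph break, hence a forced sentence break.
        chunks = re.split(NLF, text)
    else:
        # Every NLF is only a line break, treated here as plain whitespace.
        chunks = [re.sub(NLF, " ", text)]
    sentences = []
    for chunk in chunks:
        for part in re.split(r"(?<=[.!?])\s+", chunk):
            part = part.strip()
            if part:
                sentences.append(part)
    return sentences


wrapped = ("Line breaks often fall\nin the middle of a sentence. "
           "A second one\nfollows.")
print(toy_sentences(wrapped, "LS"))  # two sentences
print(toy_sentences(wrapped, "PS"))  # four fragments
```

On hard-wrapped text the PS reading cuts both sentences in half, while
the LS reading only goes wrong where a paragraph ends without
punctuation, as with a heading such as "Title\nBody text.".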

It seems to me SB4 is choosing the less safe way.  What am I missing?
