What Mark Davis said; also, depending on what you need, consider taking a look at the definitions used by Unicode regular expressions (UTS #18), at http://unicode.org/reports/tr18/.
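
To make that concrete, here is a minimal Python sketch of the two TR18 definitions that seem most relevant, as I read them: the line-boundary set (RL1.6) and `\s` defined as White_Space (Annex C). The code point lists are typed in by hand from your message, plus U+0009 (TAB), which is also White_Space=Y. The names `NEWLINE`, `WHITE_SPACE`, `split_lines` and `is_white_space` are mine, not from any standard or library, so please check the lists against PropList.txt for the Unicode version you target.

```python
import re

# Line boundaries: CRLF is tried first so that it counts as one break.
NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

# White_Space=Y, typed in by hand (assumption: matches current PropList.txt).
WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680,
     *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def split_lines(text):
    """Split text on every newline sequence, treating CRLF as one break."""
    return NEWLINE.split(text)


def is_white_space(ch):
    """True if the single character ch is in the hand-written set above."""
    return ord(ch) in WHITE_SPACE


print(split_lines("a\r\nb\u2028c\x85d"))
print(is_white_space("\u00a0"), is_white_space("\u200b"))
```

I spelled the sets out rather than using `str.splitlines()` or `str.isspace()` because Python's built-ins also accept a few control characters (U+001C to U+001E, for example) that are outside the sets discussed here.
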
2016-08-04 16:37 GMT-03:00 Sean Leonard <lists+unic...@seantek.com>:

> Hi Unicode Folks:
>
> I am trying to come up with a sensible set of characters that are
> considered whitespace or newlines in Unicode, and to understand the
> relative stability policy with respect to them. (This is for a formal
> syntax where the definition of "whitespace" matters, e.g., to separate
> identifiers, and I want to be as conservative as possible.) Please let me
> know if the stuff below is correct, or needs work.
>
> The following characters / sequences are considered line breaking
> characters, per UAX #14 and Section 5.8 of UNICODE:
>
> CRLF CR LF FF VT NEL LS PS
>
> So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the combination
> U+000D U+000A (treated as one line break). These characters / sequences are
> called "newlines".
>
> There will not be any additional code points that are assigned to be line
> breaks. (Correct?)
>
> CRLF, CR, LF, and NEL are also considered "newline functions" or NLF.
> These are distinguished from other codes (above) that also mean line
> breaks, mainly because of historical and widespread use of them.
>
> There are several formatting characters that affect word wrapping and line
> breaking, as discussed in those documents...but they are not line breaking
> characters.
>
> ****
>
> The following characters are whitespaces: characters (code points) with
> the property WSpace=Y (or White_Space). This is:
>
> newlines
> U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>
> Assigned characters that are not listed above, can never be whitespace
> (according to Unicode). However, the set is not closed, so unassigned code
> points *could* be assigned to whitespace. It is (unlikely? very unlikely?
> Pretty much never going to happen?) that additional code points will be
> assigned to whitespace.
>
> ****
>
> There are some other characters that Unicode does not consider whitespace,
> but deserve discussion:
> U+180E MONGOLIAN VOWEL SEPARATOR:
> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
> U+200B ZERO WIDTH SPACE
> U+200C ZERO WIDTH NON-JOINER
> U+200D ZERO WIDTH JOINER
> U+200E LEFT-TO-RIGHT MARK*
> U+200F RIGHT-TO-LEFT MARK*
> U+2060 WORD JOINER
> U+FEFF ZERO WIDTH NO-BREAK SPACE
>
> *These appear in Pattern_White_Space, but Pattern_White_Space excludes
> U+2000-200A characters, which are obviously spaces. This is confusing and I
> would appreciate clarification *why* Pattern_White_Space is significantly
> disjoint from White_Space.
>
> ********
> The borderline characters above are not considered WSpace=Y, but sometimes
> might have space-like properties. ZWSP and ZWNBSP are obviously "space"
> characters, but they never generate whitespace. I suppose that conversely
> LRM and RLM are obviously "not space" characters, but they could generate
> whitespace under certain circumstances. Ditto for other formatting
> characters in general (for which the class is much larger).
>
> Therefore I guess a Unicode definition of "whitespace" (or "space
> characters") is: an assigned code point that *always* (is supposed to)
> generates white space (empty space between graphemes).
>
> ********
>
> Are there other standards that Unicode people recommend, that have
> addressed whether certain borderline characters are considered whitespace
> vs. non-whitespace (e.g., possibly acceptable as an identifier or syntax
> component)?
>
> Regards,
>
> Sean
>
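
On the White_Space versus Pattern_White_Space question, it may help to look at the two sets side by side. The sketch below is only an illustration: both sets are typed in by hand (my assumption is that they match the current PropList.txt), and `BORDERLINE` and `fmt` are names I made up for this message. The general category lookup uses Python's standard `unicodedata` module, so its answers depend on the Unicode version your Python ships with.

```python
import unicodedata

# Both sets typed in by hand (assumption: they match current PropList.txt).
WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680,
     *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)
PATTERN_WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029]
)

# The borderline characters listed in the quoted message.
BORDERLINE = [0x180E, 0x200B, 0x200C, 0x200D, 0x200E, 0x200F, 0x2060, 0xFEFF]


def fmt(code_points):
    """Render code points as a sorted, space-separated U+XXXX list."""
    return " ".join(f"U+{cp:04X}" for cp in sorted(code_points))


print("in both:          ", fmt(WHITE_SPACE & PATTERN_WHITE_SPACE))
print("White_Space only: ", fmt(WHITE_SPACE - PATTERN_WHITE_SPACE))
print("Pattern_WS only:  ", fmt(PATTERN_WHITE_SPACE - WHITE_SPACE))

# Show how each borderline character is classified by each set.
for cp in BORDERLINE:
    print(f"U+{cp:04X} gc={unicodedata.category(chr(cp))} "
          f"WSpace={cp in WHITE_SPACE} Pattern_WS={cp in PATTERN_WHITE_SPACE}")
```

With these lists the two properties share the ASCII controls, SPACE, NEL, LS and PS, and only U+200E and U+200F are in Pattern_White_Space without being White_Space, so "disjoint" is probably too strong a word for the relationship.
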