[ 
https://issues.apache.org/jira/browse/OAK-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527901#comment-15527901
 ] 

Alexander Klimetschek edited comment on OAK-4857 at 9/28/16 12:49 AM:
----------------------------------------------------------------------

Basically:
* the entire [space separator "Zs" 
category|http://www.fileformat.info/info/unicode/category/Zs/list.htm], with 
the exception of the "normal" {{u20}} space, is not allowed in the middle of a 
node name, but is allowed at the start or end (these are 
{{Character.isSpaceChar}} in Java)
* for regular {{u20}} spaces this is reversed: they are not allowed at the 
beginning or end, but are allowed in the middle (there is an [extra 
check|https://github.com/apache/jackrabbit-oak/blob/2f85bd78b53851fd00d49695712da3094baeb59e/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/name/Namespaces.java#L260]
 just for this space)
* there is a third category, non-breaking spaces ({{ua0}}, {{u2007}} and 
{{u202f}}), which are allowed everywhere (these are {{isSpaceChar}} but not 
{{isWhitespace}})

"True" whitespace such as tabs or newlines is not allowed anywhere since 
OAK-3412.
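
To make the categories above concrete, here is a minimal sketch that sorts a character into one of them using only {{java.lang.Character}}. The class and enum names are hypothetical and not part of Oak; it only illustrates the distinctions described in this comment, not Oak's actual validation code.

```java
/** Hypothetical helper for illustration only; not part of Oak. */
final class SpaceCategories {

    enum Kind { REGULAR_SPACE, NON_BREAKING_SPACE, OTHER_ZS_SPACE, TRUE_WHITESPACE, NOT_A_SPACE }

    /** Sorts a char into the categories discussed in the comment. */
    static Kind classify(char c) {
        if (c == ' ') {
            return Kind.REGULAR_SPACE; // U+0020
        }
        if (Character.getType(c) == Character.SPACE_SEPARATOR) {
            // Zs category: the three no-break spaces are the only Zs chars
            // for which isWhitespace() returns false.
            return Character.isWhitespace(c) ? Kind.OTHER_ZS_SPACE : Kind.NON_BREAKING_SPACE;
        }
        // Tabs, newlines and similar: whitespace, but not in Zs.
        return Character.isWhitespace(c) ? Kind.TRUE_WHITESPACE : Kind.NOT_A_SPACE;
    }

    public static void main(String[] args) {
        char[] samples = { ' ', '\u00A0', '\u2007', '\u202F', '\u3000', '\t', '\n', 'a' };
        for (char c : samples) {
            System.out.printf("U+%04X -> %s%n", (int) c, classify(c));
        }
    }
}
```

The sketch uses {{Character.getType}} rather than {{isSpaceChar}} for the Zs test because {{isSpaceChar}} also returns true for the line and paragraph separators, which are outside Zs.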

Note that the {{oak.allowOtherWhitespaceChars}} setting introduced in OAK-3412 
does not make a difference here: setting it to true gives the pre-1.4 behavior, 
which actually allowed _whitespace_ such as newlines everywhere, while it still 
prevents all Zs spaces.

For reference, Jackrabbit 2 
[seems|https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-spi-commons/src/main/java/org/apache/jackrabbit/spi/commons/conversion/PathParser.java#L257]
 to have the same behavior as Oak post OAK-3412.

See also [this similar 
comment|https://issues.apache.org/jira/browse/OAK-3412?focusedCommentId=14991336&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14991336]
 by [~mevinay].

Technically, things can be slightly confusing in Java with the meaning of 
[Character.isSpaceChar()|https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isSpaceChar(char)]
 vs. 
[Character.isWhitespace()|https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(char)].
 The latter includes the former for the most part, with the 3 non-breaking 
spaces as the exception, and then adds whitespace chars such as tabs and 
newlines. I would argue for treating all {{Character.isSpaceChar()}} 
characters like the normal {{u20}} space.
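
As a rough sketch of what that proposal could look like (an assumption about one possible rule, not Oak's implementation or an agreed design), the check below accepts any {{Character.isSpaceChar()}} character in the middle of a name, rejects it at the start or end, and rejects all other whitespace anywhere. The class and method names are hypothetical, and other JCR name rules (such as the {{/:[]|*}} characters) are deliberately ignored.

```java
/** Hypothetical sketch of the proposed rule; not Oak code. */
final class ProposedSpaceRule {

    /** Checks only the space/whitespace rule, nothing else about JCR names. */
    static boolean hasValidSpaces(String name) {
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean atEdge = i == 0 || i == name.length() - 1;
            if (Character.isSpaceChar(c)) {
                // Proposal: every space char behaves like U+0020,
                // so it is fine in the middle but not at the edges.
                if (atEdge) {
                    return false;
                }
            } else if (Character.isWhitespace(c)) {
                // Tabs, newlines etc. stay forbidden everywhere.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasValidSpaces("a\u3000b")); // ideographic space in the middle
        System.out.println(hasValidSpaces("\u3000a"));  // ideographic space at the start
        System.out.println(hasValidSpaces("a\tb"));     // tab in the middle
    }
}
```

Taking {{isSpaceChar}} literally also lets the line and paragraph separators ({{u2028}}, {{u2029}}) through in the middle of a name; whether that is desirable would need to be decided separately.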


was (Author: alexander.klimetschek):
To be exact, the entire [space separator "Zs" 
category|http://www.fileformat.info/info/unicode/category/Zs/list.htm], with 
the exception of the "normal" {{u20}} space, is affected. For regular {{u20}} 
spaces this is reversed, these are not allowed at the beginning or end, while 
allowed in the middle. Whitespace such as tabs or newlines are not allowed 
anywhere, since OAK-3412.

Note the {{oak.allowOtherWhitespaceChars}} setting introduced in OAK-3412 does 
not make a difference, setting it to true gives the pre-1.4 behavior, which 
actually allowed _whitespace_ such as newlines everywhere, while it still 
prevents all Zs spaces.

For reference, Jackrabbit 2 
[seems|https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-spi-commons/src/main/java/org/apache/jackrabbit/spi/commons/conversion/PathParser.java#L257]
 to have the same behavior as Oak post OAK-3412.

See also [this similar 
comment|https://issues.apache.org/jira/browse/OAK-3412?focusedCommentId=14991336&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14991336]
 by [~mevinay].

Technically, things can be slightly confusing in Java with the meaning of 
[Character.isSpaceChar()|https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isSpaceChar(char)]
 vs. 
[Character.isWhitespace()|https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(char)].
 The latter includes the former for the most part, but with a few exceptions, 
and then adds all the newline etc. whitespace chars. I would be arguing for 
treating all characters of {{Character.isSpaceChar()}} like the normal {{u20}} 
space.

> Support space chars common in CJK inside node names
> ---------------------------------------------------
>
>                 Key: OAK-4857
>                 URL: https://issues.apache.org/jira/browse/OAK-4857
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.4.7, 1.5.10
>            Reporter: Alexander Klimetschek
>         Attachments: OAK-4857-tests.patch
>
>
> Oak does not allow spaces commonly used in CJK like {{u3000}} (ideographic 
> space) or {{u00A0}} (no-break space) _inside_ a node name, while allowing 
> them at the _beginning or end_.
> They should be supported for better globalization readiness. Filesystems 
> allow them, so rejecting them makes common filesystem-to-JCR mappings 
> unnecessarily hard. 
> Escaping would be an option for applications, but there is currently no 
> utility method for it 
> ([Text.escapeIllegalJcrChars|https://jackrabbit.apache.org/api/2.8/org/apache/jackrabbit/util/Text.html#escapeIllegalJcrChars(java.lang.String)]
>  will not escape these spaces), nor is it documented for applications how to 
> do so.
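
For illustration, an application-side escaping helper along the lines mentioned in the description might look like the sketch below. This is purely hypothetical: {{SpaceEscaper}} and its methods do not exist in Jackrabbit or Oak, and the percent-encoding of UTF-8 bytes is an assumed scheme chosen here only because it is reversible. It escapes every Zs character other than the regular {{u20}} space, plus {{%}} itself so that unescaping is unambiguous.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical application-side helper; not part of Jackrabbit or Oak. */
final class SpaceEscaper {

    /** Zs characters other than U+0020, plus '%' to keep escaping reversible. */
    static boolean needsEscape(char c) {
        return c == '%' || (c != ' ' && Character.getType(c) == Character.SPACE_SEPARATOR);
    }

    /** Replaces each char that needs escaping with %XX per UTF-8 byte. */
    static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (needsEscape(c)) {
                for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
                    sb.append(String.format("%%%02X", b & 0xFF));
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Reverses escape(): decodes runs of %XX sequences as UTF-8. */
    static String unescape(String escaped) {
        StringBuilder sb = new StringBuilder();
        ByteArrayOutputStream run = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '%') {
                if (i + 3 > escaped.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                run.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                flush(run, sb);
                sb.append(c);
                i++;
            }
        }
        flush(run, sb);
        return sb.toString();
    }

    /** Decodes any pending escaped bytes and appends them to the result. */
    private static void flush(ByteArrayOutputStream run, StringBuilder sb) {
        if (run.size() > 0) {
            sb.append(new String(run.toByteArray(), StandardCharsets.UTF_8));
            run.reset();
        }
    }
}
```

With this sketch, {{SpaceEscaper.escape("a\u3000b")}} yields {{a%E3%80%80b}}, and {{unescape}} restores the original string. Whether such output interoperates with the existing {{Text}} unescape methods has not been checked here.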



