willmurnane commented on issue #88:
URL: https://github.com/apache/accumulo-access/issues/88#issuecomment-3578873667

   It looks to me like the access expression parser `ColumnVisibilityParser` in 
Accumulo still assumes that the byte sequences it parses are valid UTF-8. I 
think the best way to handle this in the specification is not to spell out how 
Unicode code points are encoded as UTF-8 bytes; that is 
[easy to find](https://en.wikipedia.org/wiki/UTF-8#Description) and implemented 
in the standard library or string type of many languages. A simpler description 
means there's less to read, less to implement, and probably fewer correctness 
bugs in other implementations.
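
As a rough illustration of leaning on the standard library rather than re-specifying the byte-level encoding, here is a minimal Python sketch. The helper name `is_valid_utf8` is hypothetical and is not part of Accumulo or the accumulo-access specification; it only shows that a strict decode already rejects malformed sequences.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8.

    Hypothetical helper for illustration only. It relies on Python's
    built-in strict decoder, which rejects invalid start bytes,
    truncated sequences, overlong encodings, and encoded surrogates.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("café".encode("utf-8")))  # well-formed input
    print(is_valid_utf8(b"\xc3"))                 # truncated two-byte sequence
```

An implementation in another language could presumably do the same with its own string type, which is the point of keeping the specification's description short.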
   
   If an additional mode is under consideration, I think a Unicode-normalizing 
mode would be a good target. The rules for this kind of normalization are 
complicated, but they are often implemented in the standard library or in an 
easy-to-find package 
([Java](https://docs.oracle.com/javase/8/docs/api/java/text/Normalizer.html#normalize-java.lang.CharSequence-java.text.Normalizer.Form-), 
[Rust](https://docs.rs/unicode-normalization/latest/unicode_normalization/trait.UnicodeNormalization.html#tymethod.nfc), 
[Python](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize), 
etc.).
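
To make the motivation for a normalizing mode concrete, here is a small Python sketch using `unicodedata.normalize`. The function name `normalize_utf8` and the choice of NFC are assumptions made for illustration; the issue does not settle on a normalization form, and this is not how Accumulo behaves today.

```python
import unicodedata


def normalize_utf8(data: bytes, form: str = "NFC") -> bytes:
    """Decode UTF-8 bytes, apply a Unicode normalization form, re-encode.

    Hypothetical helper for illustration only. Raises UnicodeDecodeError
    if `data` is not valid UTF-8.
    """
    text = data.decode("utf-8", errors="strict")
    return unicodedata.normalize(form, text).encode("utf-8")


if __name__ == "__main__":
    # The same visible text, "café", in two different code point sequences.
    composed = "caf\u00e9".encode("utf-8")     # precomposed U+00E9
    decomposed = "cafe\u0301".encode("utf-8")  # 'e' followed by combining acute U+0301

    # A byte-wise comparison treats these as different strings.
    print(composed == decomposed)

    # After normalization both map to the same byte sequence.
    print(normalize_utf8(composed) == normalize_utf8(decomposed))
```

Without normalization, two access expressions that look identical to a person could compare as different at the byte level, which is the situation a normalizing mode would be meant to avoid.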


