willmurnane commented on issue #88: URL: https://github.com/apache/accumulo-access/issues/88#issuecomment-3578873667
It looks to me like `ColumnVisibilityParser`, the access expression parser in Accumulo, still assumes that the byte sequences it parses are valid UTF-8. I think the best way to handle this in the specification is to leave out the definition of how UTF-8 code points are represented in bytes; that's [easy to find](https://en.wikipedia.org/wiki/UTF-8#Description) and is already implemented in the standard library or string type of many languages. A simpler description means there's less to read, less to implement, and probably better correctness in other implementations.

If an additional mode is under consideration, I think a Unicode-normalizing mode would be a good target. The rules for this kind of normalization are complicated, but they are often implemented in the standard library or in an easy-to-find package ([Java](https://docs.oracle.com/javase/8/docs/api/java/text/Normalizer.html#normalize-java.lang.CharSequence-java.text.Normalizer.Form-), [Rust](https://docs.rs/unicode-normalization/latest/unicode_normalization/trait.UnicodeNormalization.html#tymethod.nfc), [Python](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize), etc.).
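
To make the two suggestions concrete, here is a minimal Python sketch of relying on the standard library for both UTF-8 validation and normalization. The helper name `decode_and_normalize` and the choice of NFC are assumptions made for illustration; neither comes from the accumulo-access specification or from any Accumulo API.

```python
import unicodedata


def decode_and_normalize(raw: bytes) -> str:
    """Decode bytes as strict UTF-8, then apply NFC normalization.

    Hypothetical helper for illustration only; not part of accumulo-access.
    """
    # bytes.decode raises UnicodeDecodeError on invalid UTF-8, so the
    # byte-level encoding rules never have to be restated in a spec.
    text = raw.decode("utf-8", errors="strict")
    # NFC is an assumed choice of normalization form.
    return unicodedata.normalize("NFC", text)


# The same visible text, encoded two different ways:
composed = "caf\u00e9".encode("utf-8")     # precomposed e-acute (U+00E9)
decomposed = "cafe\u0301".encode("utf-8")  # "e" plus combining acute (U+0301)

# The raw bytes differ, but the normalized strings compare equal.
assert composed != decomposed
assert decode_and_normalize(composed) == decode_and_normalize(decomposed)
```

Without a normalizing mode, the two byte sequences above would be treated as different values even though they render identically.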
