ctubbsii commented on issue #88: URL: https://github.com/apache/accumulo-access/issues/88#issuecomment-3644093932
I think these are the actionable items:

1. Be more specific in the EBNF to clarify what kinds of characters are valid (some subset of [Unicode character categories](https://www.unicode.org/reports/tr44/#General_Category_Values)).
   1. EBNF defines characters, not bytes
   2. Characters *MUST* be valid Unicode codepoints
   3. Characters *SHOULD* be human-readable (noncharacters, reserved codepoints, and control characters are not recommended, but technically still allowable, because it'd be hard to check for them and I'm not sure it's worth it)
   4. UTF-8 *SHOULD* be used for persistence
2. Deprecate and phase out APIs that allow inputting Authorizations and ColumnVisibility as bytes
   1. Use String or CharSequence to ensure that what is specified is a sequence of Unicode characters
   2. Add validation to deprecated APIs that still accept bytes
   3. Check on upgrade to see if any existing authorizations contain disallowed characters (unlikely, but not a bad thing to check for)
3. When reading existing data, decode persisted bytes using a UTF-8 decoder, and treat decoding errors as an invalid visibility expression, in the same way that mismatched parentheses would be treated as an invalid visibility expression.

I would be interested in a second beta release of accumulo-access whose API was String-centric, rather than one that allowed arbitrary bytes. I understand that there may be issues with that, but I'd be curious to see if we could make it work.

I think at its core, the main issue here is that we assume people are going to use Unicode characters, but we allow raw bytes, which can break our assumptions.
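To make item 3 and item 1.2 concrete, here is a rough sketch of what strict decoding and a codepoint check could look like. `StrictUtf8`, `decode`, and `isWellFormed` are hypothetical names for illustration, not part of the accumulo-access API, and `IllegalArgumentException` is only a stand-in for whatever exception the library uses to signal an invalid visibility expression.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of accumulo-access.
final class StrictUtf8 {

  private StrictUtf8() {}

  /**
   * Decodes persisted bytes as UTF-8 and fails on malformed input, instead of
   * silently substituting U+FFFD the way new String(bytes, UTF_8) does.
   */
  static String decode(byte[] persisted) {
    try {
      return StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(persisted))
          .toString();
    } catch (CharacterCodingException e) {
      // Assumption: a real implementation would raise the same error it uses
      // for other invalid expressions, such as mismatched parentheses.
      throw new IllegalArgumentException(
          "Invalid visibility expression: not valid UTF-8", e);
    }
  }

  /**
   * Returns true if the sequence contains no unpaired UTF-16 surrogates,
   * i.e. it is a sequence of valid Unicode codepoints.
   */
  static boolean isWellFormed(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
          return false; // high surrogate without a following low surrogate
        }
        i++; // skip the low surrogate that completes the pair
      } else if (Character.isLowSurrogate(c)) {
        return false; // low surrogate without a preceding high surrogate
      }
    }
    return true;
  }
}
```

The second method matters for a String-centric API because a Java `String` or `CharSequence` can still hold unpaired surrogates, so accepting `CharSequence` alone does not guarantee valid codepoints. Neither method checks the *SHOULD* rules (control characters, noncharacters), in keeping with the point above that those may not be worth checking.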
