On Fri, Nov 2, 2012 at 3:56 PM, Benson Margulies <[email protected]> wrote:
> Maybe I'm being particularly dense, but I still think that this is
> being made too complex by failing to enumerate the specific goals.

I agree that there has been a failure to enumerate specific goals with regard to encoding. I made an attempt to identify potential goals (scopes) for which encoding matters on this ticket:

https://issues.apache.org/jira/browse/ACCUMULO-840

In there, I identify two considerations:

1) API issues addressing consistency for user data (e.g. passwords, table names, Mutation constructors that take Strings), and

2) INTERNAL issues related to Accumulo storing and reading state that persists or is communicated between its operating components (a clear example of this is how we store the !METADATA column family names, which start out as Java String literals and are encoded to bytes by the time they are stored in the table).

I think #1 can be addressed by simply waiting until somebody presents a feature request with a use case; in the meantime, we don't touch it.

I think #2 can be addressed by establishing an internal policy (along the lines of our code style standards) that Accumulo will consistently store String data for its internal use as UTF-8 whenever we have to store that String as bytes, and that when we convert such bytes back into Strings, we do so under the assumption that they are UTF-8. If we can agree to this policy, anything that is actually non-compliant (i.e. where there's a possibility it won't be stored or read as UTF-8) will simply be a bug, to which we apply a very narrowly scoped bugfix to ensure consistency with the policy.

I think David has already identified some such cases and attempted to fix them in the process of working on ACCUMULO-836. I think those are fine, but they need to be checked to ensure that when the bytes are converted back to a String, they are read as UTF-8. However, it might be better if these changes were split into separate bugs, because even though they are all the same class of bug, they apply to separate components (e.g. "Potential bug - Inconsistent encoding with Zookeeper data", "Potential bug - Inconsistent encoding with mapreduce configuration", etc.). These bugs can be identified and fixed as we encounter them, rather than as an attempt to fix the entire code base.

We shouldn't have to spend a lot of time on them... we should do the simple thing first: establish the policy.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

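To make the proposed internal policy concrete, here is a minimal sketch of what "store as UTF-8, read as UTF-8" could look like in Java. The class and method names (Utf8PolicySketch, toBytes, fromBytes) are hypothetical and are not taken from the Accumulo code base. It also assumes Java 7 or later for StandardCharsets; on older JVMs the charset name "UTF-8" would have to be passed instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Accumulo: illustrates the proposed policy only.
final class Utf8PolicySketch {

  // Encode an internal String to bytes, always as UTF-8.
  static byte[] toBytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Decode internal bytes back to a String, always assuming UTF-8.
  static String fromBytes(byte[] b) {
    return new String(b, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String original = "caf\u00e9"; // contains one non-ASCII character

    byte[] stored = toBytes(original);
    String readBack = fromBytes(stored);

    System.out.println(stored.length);             // prints 5 (the accented letter takes two bytes)
    System.out.println(original.equals(readBack)); // prints true

    // The kind of call the policy would treat as a bug: the no-argument
    // getBytes() uses the platform default charset, so the stored bytes
    // can differ from one machine to another.
    byte[] platformDependent = original.getBytes();
    System.out.println(platformDependent.length);  // value depends on the JVM's default charset
  }
}
```

Under this reading of the policy, each narrowly scoped bugfix would amount to replacing a charset-less conversion with an explicit UTF-8 one on both the write side and the read side.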