Re: Setting Charset in getBytes() call.

Christopher Tubbs Fri, 02 Nov 2012 18:54:30 -0700

On Fri, Nov 2, 2012 at 3:56 PM, Benson Margulies <[email protected]> wrote:
> Maybe I'm being particularly dense, but I still think that this is
> being made too complex by failing to enumerate the specific goals.


I agree that there has been a failure to enumerate specific goals with
regard to encoding. I made an attempt to identify the potential goals
(scopes) for which encoding matters, on this ticket:
https://issues.apache.org/jira/browse/ACCUMULO-840

In there, I identify two considerations:
1) API issues addressing consistency for user data (e.g. passwords,
table names, Mutation constructors that take Strings), and
2) INTERNAL issues related to Accumulo storing and reading state that
persists or is communicated between its operating components (a clear
example of this is how we store the !METADATA column family names,
which start out as Java String literals and get encoded to bytes by
the time they are stored in the table).

I think #1 can be addressed by simply waiting until somebody presents
a feature request with a use case, and in the meantime, we simply
don't touch it.

I think #2 can be addressed by establishing an internal policy (along
the lines of our code style standards) stating that Accumulo will
consistently encode String data for its internal use as UTF-8 whenever
we have to store that String as bytes, and that when we convert such
bytes back into Strings, we decode them as UTF-8. If we can agree to
this policy, anything that is actually non-compliant (i.e. where
there's a possibility it won't be stored or read as UTF-8) will simply
be a bug, to which we apply a very narrowly scoped fix that restores
consistency with the policy. I think David has already identified some
such cases and attempted to fix them in the process of working on
ACCUMULO-836. I think those are fine, but they need to be checked to
ensure that when they are converted back to a String, they are read as
UTF-8. However, it might be better if these changes were split into
separate bugs, because even though they are all the same class of bug,
they apply to separate components (e.g. "Potential bug - Inconsistent
encoding with ZooKeeper data", "Potential bug - Inconsistent encoding
with mapreduce configuration", etc.). These bugs can be identified and
fixed as we encounter them, rather than in one attempt to fix the
entire code base. We shouldn't have to spend a lot of time on them...
we should do the simple thing first: establish the policy.
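
To make the proposed policy concrete, here is a minimal sketch of what
"always UTF-8 in both directions" could look like in code. The class
Utf8Policy and its methods are hypothetical names for illustration, not
existing Accumulo code, and the sketch assumes Java 7 or later for
java.nio.charset.StandardCharsets (on older JVMs,
Charset.forName("UTF-8") would be the equivalent).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Accumulo: a single place that applies
// the "internal Strings are always UTF-8" policy in both directions.
public final class Utf8Policy {
  private Utf8Policy() {}

  // Encode with an explicit charset, so the result does not depend on
  // the JVM's platform default (which a bare getBytes() would use).
  public static byte[] toBytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Decode under the same assumption that was used when encoding.
  public static String fromBytes(byte[] b) {
    return new String(b, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String original = "caf\u00e9"; // contains one non-ASCII character
    byte[] stored = toBytes(original);
    System.out.println(stored.length); // 5: the accented letter takes two bytes
    System.out.println(original.equals(fromBytes(stored))); // true
  }
}
```

A non-compliant call site would then be any bare getBytes() or
new String(bytes) on internal state, which is the narrowly scoped kind
of bug described above.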

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
