Re: Setting Charset in getBytes() call.

Benson Margulies Mon, 29 Oct 2012 09:24:41 -0700

On Mon, Oct 29, 2012 at 12:21 PM, Josh Elser <josh.el...@gmail.com> wrote:
> David, I beg to differ.
>
> Setting it via the JVM property is a single change to make, whereas if you
> change every single usage of getBytes(), you have now forced the next person
> to branch the code, change everything to UTF-16 (hypothetical use case) and
> maintain a diverged codebase forever.


Typically, the reasons people don't take this approach are:

a: a fear that other JVMs don't have this parameter, or don't have it
under the same name.
b: a desire to read or write files for use in 'the platform encoding',
whatever it is, in addition to whatever needs to be done in UTF-8.

I'd be very surprised if Accumulo ever decided to do this sort of
thing in UTF-16.
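
To make the trade-off concrete, here is a minimal sketch (not Accumulo
code; the class and method names are hypothetical) contrasting the
no-argument getBytes(), which follows the JVM default charset, with a
call that names UTF-8 explicitly. It assumes Java 7 or later for
StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Accumulo.
final class EncodingDemo {

    // Uses the JVM default charset, which is typically derived from
    // file.encoding or the OS locale when the JVM starts.
    static byte[] platformBytes(String s) {
        return s.getBytes();
    }

    // Always encodes as UTF-8, regardless of how the JVM was launched.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an accented final letter
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("platform bytes:  " + platformBytes(text).length);
        System.out.println("UTF-8 bytes:     " + utf8Bytes(text).length);
    }
}
```

On JVMs that honor it, launching with a different -Dfile.encoding value
may change the "platform bytes" count, while the UTF-8 count stays the
same. How the property is treated varies by JVM vendor and version,
which is the concern in point (a).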


>
> I would say that the reason such a JVM property exists is to relieve
> you of having to make these code changes in the first place.
>
> On 10/29/2012 12:00 PM, David Medinets wrote:
>>
>> I like the idea of making the change explicit in the source code.
>> Setting the encoding in the jvm property would be easier but not as
>> explicit. I have changed a few dozen of the files. Today I have free
>> time since Hurricane Sandy has closed offices.
>>
>> On Mon, Oct 29, 2012 at 11:39 AM, William Slacum
>> <wilhelm.von.cl...@accumulo.net> wrote:
>>>
>>> Isn't it easier to just set the JVM property `file.encoding`?
>>>
>>> On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ekohl...@gmail.com> wrote:
>>>
>>>> If you use a private static field in each class for the charset, it will
>>>> basically be a singleton because charsets are cached in
>>>> Charset.forName().
>>>> IMHO this is a somewhat cleaner approach than having lots of static
>>>> imports of utility classes with lots of constants in them.
>>>> On Oct 28, 2012 5:50 PM, "David Medinets" <david.medin...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
>>>>>
>>>>> In this comment, John mentioned that all getBytes() method calls
>>>>> should be changed to use UTF8. There are about 1,800 getBytes() calls
>>>>> and not all of them involve String objects. I am working on ways to
>>>>> identify a subset of these calls to change.
>>>>>
>>>>> I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
>>>>> track this issue.
>>>>>
>>>>> Should we create one static Charset object?
>>>>>
>>>>>    class AccumuloDefaultCharset {
>>>>>      public static final Charset UTF8 = Charset.forName("UTF8");
>>>>>    }
>>>>>
>>>>> Should we use a static constant?
>>>>>
>>>>>    public static final String UTF8 = "UTF8";
>>>>>
>>>>> I have found one instance of getBytes() in InputFormatBase:
>>>>>
>>>>>    protected static byte[] getPassword(Configuration conf) {
>>>>>      return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
>>>>>    }
>>>>>
>>>>> Are there any reasons why I can't start specifying the charset? Is
>>>>> UTF8 the right Charset to use? I am not an expert in non-English
>>>>> charsets, so guidance would be welcome.
>>>>>
>>>>
>
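
For reference, the getPassword() snippet quoted above could name its
charset along the lines of the sketch below. This is an illustration
under stated assumptions, not the Accumulo implementation: the class and
method names are hypothetical, the method takes a plain String instead
of a Hadoop Configuration, and java.util.Base64 (Java 8 or later) stands
in for the commons-codec Base64.decodeBase64 call so that the example is
self-contained. It uses a private static field per class, as Ed
suggested.

```java
import java.nio.charset.Charset;
import java.util.Base64;

// Hypothetical class; stands in for the InputFormatBase example.
final class PasswordDecoding {

    // One private constant per class. "UTF8" is an alias that
    // Charset.forName resolves to the UTF-8 charset.
    private static final Charset UTF8 = Charset.forName("UTF8");

    // Decodes a Base64-encoded password. The charset is named
    // explicitly, so the result does not depend on file.encoding.
    static byte[] decodePassword(String encoded) {
        byte[] raw = encoded.getBytes(UTF8);
        // Assumption: java.util.Base64 replaces commons-codec here.
        return Base64.getDecoder().decode(raw);
    }

    public static void main(String[] args) {
        byte[] password = decodePassword("c2VjcmV0");
        System.out.println(new String(password, UTF8));
    }
}
```

Because Base64 text is plain ASCII, this particular call produces the
same bytes under any ASCII-compatible default charset. The explicit
charset matters more for the getBytes() calls that handle arbitrary
user data.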
