On Mon, Oct 29, 2012 at 12:21 PM, Josh Elser <josh.el...@gmail.com> wrote:
> David, I beg to differ.
>
> Setting it via the JVM property is a single change to make, whereas if you
> change every single usage of getBytes(), you now forced the next person to
> branch the code, change everything to UTF16 (hypothetical use case) and
> continue a diverged codebase forever.

Typically, the reason(s) that people don't take this approach are:

a: a fear that other JVMs don't have this parameter, or don't have it
under the same name.

b: a desire to read or write files for uses in 'the platform encoding'
whatever it is, in addition to whatever needs to be done in UTF-8.

I'd be very surprised if Accumulo ever decided to do this sort of
thing in UTF-16.

>
> I would say that the reason that such a JVM property exists is to alleviate
> you from having to make these code changes in the first place.
>
> On 10/29/2012 12:00 PM, David Medinets wrote:
>>
>> I like the idea of making the change explicit in the source code.
>> Setting the encoding in the jvm property would be easier but not as
>> explicit. I have a few dozen of the files changed. Today I have free
>> time since Hurricane Sandy has closed offices.
>>
>> On Mon, Oct 29, 2012 at 11:39 AM, William Slacum
>> <wilhelm.von.cl...@accumulo.net> wrote:
>>>
>>> Isn't it easier to just set the JVM property `file.encoding`?
>>>
>>> On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ekohl...@gmail.com> wrote:
>>>
>>>> If you use a private static field in each class for the charset, it will
>>>> basically be a singleton because charsets are cached in char
>>>> set.forname.
>>>> IMHO this is a somewhat cleaner approach than having lots of static
>>>> imports
>>>> to utility classes with lots of constants in them.
>>>> On Oct 28, 2012 5:50 PM, "David Medinets" <david.medin...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>
>>>> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
>>>>>
>>>>>
>>>>> In this comment, John mentioned that all getBytes() method calls
>>>>> should be changed to use UTF8. There are about 1,800 getBytes() calls
>>>>> and not all of them involve String objects. I am working on ways to
>>>>> identify a subset of these calls to change.
>>>>>
>>>>> I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
>>>>> track this issue.
>>>>>
>>>>> Should we create one static Charset object?
>>>>>
>>>>> Class AccumuloDefaultCharset {
>>>>>   public static Charset UTF8 = Charset.forName("UTF8");
>>>>> }
>>>>>
>>>>> Should we use a static constant?
>>>>>
>>>>>   public static String UTF8 = "UTF8";
>>>>>
>>>>> I have found one instance of getBytes() in InputFormatBase:
>>>>>
>>>>>   protected static byte[] getPassword(Configuration conf) {
>>>>>     return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
>>>>>   }
>>>>>
>>>>> Are there any reasons why I can't start specifying the charset? Is
>>>>> UTF8 the right Charset to use? I am not an expert in non-English
>>>>> charsets, so guidance would be welcome.
>>>>>
>>>>
>
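
For anyone comparing the two options in this thread, here is a minimal sketch of the explicit-charset approach, using a private static field per class as Ed suggested. It is only an illustration, not Accumulo code: the class name ExplicitCharsetSketch and the methods toUtf8Bytes and decodePassword are hypothetical, a plain String stands in for the Hadoop Configuration lookup, and java.util.Base64 (Java 8+) is swapped in for the commons-codec Base64 used in getPassword so that the example has no external dependencies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical class for illustration only; not part of Accumulo.
class ExplicitCharsetSketch {

    // Per-class constant. StandardCharsets.UTF_8 (Java 7+) avoids the
    // lookup by name that Charset.forName("UTF8") performs.
    private static final Charset UTF8 = StandardCharsets.UTF_8;

    // Encodes with an explicit charset, so the result does not depend
    // on the JVM's file.encoding setting.
    static byte[] toUtf8Bytes(String s) {
        return s.getBytes(UTF8);
    }

    // Assumed stand-in for getPassword(): the Base64 text is passed in
    // directly instead of being read from a Configuration object.
    static byte[] decodePassword(String base64Text) {
        return Base64.getDecoder().decode(base64Text.getBytes(UTF8));
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9"; // contains one non-ASCII character

        byte[] explicit = toUtf8Bytes(sample);
        byte[] platform = sample.getBytes(); // uses the platform default charset

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("explicit UTF-8 byte count: " + explicit.length);
        System.out.println("matches platform-default bytes: "
                + Arrays.equals(explicit, platform));
        System.out.println("decoded password: "
                + new String(decodePassword("c2VjcmV0"), UTF8));
    }
}
```

Running this with different -Dfile.encoding values should change only the "matches platform-default bytes" line, which is the dependency that specifying the charset at each call site is meant to remove.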