Sounds like we should grep through the codebase and make sure the only charset we're using is UTF-8...
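
For reference, here is a rough, standalone sketch of the mismatch Drew describes below. It avoids the Hadoop dependency by decoding the bytes as UTF-8 directly, which is what Text.toString() does. The class and helper names are made up for illustration and are not part of Accumulo. The row string is Drew's test string written with Unicode escapes, so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Hypothetical helper (not an Accumulo API): encode the String with the
    // given charset, then decode the bytes as UTF-8, mimicking
    // new Text(s.getBytes(charset)).toString().
    static String encodeThenDecodeAsUtf8(String s, Charset encodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Drew's test string, escaped so the source file encoding is irrelevant.
        String row = "\u672c\u6761\u76ee\u662f\u4ecb\u7d39";

        String viaIso = encodeThenDecodeAsUtf8(row, StandardCharsets.ISO_8859_1);
        String viaUtf8 = encodeThenDecodeAsUtf8(row, StandardCharsets.UTF_8);

        // ISO-8859-1 cannot represent these characters, so getBytes substitutes
        // the charset's replacement byte and the original row is lost.
        System.out.println("ISO-8859-1 round trip intact: " + row.equals(viaIso));
        System.out.println("UTF-8 round trip intact: " + row.equals(viaUtf8));
    }
}
```

If this matches what OptUtil is doing, then any start or end row containing characters outside Latin-1 would be silently altered before it ever reaches Accumulo.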

On Sun, May 5, 2013 at 8:08 PM, Christopher <[email protected]> wrote:
> The shell should accept a Java "String" from the console (leaving
> the job of converting input bytes to a Java String argument to the
> locale-dependent console), and should only translate it to UTF-8
> when it sends it to Accumulo, I think.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sun, May 5, 2013 at 6:49 PM, Drew Farris <[email protected]> wrote:
> > In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
> > getEndRow use the following snippet to read their arguments:
> >
> > new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
> >
> > Here, Shell.CHARSET is set to ISO-8859-1.
> >
> > This seems to mean that if I use UTF-8 characters (unescaped) from the
> > shell to set my begin or end row, I will not get what I expect, because
> > the conversion from String to bytes would be performed using the
> > incorrect character set.
> >
> > For example, in the following snippet, testISO fails while testUTF
> > succeeds (when the encoding of the source file is UTF-8):
> >
> > @Test
> > public void testISO() throws Exception {
> >   String s = "本条目是介紹";
> >   String charset = "ISO-8859-1";
> >   Text t = new Text(s.getBytes(charset));
> >   Assert.assertEquals(s, t.toString());
> > }
> >
> > @Test
> > public void testUTF() throws Exception {
> >   String s = "本条目是介紹";
> >   String charset = "UTF-8";
> >   Text t = new Text(s.getBytes(charset));
> >   Assert.assertEquals(s, t.toString());
> > }
> >
> > Possibly this should be locale-dependent behavior? Also, perhaps I'm
> > missing the fact that the Shell is not supposed to support UTF-8
> > characters in start and end ranges, and users must escape their strings
> > appropriately. (Which would be a bit of a pain.)
> >
> > - Drew
