Any time I've encountered non-English character sets, the answer has been to use UTF-8. I'm moving forward with that assumption since it is a safe change. If the group decides to use a different default encoding, it will be trivial to build on the work that I've done identifying getBytes() calls. I will post a list of files and my methodology before an svn checkin.
On Mon, Oct 29, 2012 at 4:02 PM, Benson Margulies <[email protected]> wrote:
> On Mon, Oct 29, 2012 at 3:18 PM, John Vines <[email protected]> wrote:
>> So perhaps we should have ISO-8859-1 as the standard. Mike- do you see any
>> reason to use something beside ISO-8859-1 for the encodings?
>
> I object and caution against *any* plan that involves transcoding from
> X to UTF-16 and back where when the data is not always going to be
> valid bytes of encoding X. The only clean solution here is to have an
> API entirely in terms of bytes, and either let the user do getBytes if
> they want to store string data, or provide additional API.
>
>>
>> John
>>
>> On Mon, Oct 29, 2012 at 3:14 PM, Michael Flester <[email protected]> wrote:
>>
>>> > UTF-8 should always be present (according to the JLS), and as a multi-byte
>>> > format should be able to encode any character that you would need to.
>>> >
>>>
>>> UTF-8 cannot encode arbitrary data. All data that we store in accumulo
>>> is not characters. A safe encoding to use as a pass through when you
>>> don't know if you are dealing with characters is ISO-8859-1 since we know
>>> that we can make the round trip from bytes to string to bytes without loss.
>>>
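
For anyone who wants to see the round-trip point from the quoted thread concretely, here is a minimal sketch. The class name RoundTripDemo, the helper roundTrips, and the sample bytes are made up for illustration and are not taken from the Accumulo code base. It decodes a byte array that is not valid UTF-8 into a String and encodes it back, once per charset, and checks whether the original bytes survive.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {

    // Hypothetical helper: decode the bytes with the given charset, encode the
    // resulting String back, and report whether the bytes are unchanged.
    static boolean roundTrips(byte[] data, Charset charset) {
        String decoded = new String(data, charset);
        byte[] encoded = decoded.getBytes(charset);
        return Arrays.equals(data, encoded);
    }

    public static void main(String[] args) {
        // Arbitrary binary data. 0xC3 followed by 0x28 is a malformed UTF-8
        // sequence, and 0xFF never appears in valid UTF-8.
        byte[] raw = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x00, 0x41};

        // ISO-8859-1 maps every byte value to exactly one char and back.
        System.out.println("ISO-8859-1 round trip intact: "
                + roundTrips(raw, StandardCharsets.ISO_8859_1));

        // The UTF-8 decoder replaces malformed input with U+FFFD, so the
        // re-encoded bytes differ from the original.
        System.out.println("UTF-8 round trip intact: "
                + roundTrips(raw, StandardCharsets.UTF_8));
    }
}
```

Passing the Charset explicitly, rather than calling getBytes() with no argument, also keeps the result independent of the platform default encoding. None of this contradicts Benson's point: an API purely in terms of bytes sidesteps the question altogether.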
