Re: Setting Charset in getBytes() call.

David Medinets Mon, 29 Oct 2012 13:29:34 -0700

Anytime that I've encountered non-English character sets, the answer
has been to use UTF-8. I'm moving forward with that assumption since
it is a safe change. If the group decides to use a different default
encoding, it will be trivial to build on the work that I've done
identifying getBytes() calls. I will post a list of files and my
methodology before an svn checkin.
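
For reference, a minimal sketch of the kind of change involved. This is
not taken from the Accumulo source; the class and method names are
hypothetical, and it assumes Java 7's StandardCharsets is available (on
older JVMs, Charset.forName("UTF-8") would be the equivalent). A bare
getBytes() uses the platform default charset, so the same string can
produce different bytes on different machines; passing the charset
explicitly removes that dependency.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class ExplicitCharsetExample {

    // Encode with an explicit charset so the result does not depend on
    // the JVM's platform default encoding.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset.
    static String fromUtf8(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] encoded = toUtf8(text);
        // The accented character becomes the two bytes 0xC3 0xA9.
        System.out.println(Arrays.toString(encoded)); // [99, 97, 102, -61, -87]
        System.out.println(fromUtf8(encoded).equals(text)); // true
    }
}
```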


On Mon, Oct 29, 2012 at 4:02 PM, Benson Margulies <[email protected]> wrote:
> On Mon, Oct 29, 2012 at 3:18 PM, John Vines <[email protected]> wrote:
>> So perhaps we should have ISO-8859-1 as the standard. Mike- do you see any
>> reason to use something besides ISO-8859-1 for the encodings?
>
> I object and caution against *any* plan that involves transcoding from
> X to UTF-16 and back when the data is not always going to be
> valid bytes of encoding X. The only clean solution here is to have an
> API entirely in terms of bytes, and either let the user do getBytes if
> they want to store string data, or provide additional API.
>
>
>
>>
>> John
>>
>> On Mon, Oct 29, 2012 at 3:14 PM, Michael Flester <[email protected]> wrote:
>>
>>> > UTF-8 should always be present (according to the JLS), and as a
>>> multi-byte
>>> > format should be able to encode any character that you would need to.
>>> >
>>>
>>> UTF-8 cannot encode arbitrary data. Not all of the data we store in
>>> Accumulo is characters. A safe encoding to use as a pass through when you
>>> don't know if you are dealing with characters is ISO-8859-1 since we know
>>> that we can make the round trip from bytes to string to bytes without loss.
>>>
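
The round-trip point quoted above can be checked with a small sketch.
The class and method names here are hypothetical and the byte values are
arbitrary; it again assumes Java 7's StandardCharsets. ISO-8859-1 maps
every byte value to exactly one char, so bytes -> String -> bytes is
lossless. Decoding as UTF-8 replaces malformed sequences with U+FFFD, so
arbitrary binary data generally does not survive the trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check, for illustration only.
class RoundTripExample {

    // Returns true if decoding and re-encoding with the given charset
    // reproduces the original bytes exactly.
    static boolean survivesRoundTrip(byte[] data, Charset cs) {
        String decoded = new String(data, cs);
        return Arrays.equals(data, decoded.getBytes(cs));
    }

    public static void main(String[] args) {
        // Arbitrary binary data that is not valid UTF-8.
        byte[] data = {(byte) 0x00, (byte) 0x80, (byte) 0xFF, (byte) 0xC3};

        System.out.println(survivesRoundTrip(data, StandardCharsets.ISO_8859_1)); // true
        System.out.println(survivesRoundTrip(data, StandardCharsets.UTF_8));      // false
    }
}
```

This only shows that ISO-8859-1 is a lossless pass-through; it does not
address the separate suggestion in the thread that the API be expressed
entirely in terms of bytes.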
