[
https://issues.apache.org/jira/browse/PIG-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pradeep Kamath updated PIG-494:
-------------------------------
Assignee: Pradeep Kamath
Status: Patch Available (was: Open)
Attached a patch. I used
{code}
String(byte[] bytes, String charsetName)
Constructs a new String by decoding the specified array of bytes
using the specified charset.
{code}
instead of using CharsetDecoder.
Likewise, I had to change PigStorage to use
{code}
getBytes(String charsetName)
Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.
{code}
In both of the above calls I use "UTF-8" as the charset name.
With these changes, users of PigStorage need to be aware that, for chararray
fields, PigStorage assumes its input data is encoded in UTF-8 and writes its
output in UTF-8.
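As a rough illustration of the two calls described above (this is a sketch, not the attached patch; the class name Utf8RoundTrip and the helper charArrayToBytes are hypothetical, and bytesToCharArray here only mimics the behaviour discussed in this issue), decoding and encoding with an explicit "UTF-8" charset round-trips non-ASCII data, while decoding the same bytes with a different charset does not:

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Hypothetical stand-in for the conversion discussed in this issue:
    // decode bytes with an explicit charset instead of the platform default.
    static String bytesToCharArray(byte[] bytes) throws UnsupportedEncodingException {
        return new String(bytes, "UTF-8");
    }

    // Hypothetical counterpart for the storage side: encode with the same charset.
    static byte[] charArrayToBytes(String s) throws UnsupportedEncodingException {
        return s.getBytes("UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] input = {(byte) 0xC3, (byte) 0xA9};

        String decoded = bytesToCharArray(input);
        System.out.println(decoded.length());          // 1
        System.out.println(decoded.equals("\u00E9")); // true

        // Decoding the same bytes as ISO-8859-1 yields two unrelated characters,
        // which is the kind of corruption a mismatched default charset can cause.
        String latin1 = new String(input, "ISO-8859-1");
        System.out.println(latin1.length());           // 2

        // Encoding with the same charset restores the original bytes.
        System.out.println(Arrays.equals(input, charArrayToBytes(decoded))); // true
    }
}
```

Because new String(byte[]) with no charset argument uses the platform default, its result can vary between machines; naming the charset on both the read and write paths keeps the two sides consistent.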
> Utf8StorageConverter.bytesToCharArray does not properly do utf8 conversions
> ---------------------------------------------------------------------------
>
> Key: PIG-494
> URL: https://issues.apache.org/jira/browse/PIG-494
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Reporter: Alan Gates
> Assignee: Pradeep Kamath
> Fix For: types_branch
>
>
> This function just does new String(bytes[]). It needs instead to use a
> CharsetDecoder (see BufferedPositionedInputStream.readLine in pig 1.x). This
> causes non-ascii characters to be incorrectly translated from byte arrays to
> strings.