[
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated PIG-3255:
------------------------------------
Status: Open (was: Patch Available)
Had a chat with Koji. He pointed out HADOOP-6109, which doubles the size of
the byte[] in Text every time an append happens.
Text.java
Hadoop 1.x:
{code}
private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
        byte[] newBytes = new byte[len];
        if (bytes != null && keepData) {
            System.arraycopy(bytes, 0, newBytes, 0, length);
        }
        bytes = newBytes;
    }
}
{code}
Hadoop 0.23/2.x:
{code}
private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
        if (bytes != null && keepData) {
            bytes = Arrays.copyOf(bytes, Math.max(len, length << 1));
        } else {
            bytes = new byte[len];
        }
    }
}
{code}
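To see the consequence of the 0.23/2.x growth policy, here is a standalone sketch (a simplified stand-in for Text, not the real class) that mirrors the setCapacity above and shows the backing array outgrowing the logical length after an append:

```java
import java.util.Arrays;

public class SetCapacitySketch {
    static byte[] bytes = null;
    static int length = 0;

    // Mirror of the 0.23/2.x setCapacity shown above, copied here so the
    // sketch is self-contained.
    static void setCapacity(int len, boolean keepData) {
        if (bytes == null || bytes.length < len) {
            if (bytes != null && keepData) {
                bytes = Arrays.copyOf(bytes, Math.max(len, length << 1));
            } else {
                bytes = new byte[len];
            }
        }
    }

    static void append(byte[] data) {
        setCapacity(length + data.length, true);
        System.arraycopy(data, 0, bytes, length, data.length);
        length += data.length;
    }

    public static void main(String[] args) {
        append(new byte[10]); // capacity becomes 10, length 10
        append(new byte[1]);  // capacity doubles to 20, length only 11
        System.out.println(bytes.length + " " + length); // 20 11
    }
}
```

After the second append, bytes.length (20) no longer equals length (11), which is exactly why getBytes().length cannot be trusted as the record size.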
So value.getBytes().length == value.getLength() will hold only when the size
of the line is < io.file.buffer.size. Since a copy of the byte[] needs to be
created with the right size in any case, we can go with reusing the Text for
every getNext() in OutputHandler. The reuse is even more beneficial when
record sizes are greater than io.file.buffer.size, because the doubling means
value.getBytes().length is almost never equal to value.getLength().
I will modify the patch to reuse the Text object.
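The right-sized copy the comment refers to could look roughly like this. This is a hedged sketch, not OutputHandler's actual code: the variables simulate a reused Text whose backing array has been doubled past the logical length, and Arrays.copyOf trims it before handing the bytes downstream:

```java
import java.util.Arrays;

public class ReuseSketch {
    public static void main(String[] args) {
        // Simulate a reused Text whose backing array was doubled past the
        // logical length by the growth policy above.
        byte[] backing = new byte[16]; // capacity after doubling
        int length = 10;               // number of valid bytes
        for (int i = 0; i < length; i++) {
            backing[i] = (byte) ('a' + i);
        }

        // getBytes().length == getLength() fails here, so a right-sized copy
        // must be made before constructing the tuple from the bytes.
        byte[] trimmed = Arrays.copyOf(backing, length);
        System.out.println(backing.length == length); // false
        System.out.println(trimmed.length == length); // true
    }
}
```

The copy is unavoidable, but by reusing one Text across getNext() calls the per-record Text allocation (and its own internal copy) is saved.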
> Avoid extra byte array copy in streaming deserialize
> ----------------------------------------------------
>
> Key: PIG-3255
> URL: https://issues.apache.org/jira/browse/PIG-3255
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3255-1.patch
>
>
> PigStreaming.java:
>     public Tuple deserialize(byte[] bytes) throws IOException {
>         Text val = new Text(bytes);
>         return StorageUtil.textToTuple(val, fieldDel);
>     }
> Should remove the new Text(bytes) copy and construct the tuple directly from
> the bytes
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira