[ https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-3255:
------------------------------------
    Status: Open  (was: Patch Available)

Had a chat with Koji. He pointed out HADOOP-6109, which doubles the size of the byte[] in Text every time an append happens.

Text.java in Hadoop 1.x:
{code}
private void setCapacity(int len, boolean keepData) {
  if (bytes == null || bytes.length < len) {
    byte[] newBytes = new byte[len];
    if (bytes != null && keepData) {
      System.arraycopy(bytes, 0, newBytes, 0, length);
    }
    bytes = newBytes;
  }
}
{code}

Text.java in Hadoop 0.23/2.x:
{code}
private void setCapacity(int len, boolean keepData) {
  if (bytes == null || bytes.length < len) {
    if (bytes != null && keepData) {
      bytes = Arrays.copyOf(bytes, Math.max(len, length << 1));
    } else {
      bytes = new byte[len];
    }
  }
}
{code}

So value.getBytes().length == value.getLength() holds only when the size of the line is less than io.file.buffer.size. Since a copy of the byte[] needs to be created with the right size in any case, we can reuse the same Text object for every getNext() in OutputHandler. This is especially beneficial when record sizes exceed io.file.buffer.size, since the doubling means value.getBytes().length is almost never equal to value.getLength(). I will modify the patch to reuse the Text object.

> Avoid extra byte array copy in streaming deserialize
> ----------------------------------------------------
>
>                 Key: PIG-3255
>                 URL: https://issues.apache.org/jira/browse/PIG-3255
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.12
>
>         Attachments: PIG-3255-1.patch
>
>
> PigStreaming.java:
>
>     public Tuple deserialize(byte[] bytes) throws IOException {
>         Text val = new Text(bytes);
>         return StorageUtil.textToTuple(val, fieldDel);
>     }
>
> Should remove the new Text(bytes) copy and construct the tuple directly from the
> bytes

--
This message is automatically generated by JIRA.
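The doubling behavior above can be demonstrated with a minimal sketch. GrowableBytes below is a simplified stand-in that copies only the 0.23/2.x setCapacity logic quoted in the comment; it is not Hadoop's actual org.apache.hadoop.io.Text, and the append method is an assumed simplification of how Text grows its buffer:

```java
import java.util.Arrays;

// Minimal stand-in for org.apache.hadoop.io.Text growth behavior
// (hypothetical class; only setCapacity mirrors the quoted Hadoop code).
public class GrowableBytes {
    private byte[] bytes;
    private int length;

    // Mirrors Text.setCapacity on Hadoop 0.23/2.x: when existing data must
    // be kept, the backing array grows to at least double its old size,
    // so capacity can exceed the requested len.
    private void setCapacity(int len, boolean keepData) {
        if (bytes == null || bytes.length < len) {
            if (bytes != null && keepData) {
                bytes = Arrays.copyOf(bytes, Math.max(len, length << 1));
            } else {
                bytes = new byte[len];
            }
        }
    }

    public void append(byte[] data) {
        setCapacity(length + data.length, true);
        System.arraycopy(data, 0, bytes, length, data.length);
        length += data.length;
    }

    public int getLength() { return length; }
    public byte[] getBytes() { return bytes; }

    public static void main(String[] args) {
        GrowableBytes t = new GrowableBytes();
        t.append(new byte[10]); // fresh allocation: capacity 10, length 10
        t.append(new byte[1]);  // doubling kicks in: capacity 20, length 11
        // The backing array is now larger than the logical length, which is
        // why getBytes().length == getLength() cannot be relied on and a
        // right-sized copy is unavoidable.
        System.out.println(t.getBytes().length + " vs " + t.getLength());
    }
}
```

This is why the planned fix reuses one Text per getNext() rather than trying to hand the backing array to the tuple directly: after any append-driven doubling, the array carries trailing garbage past getLength().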
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira