[jira] [Commented] (TEZ-1937) Reduce cost of merging ifiles in UnorderedPartitionedWriter

Siddharth Seth (JIRA) Mon, 19 Jan 2015 19:24:13 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283355#comment-14283355
 ]


Siddharth Seth commented on TEZ-1937:
-------------------------------------

The counter should account for compression, since it measures bytes read from 
disk. It would be better to update it in the IFile.appendIFile method, so that 
whenever that method is changed to handle compression, updating the counter 
will be an obvious part of the fix.
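
To illustrate the counter point, here is a minimal, self-contained sketch. It 
is not Tez code: CountingInputStream and measure() are hypothetical names, and 
GZIP from the JDK stands in for whatever codec the IFile actually uses. The 
counter wraps the stream underneath the decompressor, so it reports the 
compressed bytes pulled from "disk" rather than the decompressed bytes handed 
to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedBytesCounterSketch {

  /** Hypothetical helper: counts bytes pulled from the wrapped stream. */
  static final class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
      super(in);
    }

    @Override
    public int read() throws IOException {
      int b = super.read();
      if (b >= 0) {
        count++;
      }
      return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      int n = super.read(buf, off, len);
      if (n > 0) {
        count += n;
      }
      return n;
    }

    long getCount() {
      return count;
    }
  }

  /** Returns {bytes read from the compressed source, bytes after decompression}. */
  static long[] measure(byte[] raw) {
    try {
      // Simulate a compressed file on disk with an in-memory buffer.
      ByteArrayOutputStream disk = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(disk)) {
        gz.write(raw, 0, raw.length);
      }
      // The counter sits below the decompressor, not above it.
      CountingInputStream counted =
          new CountingInputStream(new ByteArrayInputStream(disk.toByteArray()));
      long decompressed = 0;
      try (InputStream in = new GZIPInputStream(counted)) {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
          decompressed += n;
        }
      }
      return new long[] {counted.getCount(), decompressed};
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    long[] r = measure(new byte[100_000]); // zeros compress very well
    System.out.println("read from disk: " + r[0] + ", decompressed: " + r[1]);
  }
}
```

With highly compressible input the two numbers differ widely, which is why it 
matters which layer the counter is attached to.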

{code}
+        } else {
+          LOG.warn("Could not obtain decompressor from CodecPool");
+          in = checksumIn;
+        }
{code}
This should throw an exception instead of logging a warning and falling back 
to checksumIn.

{code}
+        prevKey = null;
+        previous.reset();
{code}
Why is this required?

Doesn't each IFile stream (per partition in each spill file) also have a 
checksum associated with it? I believe using partLength will not copy the 
checksum, but is a new checksum being computed for the entire partition stream 
in the writer?
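
The checksum concern can be sketched with a toy model. This is not the IFile 
on-disk format: the 4-byte CRC32 trailer, CHECKSUM_LEN, seal(), verify() and 
merge() are all assumptions made for illustration. Each segment is a payload 
followed by its own checksum; merging copies only the payload bytes and then 
computes one new checksum over the combined stream.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

public class ChecksumMergeSketch {
  // Assumed trailer size for this toy format (one CRC32 value).
  static final int CHECKSUM_LEN = 4;

  static byte[] crc(byte[] data, int off, int len) {
    CRC32 c = new CRC32();
    c.update(data, off, len);
    return ByteBuffer.allocate(CHECKSUM_LEN).putInt((int) c.getValue()).array();
  }

  /** Builds a segment: payload bytes followed by a checksum trailer. */
  static byte[] seal(byte[] payload) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(payload, 0, payload.length);
    out.write(crc(payload, 0, payload.length), 0, CHECKSUM_LEN);
    return out.toByteArray();
  }

  /** Checks that the trailer matches the payload that precedes it. */
  static boolean verify(byte[] segment) {
    int payloadLen = segment.length - CHECKSUM_LEN;
    if (payloadLen < 0) {
      return false;
    }
    byte[] expected = crc(segment, 0, payloadLen);
    byte[] actual = Arrays.copyOfRange(segment, payloadLen, segment.length);
    return Arrays.equals(expected, actual);
  }

  /** Copies payloads only (dropping each trailer), then adds one new checksum. */
  static byte[] merge(byte[]... segments) {
    ByteArrayOutputStream payloads = new ByteArrayOutputStream();
    for (byte[] s : segments) {
      payloads.write(s, 0, s.length - CHECKSUM_LEN);
    }
    return seal(payloads.toByteArray());
  }

  public static void main(String[] args) {
    byte[] a = seal("abc".getBytes());
    byte[] b = seal("def".getBytes());
    System.out.println("merged verifies: " + verify(merge(a, b)));
  }
}
```

In this model the merged stream verifies only because merge() recomputes the 
trailer; whether the writer in the patch does the equivalent is the open 
question.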

Are there any corner cases, e.g. where the same record exists across two 
files, in which RLE will break in any way? I don't think it should.
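
One way to reason about the RLE question, and about the prevKey reset quoted 
earlier, is a toy model of key run-length encoding. It is only a sketch under 
stated assumptions: the "<RLE>" token, the Writer class and appendEncoded() 
are hypothetical and do not mirror IFile's byte format. A repeated key is 
written as a marker meaning "same as the previous key in the stream".

```java
import java.util.ArrayList;
import java.util.List;

public class RleConcatSketch {
  // Hypothetical marker meaning "same key as the previous record".
  static final String SAME_KEY = "<RLE>";

  static final class Writer {
    private final List<String> out = new ArrayList<>();
    private String prevKey;

    /** Appends one key, emitting the marker when it repeats the previous one. */
    void append(String key) {
      if (key.equals(prevKey)) {
        out.add(SAME_KEY);
      } else {
        out.add(key);
        prevKey = key;
      }
    }

    /** Splices in an already-encoded stream without decoding it. */
    void appendEncoded(List<String> encoded, boolean resetPrevKey) {
      out.addAll(encoded);
      if (resetPrevKey) {
        prevKey = null; // the writer no longer knows the last key in the stream
      }
    }

    List<String> tokens() {
      return out;
    }
  }

  static List<String> encode(String... keys) {
    Writer w = new Writer();
    for (String k : keys) {
      w.append(k);
    }
    return w.tokens();
  }

  static List<String> decode(List<String> tokens) {
    List<String> keys = new ArrayList<>();
    String prev = null;
    for (String t : tokens) {
      if (!SAME_KEY.equals(t)) {
        prev = t;
      }
      keys.add(prev);
    }
    return keys;
  }

  /** Writes "a", splices an encoded ["b", "b"], writes "a" again, then decodes. */
  static List<String> splice(boolean resetPrevKey) {
    Writer w = new Writer();
    w.append("a");
    w.appendEncoded(encode("b", "b"), resetPrevKey);
    w.append("a");
    return decode(w.tokens());
  }

  public static void main(String[] args) {
    System.out.println("with reset:    " + splice(true));
    System.out.println("without reset: " + splice(false));
  }
}
```

In this model, concatenating two independently encoded streams is safe even 
when the boundary keys are equal, because each stream starts with a full key. 
What can go wrong is the writer's own stale prevKey after a raw splice: 
without the reset, the final "a" is written as a marker and decodes as "b".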

> Reduce cost of merging ifiles in UnorderedPartitionedWriter
> -----------------------------------------------------------
>
>                 Key: TEZ-1937
>                 URL: https://issues.apache.org/jira/browse/TEZ-1937
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1937.1.patch, TEZ-1937.2.patch, TEZ-1937.WIP.patch
>
>
> Currently we iterate through all spilled files for merging.  This incurs 
> additional deserialization cost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
