[
https://issues.apache.org/jira/browse/TEZ-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053870#comment-14053870
]
Rohini Palaniswamy commented on TEZ-1228:
-----------------------------------------
Thanks [~gopalv] and [~rajesh.balamohan]. This file format design addresses
what I have been asking for on the Pig side. I dug up the old mail I sent and
realized that I had also proposed changing KeyValueWriters to be
KeyValuesWriter, i.e. also supporting writes of key,list<value> and/or
key,iterator<value> in addition to key,value, so that whenever possible Pig can
write a list of values instead of unwrapping the list and writing it as
key,value pairs. That could be a follow-up enhancement JIRA, but it would be
nice if you keep that in mind when making changes to the sorter, to enable that
in the future.
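
To make the proposal concrete, here is a rough sketch of what such a writer
could look like. The names KeyValuesWriterSketch, CountingWriter and
KeyValuesWriterDemo are made up for illustration and are not the Tez API; the
toy writer only counts how often a key is emitted, assuming the key is written
once per call.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical interface for illustration only; not the actual Tez API.
interface KeyValuesWriterSketch<K, V> {
    // Existing style: one record per (key, value) pair.
    void write(K key, V value);

    // Proposed style: one key followed by all of its values.
    void write(K key, Iterable<V> values);
}

// Toy in-memory writer that only counts emitted keys and values.
class CountingWriter<K, V> implements KeyValuesWriterSketch<K, V> {
    int keysWritten = 0;
    int valuesWritten = 0;

    @Override
    public void write(K key, V value) {
        keysWritten++;
        valuesWritten++;
    }

    @Override
    public void write(K key, Iterable<V> values) {
        keysWritten++; // the key is emitted once for the whole list
        for (V ignored : values) {
            valuesWritten++;
        }
    }
}

class KeyValuesWriterDemo {
    // Caller unwraps its bag and writes it pair by pair.
    static CountingWriter<String, Integer> writeUnwrapped(String key, List<Integer> bag) {
        CountingWriter<String, Integer> writer = new CountingWriter<>();
        for (Integer value : bag) {
            writer.write(key, value);
        }
        return writer;
    }

    // Caller hands the whole bag to the writer in a single call.
    static CountingWriter<String, Integer> writeGrouped(String key, List<Integer> bag) {
        CountingWriter<String, Integer> writer = new CountingWriter<>();
        writer.write(key, bag);
        return writer;
    }

    public static void main(String[] args) {
        List<Integer> bag = Arrays.asList(1, 2, 3);
        System.out.println("unwrapped keys: " + writeUnwrapped("k", bag).keysWritten);
        System.out.println("grouped keys:   " + writeGrouped("k", bag).keysWritten);
    }
}
```

With the list-based call the caller no longer has to unwrap its bag, and the
writer sees the key once per group instead of once per value.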
> Prototype IFile : Define a memory & merge optimized vertex-intermediate file
> format for Tez
> -------------------------------------------------------------------------------------------
>
> Key: TEZ-1228
> URL: https://issues.apache.org/jira/browse/TEZ-1228
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Labels: perfomance
> Attachments: TEZ-1228-IFile.pdf, TEZ-1228.1.patch,
> TEZ-1228.WIP.1.patch, TEZ-1228.WIP.2.patch
>
>
> The current vertex-intermediate format used all across Tez is a flat file of
> variable-length k,v pairs. In a significant number of use cases, in
> particular the sorted output phase, a large number of consecutive identical
> keys are found within the same stream. The IFile format ends up writing each
> key out fully into the stream to generate (K,V) pairs instead of organizing
> them into a more efficient K, {V1, .. Vn} list.
> This duplication of key data needs larger buffers to hold in memory and
> requires comparisons between keys known to be identical while doing a merge
> sort.
> This bug tracks the building of a prototype IFile format that is optimized
> for lower uncompressed sizes within memory buffers and for less
> compute-intensive merge sorts during the reducer phase.
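
The K, {V1, .. Vn} layout described in the quoted issue can be illustrated
with a small sketch. This is not the prototype from the attached patches; the
class GroupedRunSketch and its methods are hypothetical, and string lengths
stand in for serialized key sizes. It assumes the input is already sorted by
key, so identical keys are adjacent.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the format implemented in the TEZ-1228 patches.
class GroupedRunSketch {
    // Collapses consecutive entries with equal keys into one (key, values) group.
    static List<Map.Entry<String, List<String>>> groupConsecutive(
            List<Map.Entry<String, String>> sortedPairs) {
        List<Map.Entry<String, List<String>>> groups = new ArrayList<>();
        String currentKey = null;
        List<String> currentValues = null;
        for (Map.Entry<String, String> pair : sortedPairs) {
            if (currentValues == null || !pair.getKey().equals(currentKey)) {
                // A new key starts a new group; the key is stored only once.
                currentKey = pair.getKey();
                currentValues = new ArrayList<>();
                groups.add(new SimpleEntry<>(currentKey, currentValues));
            }
            currentValues.add(pair.getValue());
        }
        return groups;
    }

    // Key characters written by the flat (K,V) layout: every pair repeats its key.
    static int flatKeyChars(List<Map.Entry<String, String>> pairs) {
        int total = 0;
        for (Map.Entry<String, String> pair : pairs) {
            total += pair.getKey().length();
        }
        return total;
    }

    // Key characters written by the grouped layout: one key per group.
    static int groupedKeyChars(List<Map.Entry<String, List<String>>> groups) {
        int total = 0;
        for (Map.Entry<String, List<String>> group : groups) {
            total += group.getKey().length();
        }
        return total;
    }

    // Small sorted sample with a run of three identical keys.
    static List<Map.Entry<String, String>> samplePairs() {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("apple", "1"));
        pairs.add(new SimpleEntry<>("apple", "2"));
        pairs.add(new SimpleEntry<>("apple", "3"));
        pairs.add(new SimpleEntry<>("pear", "4"));
        return pairs;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = samplePairs();
        List<Map.Entry<String, List<String>>> groups = groupConsecutive(pairs);
        System.out.println("flat key chars:    " + flatKeyChars(pairs));
        System.out.println("grouped key chars: " + groupedKeyChars(groups));
    }
}
```

A merge over grouped data would also compare one key per group rather than
one per pair, which is where the reduced comparison cost comes from.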