[ 
https://issues.apache.org/jira/browse/TEZ-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609529#comment-14609529
 ] 

Rajesh Balamohan commented on TEZ-2575:
---------------------------------------

bq. I had one question about your proposed approach. In the worst case, there 
will be bufferList.size() additional spills for each KV pair that doesn't 
fit into the buffer (e.g. say there are 4 blocks in the buffer list and the first 
allocated span is empty. If the KV doesn't fit into this span, the collect() 
method will be called recursively 4 times, and 4 empty spills will happen until 
the bufferOverflowRecursion > bufferList.size() condition is hit, and only then 
will the KV be spilled to disk). By indicating a status in sort(), I am trying to 
catch the condition early, to avoid extra spills.
- Correct. We need to add "(span.length() == 0)" as an OR condition in that 
statement. We also need a check in sort() so that it sorts/spills only when 
elements have been added to the span (i.e. span.length() > 0). This would take 
care of the next comment as well.
- The patch might not use all the blocks that are available before doing a 
single-record spill. E.g. consider a 512MB sort space with a 256MB block size. If 
we write a KV pair (2x20MB) and then try to write another KV pair (2x110MB), it 
would throw BufferOverflowException. After sorting, it would try to create 
another span from the remaining space in the first block (i.e. ~256 - 40 = 216MB). 
Even with this, the new 220MB KV pair would not fit. As per the current logic 
in the patch, it would end up spilling even though the same KV pair could have 
been accommodated in the next span from the next block.
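
To make the two points above concrete, here is a rough sketch. It is not code from the patch: the class SpillSketch and its method names are hypothetical, metadata overhead is ignored, and only the conditions and the sizes are taken from the comments above.

```java
class SpillSketch {
    static final long MB = 1024L * 1024L;

    // First point: fall back to a single-record spill as soon as the span is
    // empty, instead of waiting for the recursion counter to exceed the list size.
    static boolean spillSingleRecord(int spanLength, int bufferOverflowRecursion,
                                     int bufferListSize) {
        return spanLength == 0 || bufferOverflowRecursion > bufferListSize;
    }

    // First point, sort() side: only sort/spill when the span holds elements.
    static boolean shouldSortAndSpill(int spanLength) {
        return spanLength > 0;
    }

    // Second point: does a KV pair fit into the given free space?
    static boolean fits(long kvBytes, long freeBytes) {
        return kvBytes <= freeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 256 * MB;
        long firstKv = 2 * 20 * MB;    // 40MB, already written
        long secondKv = 2 * 110 * MB;  // 220MB, the pair that overflows
        long restOfFirstBlock = blockSize - firstKv; // 216MB

        // The remainder of the first block is too small...
        System.out.println(fits(secondKv, restOfFirstBlock)); // false
        // ...but an untouched second block would hold the pair.
        System.out.println(fits(secondKv, blockSize));        // true
    }
}
```

Under these assumptions, the single-record spill in the 512MB example is avoidable: the second block has enough room, so trying the next block before spilling would keep the pair in memory.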


> Handle KeyValue pairs size which do not fit in a single block
> -------------------------------------------------------------
>
>                 Key: TEZ-2575
>                 URL: https://issues.apache.org/jira/browse/TEZ-2575
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Saikat
>            Assignee: Saikat
>         Attachments: TEZ-2575.1.patch, TEZ-2575.2.patch, TEZ-2575.patch
>
>
> In the present implementation, the available buffer is divided into blocks 
> (specified in the constructor for pipeline sort), and a linked list of these 
> block byte buffers is maintained. 
> A span is created out of the buffers. 
> The present logic does not handle the scenario where a single key-value pair 
> does not fit into any of the blocks.
> For example, if 1 MB of total memory is divided into 4 blocks (256 KB each) 
> and a single KV pair is larger than the block size (ignoring metadata size), 
> then it fails with buffer exceptions.
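
The failure described in the issue can be reproduced in miniature with plain java.nio buffers. This is only an illustration of the symptom, not the pipeline sorter itself: the class BlockSketch and its methods are hypothetical, and the 300 KB record size is an assumed value chosen to exceed the 256 KB block size from the example.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.util.LinkedList;
import java.util.List;

class BlockSketch {
    // Split the total memory into equally sized blocks kept in a linked list.
    static List<ByteBuffer> allocateBlocks(int totalBytes, int numBlocks) {
        List<ByteBuffer> blocks = new LinkedList<>();
        int blockSize = totalBytes / numBlocks;
        for (int i = 0; i < numBlocks; i++) {
            blocks.add(ByteBuffer.allocate(blockSize));
        }
        return blocks;
    }

    // Try to write one serialized KV pair into a block; report whether it fit.
    static boolean tryWrite(ByteBuffer block, byte[] kv) {
        try {
            block.put(kv);
            return true;
        } catch (BufferOverflowException e) {
            // ByteBuffer.put transfers nothing when the array is too large.
            return false;
        }
    }

    public static void main(String[] args) {
        List<ByteBuffer> blocks = allocateBlocks(1024 * 1024, 4); // 4 x 256 KB
        byte[] kv = new byte[300 * 1024]; // assumed oversized record
        for (ByteBuffer block : blocks) {
            System.out.println(tryWrite(block, kv)); // false for every block
        }
    }
}
```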



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
