[ 
https://issues.apache.org/jira/browse/TEZ-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875368#comment-17875368
 ] 

Chenyu Zheng commented on TEZ-4542:
-----------------------------------

[~glapark] [~abstractdog] 

If we revert this patch, we may still hit this problem. Consider an extreme case 
where one particular record is very large and the other 
records are normal. With the code below, metasize will still be small. I think 
we may need to delete the metasize optimization code.
{code:java}
if(capacity < (metasize+dataSize)) {
  // try to allocate less meta space, because we have sample data
  metasize = METASIZE*(capacity/(perItem+METASIZE));
} {code}
 We can delete this code, even though it may waste more memory. Or we can set a 
minimum value for metasize.
 
[~rbalamohan] Can you give us some advice?
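
To make the extreme case concrete, here is a minimal sketch of the heuristic quoted above, next to a hypothetical variant that enforces a minimum metasize. The class name, the METASIZE value of 16 bytes, the minItems floor, and all the numbers in main are assumptions chosen for illustration; this is not the actual PipelinedSorter code.

```java
// Illustrative sketch only; not the real PipelinedSorter.SortSpan code.
final class MetaSizeSketch {
    // Assumption: 16 bytes of metadata per record.
    static final int METASIZE = 16;

    // The heuristic from the comment, written with long arithmetic.
    static long metaSize(long capacity, long maxItems, long perItem) {
        long metasize = METASIZE * maxItems;
        long dataSize = maxItems * perItem;
        if (capacity < (metasize + dataSize)) {
            // try to allocate less meta space, because we have sample data
            metasize = METASIZE * (capacity / (perItem + METASIZE));
        }
        return metasize;
    }

    // Hypothetical variant: never reserve metadata for fewer than minItems
    // records (capped at the buffer capacity).
    static long metaSizeWithFloor(long capacity, long maxItems, long perItem, long minItems) {
        long floor = Math.min(METASIZE * minItems, capacity);
        return Math.max(metaSize(capacity, maxItems, perItem), floor);
    }

    public static void main(String[] args) {
        long capacity = 64L * 1024 * 1024; // 64 MiB buffer (illustrative)
        long maxItems = 1024L * 1024;      // assumed item limit
        long perItem = 32L * 1024 * 1024;  // sampled size skewed by one huge record

        // Only one record's worth of metadata is reserved.
        System.out.println(metaSize(capacity, maxItems, perItem));
        // With a floor of 1024 records, more meta space is kept.
        System.out.println(metaSizeWithFloor(capacity, maxItems, perItem, 1024));
    }
}
```

With these assumed numbers the plain heuristic reserves metadata for a single record, so the normal-sized records that follow have almost no meta space. The floor trades some wasted memory for a guaranteed minimum.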

> Tez application may fail due to int overflow when record size is large and 
> sort memory is low.
> ----------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4542
>                 URL: https://issues.apache.org/jira/browse/TEZ-4542
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Major
>             Fix For: 0.10.4
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> A Tez application failed, and we found this error stack:
> {code:java}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
>   ... 18 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException
>   at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
>   ... 19 more
> Caused by: java.lang.IllegalArgumentException
>   at java.nio.Buffer.position(Buffer.java:244)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(PipelinedSorter.java:379)
>   at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1.write(OrderedPartitionedKVOutput.java:167)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor$TezKVOutputCollector.collect(TezProcessor.java:204)
>   at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.collect(ReduceSinkOperator.java:541)
>   at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:385)
>   ... 28 more {code}
> After adding debug logging, the problem is easy to find: the variable 
> `dataSize` in {{PipelinedSorter::SortSpan}} overflows. 
> The problem is triggered when the following two conditions are met at the 
> same time:
>  * The vertex has too many inputs/outputs, so the memory allocated to each I/O for 
> sorting is too small.
>  * The average record size is larger than 2K, so `dataSize` in 
> {{PipelinedSorter::SortSpan}} overflows and the sorter does not 
> try to allocate less meta space. An exception is then raised.
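
The two conditions above can be reproduced in isolation. The following is a minimal sketch, assuming the sizes were computed with 32-bit int arithmetic, an item limit of 1024 * 1024 (which is consistent with the 2K threshold mentioned above), and 16 bytes of metadata per record. The class and method names are hypothetical and are not taken from the Tez source.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; names and constants are assumptions.
final class DataSizeOverflowSketch {
    static final int METASIZE = 16; // assumed bytes of metadata per record

    // Same shape as the quoted heuristic, but with int arithmetic throughout.
    static int metaSizeWithIntMath(int capacity, int maxItems, int perItem) {
        int metasize = METASIZE * maxItems;
        int dataSize = maxItems * perItem; // wraps silently past Integer.MAX_VALUE
        if (capacity < (metasize + dataSize)) {
            metasize = METASIZE * (capacity / (perItem + METASIZE));
        }
        return metasize;
    }

    // True if the int product would overflow.
    static boolean overflows(int maxItems, int perItem) {
        try {
            Math.multiplyExact(maxItems, perItem);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    // True if positioning a buffer of the given capacity at newPosition fails.
    static boolean positionThrows(int capacity, int newPosition) {
        ByteBuffer buf = ByteBuffer.allocate(capacity);
        try {
            buf.position(newPosition);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int capacity = 8 * 1024 * 1024; // small per-I/O sort buffer (illustrative)
        int maxItems = 1024 * 1024;     // assumed item limit
        int perItem = 2048;             // 2K average record size

        System.out.println(overflows(maxItems, perItem));
        int metasize = metaSizeWithIntMath(capacity, maxItems, perItem);
        // dataSize wrapped to a negative value, so metasize was not reduced
        // and is now larger than the buffer itself.
        System.out.println(metasize > capacity);
        System.out.println(positionThrows(capacity, metasize));
    }
}
```

Under these assumptions the wrapped `dataSize` makes the capacity check false, the meta space stays at its full size, and positioning the buffer past its limit raises the IllegalArgumentException seen at the bottom of the stack trace. Computing the product as a long would avoid the wraparound.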



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
