Chenyu Zheng created TEZ-4542:
---------------------------------
Summary: Tez application may fail due to int overflow when record
size is large and sort memory is low.
Key: TEZ-4542
URL: https://issues.apache.org/jira/browse/TEZ-4542
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.9.2
Reporter: Chenyu Zheng
Assignee: Chenyu Zheng
The Tez application failed with the following error stack:
{code:java}
at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.IllegalArgumentException
at
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:402)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:907)
at
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:643)
at
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genAllOneUniqueJoinObject(CommonJoinOperator.java:675)
at
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:753)
at
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:314)
at
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:277)
at
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:270)
at
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:256)
at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
... 19 more
Caused by: java.lang.IllegalArgumentException
at java.nio.Buffer.position(Buffer.java:244)
at
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter$SortSpan.<init>(PipelinedSorter.java:936)
at
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.sort(PipelinedSorter.java:350)
at
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:406)
at
org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(PipelinedSorter.java:379)
at
org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1.write(OrderedPartitionedKVOutput.java:167)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor$TezKVOutputCollector.collect(TezProcessor.java:204)
at
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.collect(ReduceSinkOperator.java:541)
at
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:385)
... 28 more {code}
After adding debug logging, the cause is easy to find: the variable
{{dataSize}} in {{PipelinedSorter::SortSpan}} overflows.
This problem is triggered when the following two conditions are met at the
same time:
* The vertex has too many inputs/outputs, so the memory allocated to each I/O
for sorting is too small.
* The average record size is larger than 2K, so {{dataSize}} in
{{PipelinedSorter::SortSpan}} overflows. Because of the overflow, the code does
not try to allocate less meta space, and the exception is raised.
Solution: change the type of {{dataSize}} to long.
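
The overflow and the proposed fix can be illustrated with a minimal standalone sketch. This is not the Tez source: the class name, the method names, the metadata size of 16 bytes per record, the item count of 1 << 20 and the 1 MiB buffer are all assumptions chosen so that a 2K record size makes the int product wrap. It only shows how a wrapped {{dataSize}} can skip the "allocate less meta space" branch and lead to the IllegalArgumentException from Buffer.position.

```java
import java.nio.ByteBuffer;

public class DataSizeOverflowDemo {
    // Assumed per-record metadata size in bytes (illustrative value only).
    static final int META_SIZE = 16;

    // int arithmetic: maxItems * perItem can wrap around to a negative value.
    static int metaSizeWithIntMath(int capacity, int maxItems, int perItem) {
        int metaSize = META_SIZE * maxItems;
        int dataSize = maxItems * perItem; // wraps once the product exceeds Integer.MAX_VALUE
        if (capacity < metaSize + dataSize) {
            // Shrink the metadata area so metadata and data fit in the buffer.
            metaSize = META_SIZE * (capacity / (perItem + META_SIZE));
        }
        return metaSize;
    }

    // long arithmetic: the product is computed in 64 bits, so the check works.
    static int metaSizeWithLongMath(int capacity, int maxItems, int perItem) {
        int metaSize = META_SIZE * maxItems;
        long dataSize = (long) maxItems * perItem;
        if (capacity < metaSize + dataSize) {
            metaSize = META_SIZE * (capacity / (perItem + META_SIZE));
        }
        return metaSize;
    }

    // Returns false if positioning the buffer at metaSize throws
    // IllegalArgumentException, i.e. metaSize is beyond the buffer limit.
    static boolean canPosition(int capacity, int metaSize) {
        ByteBuffer buffer = ByteBuffer.allocate(capacity);
        try {
            buffer.position(metaSize);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int capacity = 1 << 20; // small sort buffer: 1 MiB (assumed)
        int maxItems = 1 << 20; // assumed item count
        int perItem = 2048;     // 2K average record size

        int broken = metaSizeWithIntMath(capacity, maxItems, perItem);
        int fixed = metaSizeWithLongMath(capacity, maxItems, perItem);

        System.out.println("int math:  metaSize=" + broken
                + ", fits=" + canPosition(capacity, broken));
        System.out.println("long math: metaSize=" + fixed
                + ", fits=" + canPosition(capacity, fixed));
    }
}
```

With these assumed values the int product 2^20 * 2048 wraps to a negative number, the capacity check passes incorrectly, and the metadata size stays larger than the buffer. With long arithmetic the check fails as intended and the metadata size is reduced to fit.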
--
This message was sent by Atlassian Jira
(v8.20.10#820010)