Re: Review Request 30739: HIVE-9574 Lazy computing in HiveBaseFunctionResultList may hurt performance [Spark Branch]

Xuefu Zhang Sun, 08 Feb 2015 01:20:07 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30739/#review71546
-----------------------------------------------------------

High-level comments:
1. File name convention needs to take care of multiple threads doing the same 
thing in a single JVM.
2. Avoid creating Tuple2 objects. There is no need to have this object, but we 
create them just to wrap key/value, likely increase GC pressure.
3. Disk writing can be more performing. We should be able to put keys/values as 
bytes in a large byte[] (as the buffer). The we need to spill, we write the 
whole byte[] to disk in one shot.
4. Related to #3, but I'm wondering why we need kryo at all.
5. We need to take care of file closing in case of exceptions. The call usually 
goes into a finally block.
6. It's better to allocate the buffer in the beginning and fail right way in 
case of OOM. We don't want the job to fail when only the last row needs to 
spill and incur OOM then.

ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java
<https://reviews.apache.org/r/30739/#comment117291>

    private final?

- Xuefu Zhang

On Feb. 7, 2015, 3:09 a.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30739/
> -----------------------------------------------------------
> 
> (Updated Feb. 7, 2015, 3:09 a.m.)
> 
> 
> Review request for hive, Rui Li and Xuefu Zhang.
> 
> 
> Bugs: HIVE-9574
>     https://issues.apache.org/jira/browse/HIVE-9574
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Result KV cache doesn't use RowContainer any more since it has logic we don't 
> need, which is some overhead. We don't do lazy computing right away, instead 
> we wait a little till the cache is close to spill.
> 
> 
> Diffs
> -----
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java
>  78ab680 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java 
> 8ead0cb 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveMapFunction.java 
> 7a09b4d 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveMapFunctionResultList.java
>  e92e299 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 
> 070ea4d 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java
>  d4ff37c 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/KryoSerializer.java 
> 286816b 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 
> 0df4598 
> 
> Diff: https://reviews.apache.org/r/30739/diff/
> 
> 
> Testing
> -------
> 
> Unit test, test on cluster
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>

Re: Review Request 30739: HIVE-9574 Lazy computing in HiveBaseFunctionResultList may hurt performance [Spark Branch]

Reply via email to