Apologies if this has already been addressed or discussed; my search of the
JIRAs and the mailing list did not turn up anything on this topic. Pointers
welcome.

I have been experimenting a little with vectorized execution in Hive 0.13
and found that group-by is very slow on string columns. The simple query
below is 13x slower when vectorization is enabled (c_customer_id is a string
column). I don't see this problem with int types.

select c_customer_id from customer group by c_customer_id limit 10;


An hprof profile of the mapper shows that hash-map lookups and inserts
(HashMap.getEntry and HashMap.put, about 58% of CPU samples combined)
dominate execution time.
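
One way a profile can end up dominated by HashMap.getEntry and HashMap.put is
a key type whose hashCode() sends many distinct keys to the same bucket.
Whether that is what happens here is an assumption, not something the profile
proves. The sketch below is not Hive code: BytesKey, syntheticId and the key
format are hypothetical names made up for illustration. It only shows how a
weak hash on byte-array keys collapses distinct strings into one hash value
while the group-by result stays the same.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration only; this is NOT Hive's key wrapper.
class StringKeyHashSketch {

    // Hypothetical group-by key that wraps the raw bytes of a string value.
    static final class BytesKey {
        private final byte[] bytes;
        private final boolean weakHash;

        BytesKey(String value, boolean weakHash) {
            this.bytes = value.getBytes(StandardCharsets.UTF_8);
            this.weakHash = weakHash;
        }

        @Override
        public int hashCode() {
            // Weak variant: hash on length only, so all equal-length keys collide.
            // Strong variant: hash over every byte of the key.
            return weakHash ? bytes.length : Arrays.hashCode(bytes);
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof BytesKey
                && Arrays.equals(bytes, ((BytesKey) other).bytes);
        }
    }

    // Made-up 16-character ids; every id has the same length.
    static String syntheticId(int i) {
        return String.format("AAAAAAAA%08d", i);
    }

    // Number of distinct hashCode() values produced by n distinct keys.
    static int distinctHashes(int n, boolean weakHash) {
        Set<Integer> hashes = new HashSet<>();
        for (int i = 0; i < n; i++) {
            hashes.add(new BytesKey(syntheticId(i), weakHash).hashCode());
        }
        return hashes.size();
    }

    // Hash-based group-by over n distinct keys; returns the number of groups.
    static int groupBy(int n, boolean weakHash) {
        Map<BytesKey, Integer> groups = new HashMap<>();
        for (int i = 0; i < n; i++) {
            groups.merge(new BytesKey(syntheticId(i), weakHash), 1, Integer::sum);
        }
        return groups.size();
    }

    public static void main(String[] args) {
        int n = 5000;
        for (boolean weak : new boolean[] {false, true}) {
            long start = System.nanoTime();
            int groups = groupBy(n, weak);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("weakHash=" + weak + " groups=" + groups
                + " distinctHashes=" + distinctHashes(n, weak)
                + " elapsedMicros=" + micros);
        }
    }
}
```

Both variants produce the same groups; only the number of distinct hash
values, and therefore the time spent walking collided buckets inside
get/put, differs. Timings from main are machine-dependent and are meant only
as a rough comparison.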

Thanks
Siva

CPU SAMPLES BEGIN (total = 95560) Sun Mar 29 21:07:21 2015
rank   self  accum   count trace method
   1 32.09% 32.09%   30664 301072 java.util.HashMap.getEntry
   2 25.75% 57.84%   24604 301041 java.util.HashMap.put
   3 18.62% 76.45%   17791 301633 java.io.FileOutputStream.writeBytes
   4  5.67% 82.12%    5416 300917 java.net.SocketInputStream.socketRead0
   5  4.78% 86.90%    4568 300674 java.io.FileInputStream.available
   6  1.51% 88.42%    1447 301610 org.apache.hadoop.util.LexicographicalComparerHolder$UnsafeComparer.compareTo

TRACE 301072:
        java.util.HashMap.getEntry(HashMap.java:467)
        java.util.HashMap.get(HashMap.java:417)
        org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeHashAggregate.prepareBatchAggregationBufferSets(VectorGroupByOperator.java:353)
        org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeHashAggregate.processBatch(VectorGroupByOperator.java:292)
