Apologies if this has already been addressed or discussed. I searched the
JIRAs and the mailing list and did not find anything on this topic. Pointers
welcome.
I have been experimenting a little with vectorized execution in Hive 0.13
and found that group-by is very slow on string columns. The simple query
below is 13x slower when vectorization is enabled (c_customer_id is a
string). I don't see this problem with int types.
select c_customer_id from customer group by c_customer_id limit 10;
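For anyone who wants to reproduce the comparison, here is a minimal sketch of the two runs. It assumes the customer table is stored as ORC (vectorized execution in 0.13 only applies to ORC data) and that timings are simply read off the CLI output. The explain step is only a sanity check that the plan is vectorized. The exact wording in the plan output may vary by version.

```sql
-- Baseline: row-at-a-time execution
set hive.vectorized.execution.enabled=false;
select c_customer_id from customer group by c_customer_id limit 10;

-- Same query with vectorized execution turned on
set hive.vectorized.execution.enabled=true;
select c_customer_id from customer group by c_customer_id limit 10;

-- Sanity check: inspect the plan to confirm the map side is vectorized
explain select c_customer_id from customer group by c_customer_id limit 10;
```

Repeating the same pair of runs with an int column as the group-by key should show whether the slowdown is specific to string keys.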
An hprof profile of the mapper suggests that hashing of the keys dominates
execution time (top CPU samples and the hottest trace are below).
Thanks
Siva
CPU SAMPLES BEGIN (total = 95560) Sun Mar 29 21:07:21 2015
rank   self  accum   count trace method
   1 32.09% 32.09%   30664 301072 java.util.HashMap.getEntry
   2 25.75% 57.84%   24604 301041 java.util.HashMap.put
   3 18.62% 76.45%   17791 301633 java.io.FileOutputStream.writeBytes
   4  5.67% 82.12%    5416 300917 java.net.SocketInputStream.socketRead0
   5  4.78% 86.90%    4568 300674 java.io.FileInputStream.available
   6  1.51% 88.42%    1447 301610 org.apache.hadoop.util.LexicographicalComparerHolder$UnsafeComparer.compareTo
TRACE 301072:
    java.util.HashMap.getEntry(HashMap.java:467)
    java.util.HashMap.get(HashMap.java:417)
    org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeHashAggregate.prepareBatchAggregationBufferSets(VectorGroupByOperator.java:353)
    org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeHashAggregate.processBatch(VectorGroupByOperator.java:292)