[ https://issues.apache.org/jira/browse/HIVE-17220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108110#comment-16108110 ]
Prasanth Jayachandran commented on HIVE-17220:
----------------------------------------------

Although bloom-1 is fast in microbenchmarks (2-5x faster, as there is only 1 memory access), there is around a 2% increase in fpp (false positive probability). This will let more rows pass through the bloom filter, negating the performance gain. An alternative approach is to increase the stride size for hash mapping to more than 1 long. Will update the patch shortly with a bloom-k implementation.

> Bloomfilter probing in semijoin reduction is thrashing L1 dcache
> ----------------------------------------------------------------
>
>                 Key: HIVE-17220
>                 URL: https://issues.apache.org/jira/browse/HIVE-17220
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-17220.WIP.patch
>
>
> [~gopalv] observed perf profiles showing bloomfilter probes as a bottleneck for
> some of the TPC-DS queries, resulting in L1 data cache thrashing.
> This is because the huge bitset in the bloom filter doesn't fit in any
> level of cache; also, the hash bits corresponding to a single key map to
> different segments of the bitset, which are spread out. This can result in K-1
> memory accesses (K being the number of hash functions) in the worst case for
> every key that gets probed, because of locality misses in the L1 cache.
> Ran a JMH microbenchmark to verify the same.
> Following is the JMH perf profile for bloom filter probing:
> {code}
> Perf stats:
> --------------------------------------------------
>        5101.935637      task-clock (msec)         #    0.461 CPUs utilized
>                346      context-switches          #    0.068 K/sec
>                336      cpu-migrations            #    0.066 K/sec
>              6,207      page-faults               #    0.001 M/sec
>     10,016,486,301      cycles                    #    1.963 GHz                      (26.90%)
>      5,751,692,176      stalled-cycles-frontend   #   57.42% frontend cycles idle     (27.05%)
>    <not supported>      stalled-cycles-backend
>     14,359,914,397      instructions              #    1.43  insns per cycle
>                                                   #    0.40  stalled cycles per insn  (33.78%)
>      2,200,632,861      branches                  #  431.333 M/sec                    (33.84%)
>          1,162,860      branch-misses             #    0.05% of all branches          (33.97%)
>      1,025,992,254      L1-dcache-loads           #  201.099 M/sec                    (26.56%)
>        432,663,098      L1-dcache-load-misses     #   42.17% of all L1-dcache hits    (14.49%)
>        331,383,297      LLC-loads                 #   64.952 M/sec                    (14.47%)
>            203,524      LLC-load-misses           #    0.06% of all LL-cache hits     (21.67%)
>    <not supported>      L1-icache-loads
>          1,633,821      L1-icache-load-misses     #    0.320 M/sec                    (28.85%)
>        950,368,796      dTLB-loads                #  186.276 M/sec                    (28.61%)
>        246,813,393      dTLB-load-misses          #   25.97% of all dTLB cache hits   (14.53%)
>             25,451      iTLB-loads                #    0.005 M/sec                    (14.48%)
>             35,415      iTLB-load-misses          #  139.15% of all iTLB cache hits   (21.73%)
>    <not supported>      L1-dcache-prefetches
>            175,958      L1-dcache-prefetch-misses #    0.034 M/sec                    (28.94%)
>
>       11.064783140 seconds time elapsed
> {code}
> This shows an L1 data cache load miss rate of 42.17% (432,663,098 misses out
> of 1,025,992,254 loads).
> This JIRA is to use a cache-efficient bloom filter for semijoin probing.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
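
For illustration only, the idea of keeping all K hash bits of a key within a stride of more than 1 long could be sketched roughly as below. This is not the code from HIVE-17220.WIP.patch or from Hive's bloom filter. The class name BlockedBloomSketch, the block size of 8 longs (64 bytes, assumed here to match a typical cache line), and the hash mixer are all assumptions made for the example. One hash picks a block, and the K bit positions are then derived inside that block only, so a probe should touch a single small region of the bitset instead of K scattered ones.

```java
/** Hypothetical sketch of a cache-blocked bloom filter; not Hive's implementation. */
public class BlockedBloomSketch {
    // Assumption: 8 longs = 64 bytes, a common cache line size.
    private static final int BLOCK_LONGS = 8;
    private static final int BLOCK_BITS = BLOCK_LONGS * Long.SIZE; // 512 bits per block

    private final long[] bits;
    private final int numBlocks;
    private final int k; // number of bits set/probed per key

    public BlockedBloomSketch(int numBlocks, int k) {
        this.numBlocks = numBlocks;
        this.k = k;
        this.bits = new long[numBlocks * BLOCK_LONGS];
    }

    // Stand-in 64-bit mixer (MurmurHash3 finalizer); any well-distributed hash would do.
    static long mix(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    public void add(long key) {
        long h = mix(key);
        long g = mix(h ^ 0x9e3779b97f4a7c15L);
        // One hash selects the block; every bit for this key stays inside it.
        int base = Math.floorMod((int) (h >>> 32), numBlocks) * BLOCK_LONGS;
        int a = (int) g;
        int step = (int) (g >>> 32) | 1; // odd step, so positions differ within the block
        for (int i = 0; i < k; i++) {
            int pos = (a + i * step) & (BLOCK_BITS - 1);
            bits[base + (pos >>> 6)] |= 1L << (pos & 63);
        }
    }

    public boolean mightContain(long key) {
        long h = mix(key);
        long g = mix(h ^ 0x9e3779b97f4a7c15L);
        int base = Math.floorMod((int) (h >>> 32), numBlocks) * BLOCK_LONGS;
        int a = (int) g;
        int step = (int) (g >>> 32) | 1;
        for (int i = 0; i < k; i++) {
            int pos = (a + i * step) & (BLOCK_BITS - 1);
            if ((bits[base + (pos >>> 6)] & (1L << (pos & 63))) == 0) {
                return false; // a required bit is unset: key was definitely never added
            }
        }
        return true; // all k bits set: key was probably added
    }
}
```

Setting BLOCK_LONGS to 1 would correspond to the single-long (bloom-1 style) layout discussed in the comment. A wider block trades a little locality for a lower false positive probability, and the right size would have to be confirmed with the JMH benchmark.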