Subject: Successful Implementation of Bucket Map Join Hi,
I hope this message finds you well. I wanted to express my gratitude for the detailed instructions you provided on setting up the Bucket Map Join. Your guidance proved to be extremely helpful and, following your steps, I am pleased to confirm that I have successfully implemented the Bucket Map Join. Once again, I'd like to express my sincerest thanks for your help. It's greatly appreciated. Best Regards, [smartli] Okumin <m...@okumin.com>于2023年7月2日 周日00:43写道: > Hi, > > I understand you are trying MapReduce! I recommend you use Tez unless you > have special reasons. Tez is the recommended engine and I guess more > community members use Hive 3 on Tez. It means you are more likely to get > answers when you encounter trouble. > > Quickly, I succeeded in enabling Bucket Map Join with Hive on MR in the > following settings. Looks like, Bucket Map Join on MR requires the original > plan to have Map Join, so I added a hint with > `hive.ignore.mapjoin.hint=false`. I added `hive.cbo.enable=false` with my > esper. I'm not sure why my environment needed it. > > ``` > > $ beeline -e "explain extended select /*+ MAPJOIN(b) */ * from map_join_tb > a join map_join_tb b on a.id=b.id;" --hiveconf > hive.optimize.bucketmapjoin=true --hiveconf hive.enforce.bucketing=true > --hiveconf hive.ignore.mapjoin.hint=false --hiveconf hive.cbo.enable=false > > ... > > | Map Operator Tree: | > > | TableScan | > > | alias: a | > > ... > > | Map Join Operator | > > ... > > | BucketMapJoin: true > ``` > > Regards, > Okumin > > On Mon, Jun 26, 2023 at 8:30 AM smart li <smartli666...@gmail.com> wrote: > >> Hello, >> >> First of all, I would like to express my gratitude for your responses and >> assistance. I’m currently encountering a scenario where my Hive is not >> choosing BucketMapJoin, and I wonder whether this is due to its underlying >> execution engine, which is MapReduce. >> >> In addition, I am operating in a test environment that uses three VMware >> virtual machines. The memory allocations for these machines are 4GB, 2GB, >> and 2GB, respectively. Of these, the 4GB virtual machine has been assigned >> two CPU cores, while the other two have a single CPU core each. I am unsure >> whether these factors could influence Hive’s decision to not select >> BucketMapJoin. >> >> Any guidance or insight on this matter would be greatly appreciated. >> >> Thanks >> >> Okumin <m...@okumin.com>于2023年6月26日 周一02:23写道: >> >>> Hi smart li, >>> >>> As far as I tried with Hive 3.1.2 on Tez, Bucket Map Join was probably >>> triggered. My configurations could be different from yours, though. >>> >>> # How I tested >>> >>> ## hive-site.xml >>> >>> >>> https://github.com/zookage/zookage/blob/v0.2.3/kubernetes/base/common/config/hive/hive-site.xml >>> >>> ## Prepare test data >>> >>> ``` >>> >>> zookage@client-node-0:~$ cat generate.sh >>> >>> #!/bin/bash >>> >>> >>> for i in {1..8000000}; do >>> >>> echo $i >>> >>> done >>> >>> zookage@client-node-0:~$ bash generate.sh > /tmp/ids.csv >>> >>> zookage@client-node-0:~$ hdfs dfs -copyFromLocal /tmp/ids.csv >>> /tmp/ids.csv >>> >>> zookage@client-node-0:~$ hdfs dfs -tail hdfs:///tmp/ids.csv | tail -n 3 >>> >>> 7999998 >>> >>> 7999999 >>> >>> 8000000 >>> >>> zookage@client-node-0:~$ >>> ``` >>> >>> ## Load test data >>> >>> ``` >>> >>> zookage@client-node-0:~$ beeline -e " >>> >>> > create table map_join_tb(id int) >>> >>> > clustered by (id) into 32 buckets; >>> >>> > load data inpath '/tmp/ids.csv' into table map_join_tb; >>> >>> > " >>> >>> Connecting to >>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage >>> >>> Connected to: Apache Hive (version 3.1.2) >>> >>> ... >>> >>> zookage@client-node-0:~$ beeline -e "select min(id), max(id), count(*) >>> from map_join_tb" >>> >>> Connecting to >>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage >>> >>> ... >>> >>> +------+----------+----------+ >>> >>> | _c0 | _c1 | _c2 | >>> >>> +------+----------+----------+ >>> >>> | 1 | 8000000 | 8000000 | >>> >>> +------+----------+----------+ >>> >>> zookage@client-node-0:~$ hdfs dfs -ls /user/hive/warehouse/map_join_tb >>> >>> Found 32 items >>> >>> -rw-r--r-- 3 zookage hive 1965281 2023-06-25 16:55 >>> /user/hive/warehouse/map_join_tb/000000_0 >>> >>> -rw-r--r-- 3 zookage hive 1965275 2023-06-25 16:55 >>> /user/hive/warehouse/map_join_tb/000001_0 >>> >>> -rw-r--r-- 3 zookage hive 1965275 2023-06-25 16:55 >>> /user/hive/warehouse/map_join_tb/000002_0 >>> >>> ... >>> >>> -rw-r--r-- 3 zookage hive 1965279 2023-06-25 16:55 >>> /user/hive/warehouse/map_join_tb/000030_0 >>> >>> -rw-r--r-- 3 zookage hive 1965279 2023-06-25 16:55 >>> /user/hive/warehouse/map_join_tb/000031_0 >>> >>> zookage@client-node-0:~$ >>> ``` >>> # Submit a join query >>> >>> Bucket Map Join is chosen. If I remember correctly, we don't need >>> `hive.optimize.bucketmapjoin` or `hive.enforce.bucketmapjoin` for Tez >>> because `hive.convert.join.bucket.mapjoin.tez` is for Tez and it is true by >>> default. >>> >>> ``` >>> >>> zookage@client-node-0:~$ beeline -e "explain select * from map_join_tb >>> a join map_join_tb b on a.id=b.id;" >>> >>> Connecting to >>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage >>> >>> ... >>> >>> | Stage-1 | >>> >>> | Map 1 vectorized | >>> >>> | File Output Operator [FS_23] | >>> >>> | Map Join Operator [MAPJOIN_22] (rows=8800000 width=6) | >>> >>> | >>> BucketMapJoin:true,Conds:SEL_21._col0=RS_19._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1"] >>> | >>> >>> ... >>> >>> >>> ``` >>> >>> I guess some preconditions are different. You may also see the logs >>> around ConvertJoinMapJoin on HiveServer 2 but it is sometimes misleading in >>> my experience. >>> >>> - You applied some patches to your Hive? >>> - You use different configurations? >>> - Stats could be invalid? >>> >>> By the way, on my machine, Hive 3.1.2 with Bucket Map Join generated >>> incorrect rows. I have not taken a look at the reason(and it might be >>> possible that the root cause exists in my environment). >>> >>> ``` >>> >>> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), >>> count(*) from map_join_tb a join map_join_tb b on a.id=b.id;" >>> >>> Connecting to >>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage >>> >>> ... >>> >>> >>> ---------------------------------------------------------------------------------------------- >>> VERTICES MODE STATUS TOTAL COMPLETED RUNNING >>> PENDING FAILED KILLED >>> >>> ---------------------------------------------------------------------------------------------- >>> Map 3 .......... container SUCCEEDED 3 3 0 >>> 0 0 0 >>> Map 1 .......... container SUCCEEDED 32 32 0 >>> 0 0 0 >>> Reducer 2 ...... container SUCCEEDED 1 1 0 >>> 0 0 0 >>> >>> ---------------------------------------------------------------------------------------------- >>> VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: >>> 17.36 s >>> >>> ---------------------------------------------------------------------------------------------- >>> ... >>> +------+----------+---------+ >>> | _c0 | _c1 | _c2 | >>> +------+----------+---------+ >>> | 46 | 7999987 | 249776 | >>> +------+----------+---------+ >>> >>> ... >>> >>> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id), >>> count(*) from map_join_tb a join map_join_tb b on a.id=b.id;" >>> --hiveconf hive.convert.join.bucket.mapjoin.tez=false >>> >>> Connecting to >>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage >>> >>> ... >>> >>> >>> ---------------------------------------------------------------------------------------------- >>> VERTICES MODE STATUS TOTAL COMPLETED RUNNING >>> PENDING FAILED KILLED >>> >>> ---------------------------------------------------------------------------------------------- >>> Map 1 .......... container SUCCEEDED 3 3 0 >>> 0 0 0 >>> Map 4 .......... container SUCCEEDED 3 3 0 >>> 0 0 0 >>> Reducer 2 ...... container SUCCEEDED 1 1 0 >>> 0 0 0 >>> Reducer 3 ...... container SUCCEEDED 1 1 0 >>> 0 0 0 >>> >>> ---------------------------------------------------------------------------------------------- >>> VERTICES: 04/04 [==========================>>] 100% ELAPSED TIME: >>> 29.79 s >>> >>> ---------------------------------------------------------------------------------------------- >>> ... >>> +------+----------+----------+ >>> | _c0 | _c1 | _c2 | >>> +------+----------+----------+ >>> | 1 | 8000000 | 8000000 | >>> +------+----------+----------+ >>> >>> ``` >>> >>> To be honest, I am not so familiar with this feature. Another expert >>> might give better insights to you. >>> >>> Thanks. >>> >>> >>> On Sun, Jun 25, 2023 at 8:43 PM smart li <smartli666...@gmail.com> >>> wrote: >>> >>>> Hello Hive Users, >>>> >>>> I’m currently trying to understand how Bucket Map Join works in Hive, >>>> but I’m encountering some issues that I need help with. Here’s what I did: >>>> >>>> Firstly, I created a Hive table using the following statement: >>>> >>>> create table map_join_tb( >>>> id int >>>> ) >>>> clustered by (id) into 32 buckets; >>>> >>>> Then, I inserted 8 million rows of data into the table, with the ‘id’ >>>> field ranging from 1 to 8 million. After the data was bucketed, each bucket >>>> was approximately 2MB in size. >>>> >>>> I then set the following bucket map join configurations: >>>> >>>> set hive.optimize.bucketmapjoin=true; >>>> set hive.enforce.bucketmapjoin=true; >>>> >>>> Lastly, I ran an EXPLAIN on the following SQL: >>>> >>>> explain select * from map_join_tb a join map_join_tb b on a.id=b.id; >>>> >>>> Surprisingly, it seems that a Reduce Join, not a Bucket Map Join, was >>>> being performed according to the plan. I’m not sure why this is happening. >>>> Under what conditions does Hive decide to perform a Bucket Map Join? >>>> >>>> For reference, I am using Hive version 3.1.2. Any help or insights into >>>> this would be greatly appreciated. Thank you very much in advance. >>>> >>>> Best Regards, >>>> [smartli] >>>> >>>