Re: Questions Regarding Bucket Map Join in Hive

smart li Wed, 12 Jul 2023 05:45:47 -0700

Subject: Successful Implementation of Bucket Map Join

Hi,


I hope this message finds you well.

I wanted to express my gratitude for the detailed instructions you provided
on setting up the Bucket Map Join. Your guidance proved to be extremely
helpful and, following your steps, I am pleased to confirm that I have
successfully implemented the Bucket Map Join.

Once again, I'd like to express my sincerest thanks for your help. It's
greatly appreciated.

Best Regards,
[smartli]

Okumin <m...@okumin.com>于2023年7月2日 周日00:43写道：

> Hi,
>
> I understand you are trying MapReduce! I recommend you use Tez unless you
> have special reasons. Tez is the recommended engine and I guess more
> community members use Hive 3 on Tez. It means you are more likely to get
> answers when you encounter trouble.
>
> Quickly, I succeeded in enabling Bucket Map Join with Hive on MR in the
> following settings. Looks like, Bucket Map Join on MR requires the original
> plan to have Map Join, so I added a hint with
> `hive.ignore.mapjoin.hint=false`. I added `hive.cbo.enable=false` with my
> esper. I'm not sure why my environment needed it.
>
> ```
>
> $ beeline -e "explain extended select /*+ MAPJOIN(b) */ * from map_join_tb
> a join map_join_tb b on a.id=b.id;" --hiveconf
> hive.optimize.bucketmapjoin=true --hiveconf hive.enforce.bucketing=true
> --hiveconf hive.ignore.mapjoin.hint=false --hiveconf hive.cbo.enable=false
>
> ...
>
> |       Map Operator Tree:                           |
>
> |           TableScan                                |
>
> |             alias: a                               |
>
> ...
>
> |               Map Join Operator                    |
>
> ...
>
> |                 BucketMapJoin: true
> ```
>
> Regards,
> Okumin
>
> On Mon, Jun 26, 2023 at 8:30 AM smart li <smartli666...@gmail.com> wrote:
>
>> Hello,
>>
>> First of all, I would like to express my gratitude for your responses and
>> assistance. I’m currently encountering a scenario where my Hive is not
>> choosing BucketMapJoin, and I wonder whether this is due to its underlying
>> execution engine, which is MapReduce.
>>
>> In addition, I am operating in a test environment that uses three VMware
>> virtual machines. The memory allocations for these machines are 4GB, 2GB,
>> and 2GB, respectively. Of these, the 4GB virtual machine has been assigned
>> two CPU cores, while the other two have a single CPU core each. I am unsure
>> whether these factors could influence Hive’s decision to not select
>> BucketMapJoin.
>>
>> Any guidance or insight on this matter would be greatly appreciated.
>>
>> Thanks
>>
>> Okumin <m...@okumin.com>于2023年6月26日 周一02:23写道：
>>
>>> Hi smart li,
>>>
>>> As far as I tried with Hive 3.1.2 on Tez, Bucket Map Join was probably
>>> triggered. My configurations could be different from yours, though.
>>>
>>> # How I tested
>>>
>>> ## hive-site.xml
>>>
>>>
>>> https://github.com/zookage/zookage/blob/v0.2.3/kubernetes/base/common/config/hive/hive-site.xml
>>>
>>> ## Prepare test data
>>>
>>> ```
>>>
>>> zookage@client-node-0:~$ cat generate.sh
>>>
>>> #!/bin/bash
>>>
>>>
>>> for i in {1..8000000}; do
>>>
>>>   echo $i
>>>
>>> done
>>>
>>> zookage@client-node-0:~$ bash generate.sh > /tmp/ids.csv
>>>
>>> zookage@client-node-0:~$ hdfs dfs -copyFromLocal /tmp/ids.csv
>>> /tmp/ids.csv
>>>
>>> zookage@client-node-0:~$ hdfs dfs -tail hdfs:///tmp/ids.csv | tail -n 3
>>>
>>> 7999998
>>>
>>> 7999999
>>>
>>> 8000000
>>>
>>> zookage@client-node-0:~$
>>> ```
>>>
>>> ## Load test data
>>>
>>> ```
>>>
>>> zookage@client-node-0:~$ beeline -e "
>>>
>>> > create table map_join_tb(id int)
>>>
>>> > clustered by (id) into 32 buckets;
>>>
>>> > load data inpath '/tmp/ids.csv' into table map_join_tb;
>>>
>>> > "
>>>
>>> Connecting to
>>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>>>
>>> Connected to: Apache Hive (version 3.1.2)
>>>
>>> ...
>>>
>>> zookage@client-node-0:~$ beeline -e "select min(id), max(id), count(*)
>>> from map_join_tb"
>>>
>>> Connecting to
>>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>>>
>>> ...
>>>
>>> +------+----------+----------+
>>>
>>> | _c0  |   _c1    |   _c2    |
>>>
>>> +------+----------+----------+
>>>
>>> | 1    | 8000000  | 8000000  |
>>>
>>> +------+----------+----------+
>>>
>>> zookage@client-node-0:~$ hdfs dfs -ls /user/hive/warehouse/map_join_tb
>>>
>>> Found 32 items
>>>
>>> -rw-r--r--   3 zookage hive    1965281 2023-06-25 16:55
>>> /user/hive/warehouse/map_join_tb/000000_0
>>>
>>> -rw-r--r--   3 zookage hive    1965275 2023-06-25 16:55
>>> /user/hive/warehouse/map_join_tb/000001_0
>>>
>>> -rw-r--r--   3 zookage hive    1965275 2023-06-25 16:55
>>> /user/hive/warehouse/map_join_tb/000002_0
>>>
>>> ...
>>>
>>> -rw-r--r--   3 zookage hive    1965279 2023-06-25 16:55
>>> /user/hive/warehouse/map_join_tb/000030_0
>>>
>>> -rw-r--r--   3 zookage hive    1965279 2023-06-25 16:55
>>> /user/hive/warehouse/map_join_tb/000031_0
>>>
>>> zookage@client-node-0:~$
>>> ```
>>> # Submit a join query
>>>
>>> Bucket Map Join is chosen. If I remember correctly, we don't need
>>> `hive.optimize.bucketmapjoin` or `hive.enforce.bucketmapjoin` for Tez
>>> because `hive.convert.join.bucket.mapjoin.tez` is for Tez and it is true by
>>> default.
>>>
>>> ```
>>>
>>> zookage@client-node-0:~$ beeline -e "explain select * from map_join_tb
>>> a join map_join_tb b on a.id=b.id;"
>>>
>>> Connecting to
>>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>>>
>>> ...
>>>
>>> |     Stage-1                                        |
>>>
>>> |       Map 1 vectorized                             |
>>>
>>> |       File Output Operator [FS_23]                 |
>>>
>>> |         Map Join Operator [MAPJOIN_22] (rows=8800000 width=6) |
>>>
>>> |           
>>> BucketMapJoin:true,Conds:SEL_21._col0=RS_19._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1"]
>>> |
>>>
>>> ...
>>>
>>>
>>> ```
>>>
>>> I guess some preconditions are different. You may also see the logs
>>> around ConvertJoinMapJoin on HiveServer 2 but it is sometimes misleading in
>>> my experience.
>>>
>>> - You applied some patches to your Hive?
>>> - You use different configurations?
>>> - Stats could be invalid?
>>>
>>> By the way, on my machine, Hive 3.1.2 with Bucket Map Join generated
>>> incorrect rows. I have not taken a look at the reason(and it might be
>>> possible that the root cause exists in my environment).
>>>
>>> ```
>>>
>>> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id),
>>> count(*) from map_join_tb a join map_join_tb b on a.id=b.id;"
>>>
>>> Connecting to
>>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>>>
>>> ...
>>>
>>>
>>> ----------------------------------------------------------------------------------------------
>>>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING
>>>  PENDING  FAILED  KILLED
>>>
>>> ----------------------------------------------------------------------------------------------
>>> Map 3 .......... container     SUCCEEDED      3          3        0
>>>    0       0       0
>>> Map 1 .......... container     SUCCEEDED     32         32        0
>>>    0       0       0
>>> Reducer 2 ...... container     SUCCEEDED      1          1        0
>>>    0       0       0
>>>
>>> ----------------------------------------------------------------------------------------------
>>> VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME:
>>> 17.36 s
>>>
>>> ----------------------------------------------------------------------------------------------
>>> ...
>>> +------+----------+---------+
>>> | _c0  |   _c1    |   _c2   |
>>> +------+----------+---------+
>>> | 46   | 7999987  | 249776  |
>>> +------+----------+---------+
>>>
>>> ...
>>>
>>> zookage@client-node-0:~$ beeline -e "select min(a.id), max(a.id),
>>> count(*) from map_join_tb a join map_join_tb b on a.id=b.id;"
>>> --hiveconf hive.convert.join.bucket.mapjoin.tez=false
>>>
>>> Connecting to
>>> jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
>>>
>>> ...
>>>
>>>
>>> ----------------------------------------------------------------------------------------------
>>>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING
>>>  PENDING  FAILED  KILLED
>>>
>>> ----------------------------------------------------------------------------------------------
>>> Map 1 .......... container     SUCCEEDED      3          3        0
>>>    0       0       0
>>> Map 4 .......... container     SUCCEEDED      3          3        0
>>>    0       0       0
>>> Reducer 2 ...... container     SUCCEEDED      1          1        0
>>>    0       0       0
>>> Reducer 3 ...... container     SUCCEEDED      1          1        0
>>>    0       0       0
>>>
>>> ----------------------------------------------------------------------------------------------
>>> VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME:
>>> 29.79 s
>>>
>>> ----------------------------------------------------------------------------------------------
>>> ...
>>> +------+----------+----------+
>>> | _c0  |   _c1    |   _c2    |
>>> +------+----------+----------+
>>> | 1    | 8000000  | 8000000  |
>>> +------+----------+----------+
>>>
>>> ```
>>>
>>> To be honest, I am not so familiar with this feature. Another expert
>>> might give better insights to you.
>>>
>>> Thanks.
>>>
>>>
>>> On Sun, Jun 25, 2023 at 8:43 PM smart li <smartli666...@gmail.com>
>>> wrote:
>>>
>>>> Hello Hive Users,
>>>>
>>>> I’m currently trying to understand how Bucket Map Join works in Hive,
>>>> but I’m encountering some issues that I need help with. Here’s what I did:
>>>>
>>>> Firstly, I created a Hive table using the following statement:
>>>>
>>>> create table map_join_tb(
>>>> id int
>>>> )
>>>> clustered by (id) into 32 buckets;
>>>>
>>>> Then, I inserted 8 million rows of data into the table, with the ‘id’
>>>> field ranging from 1 to 8 million. After the data was bucketed, each bucket
>>>> was approximately 2MB in size.
>>>>
>>>> I then set the following bucket map join configurations:
>>>>
>>>> set hive.optimize.bucketmapjoin=true;
>>>> set hive.enforce.bucketmapjoin=true;
>>>>
>>>> Lastly, I ran an EXPLAIN on the following SQL:
>>>>
>>>> explain select * from map_join_tb a join map_join_tb b on a.id=b.id;
>>>>
>>>> Surprisingly, it seems that a Reduce Join, not a Bucket Map Join, was
>>>> being performed according to the plan. I’m not sure why this is happening.
>>>> Under what conditions does Hive decide to perform a Bucket Map Join?
>>>>
>>>> For reference, I am using Hive version 3.1.2. Any help or insights into
>>>> this would be greatly appreciated. Thank you very much in advance.
>>>>
>>>> Best Regards,
>>>> [smartli]
>>>>
>>>

Re: Questions Regarding Bucket Map Join in Hive

Reply via email to