Re: Best way to avoid cross join

Nitin Pawar Wed, 19 Mar 2014 05:53:06 -0700

From the last line of that mail thread:

The correct fix would be to have 1 reducer in case of a Cartesian
product


The hack to avoid this was (1 = 1). I think that is now handled by the
hash partitioner, which sends everything to a single reducer.
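
To illustrate the hash partitioner point: below is a minimal Python sketch of the idea, not Hive's or Hadoop's actual code. The helper names (`reducer_for`, `rows_per_reducer`), the use of `zlib.crc32` as the hash, and the row counts are all assumptions made for the example. It only shows that when every row carries the same constant join key, as with a cross join or ON (1 = 1), a "hash of key modulo number of reducers" scheme has just one possible target, however many reducers are configured.

```python
# Toy model of a hash partitioner: reducer index = hash(join key) mod reducers.
# This only mimics the idea; it is not Hive's or Hadoop's implementation.
from collections import Counter
import zlib


def reducer_for(key: str, num_reducers: int) -> int:
    """Pick a reducer for a join key the way a hash partitioner would."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def rows_per_reducer(keys, num_reducers: int) -> Counter:
    """Count how many rows land on each reducer index."""
    return Counter(reducer_for(k, num_reducers) for k in keys)


NUM_REDUCERS = 50  # same value as mapred.reduce.tasks in the thread

# Cross join (or ON 1 = 1): every row carries the same constant join key.
constant_key = rows_per_reducer(["1"] * 10_000, NUM_REDUCERS)

# Join on a real column with many distinct values, e.g. an id.
distinct_keys = rows_per_reducer((str(i) for i in range(10_000)), NUM_REDUCERS)

print("reducers used with a constant key:", len(constant_key))
print("reducers used with distinct keys: ", len(distinct_keys))
```

With a constant key, all rows end up on one reducer index and the other configured reducers stay idle, which would match the "number of reducers: 1" behaviour reported further down in the thread.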

The other option (at least as I see it) is to use a Pig script. I have
never tried it, so you may want to give it a shot.



On Wed, Mar 19, 2014 at 6:00 PM, fab wol <darkwoll...@gmail.com> wrote:

> Hey Nitin,
>
> Yong wrote exactly the opposite in his first sentence:
>
> *Cross join doesn't mean Hive has to use one reduce.*
>
> and this super old thread here also leads me to assume that more than
> one reducer can be used:
>
> http://mail-archives.apache.org/mod_mbox/hive-user/200904.mbox/%3ca132f89f9b9df241b1c3c33e7299beb90486361...@sc-mbxc1.thefacebook.com%3E
>
> "full outer join" with "on 1 = 1" unfortunately also uses one reducer ...
>
> Cheers
> Wolli
>
>
>
> 2014-03-19 11:47 GMT+01:00 Nitin Pawar <nitinpawar...@gmail.com>:
>
>> hey Wolli,
>>
>> sorry missed this one.
>>
>> as Yong already replied, cross join always uses only one reducer.
>>
>> If you want to avoid this, can you try making it a full outer join
>> with the ON condition (1 = 1) and see if you get your desired result?
>>
>>
>> On Wed, Mar 19, 2014 at 4:05 PM, fab wol <darkwoll...@gmail.com> wrote:
>>
>>> anyone?
>>>
>>> still haven't solved this problem. Any help is appreciated.
>>>
>>> Cheers
>>> Wolli
>>>
>>>
>>> 2014-03-14 10:55 GMT+01:00 fab wol <darkwoll...@gmail.com>:
>>>
>>>> Hey Nitin,
>>>>
>>>> import1 has at least 1.2 million rows, with almost the same number of
>>>> distinct ids and approximately 40k distinct keywords. et_keywords contains
>>>> roughly 2000 keywords. So the result of this cross join will be ca. 2.4
>>>> billion rows which need to be checked (see the INSTR() function).
>>>>
>>>> Thx for looking into ...
>>>>
>>>> Cheers
>>>> Wolli
>>>>
>>>>
>>>> 2014-03-05 15:35 GMT+01:00 Nitin Pawar <nitinpawar...@gmail.com>:
>>>>
>>>>> Setting the number of reducers will normally not help unless there are
>>>>> that many distinct keys for the reducers. Even if it launches that many
>>>>> reducers, most of them may simply not get any data.
>>>>>
>>>>> Can you share how many different ids there are and what the data
>>>>> sizes are in rows?
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 7:57 PM, fab wol <darkwoll...@gmail.com> wrote:
>>>>>
>>>>>> hey Yong,
>>>>>>
>>>>>> Even without the group by (pure cross join) the query is only using
>>>>>> one reducer. Even specifying more reducers doesn't help:
>>>>>>
>>>>>> set mapred.reduce.tasks=50;
>>>>>> SELECT id1,
>>>>>>        m.keyword,
>>>>>>        prep_kw.keyword
>>>>>> FROM (SELECT id1, keyword FROM import1) m
>>>>>> CROSS JOIN
>>>>>>   (SELECT keyword FROM et_keywords) prep_kw;
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> Hadoop job information for Stage-1: number of mappers: 3; number of
>>>>>> reducers: 1
>>>>>>
>>>>>> What could be set up wrong here? Or can this ugly cross join be
>>>>>> avoided altogether? I mean, my original problem is actually something
>>>>>> else ;-)
>>>>>>
>>>>>> Cheers
>>>>>> Wolli
>>>>>>
>>>>>>
>>>>>> 2014-03-05 15:07 GMT+01:00 java8964 <java8...@hotmail.com>:
>>>>>>
>>>>>>> Hi, Wolli:
>>>>>>>
>>>>>>> Cross join doesn't mean Hive has to use one reduce.
>>>>>>>
>>>>>>> From a query point of view, the following cases will use one reducer:
>>>>>>>
>>>>>>> 1) ORDER BY in your query (instead of SORT BY)
>>>>>>> 2) Only one reducer group, which means all the data has to be sent to
>>>>>>> one reducer.
>>>>>>>
>>>>>>> In your case, the distinct count of id1 will be the reducer group count.
>>>>>>> Did you explicitly set the reducer count in your Hive session?
>>>>>>>
>>>>>>> Yong
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> Date: Wed, 5 Mar 2014 14:17:24 +0100
>>>>>>> Subject: Best way to avoid cross join
>>>>>>> From: darkwoll...@gmail.com
>>>>>>> To: user@hive.apache.org
>>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> before I write a lot of text, I'll just post something that is already
>>>>>>> written:
>>>>>>> http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx
>>>>>>>
>>>>>>> The first post addresses a problem very similar to mine.
>>>>>>> Currently my implementation looks like this:
>>>>>>>
>>>>>>> SELECT id1,
>>>>>>>   MAX(
>>>>>>>   CASE
>>>>>>>     WHEN m.keyword IS NULL
>>>>>>>     THEN 0
>>>>>>>     WHEN instr(m.keyword, prep_kw.keyword) > 0
>>>>>>>     THEN 1
>>>>>>>     ELSE 0
>>>>>>>   END) AS flag
>>>>>>> FROM (SELECT id1, keyword FROM import1) m
>>>>>>> CROSS JOIN
>>>>>>>   (SELECT keyword FROM et_keywords) prep_kw
>>>>>>> GROUP BY id1;
>>>>>>>
>>>>>>> Since there is a cross join involved, the execution gets pinned down
>>>>>>> to 1 reducer only and it takes ages to complete.
>>>>>>>
>>>>>>> The thread I posted solves this with some SQL Server-specific
>>>>>>> tactics. But I was wondering if anybody has already encountered the
>>>>>>> problem in Hive and found a better way to solve it.
>>>>>>>
>>>>>>> I'm using Hive 0.11 on a MapR Distribution, if this is somehow
>>>>>>> important.
>>>>>>>
>>>>>>> Cheers
>>>>>>> Wolli
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


-- 
Nitin Pawar
