Hey Nitin, Yong wrote exactly the opposite in his first sentence:
*Cross join doesn't mean Hive has to use one reduce.*

And this super old thread here also lets me assume that more than one reducer can be used:
http://mail-archives.apache.org/mod_mbox/hive-user/200904.mbox/%3ca132f89f9b9df241b1c3c33e7299beb90486361...@sc-mbxc1.thefacebook.com%3E

Unfortunately, the "full outer join" with "on 1 = 1" also uses one reducer ...

Cheers
Wolli

2014-03-19 11:47 GMT+01:00 Nitin Pawar <nitinpawar...@gmail.com>:

> hey Wolli,
>
> sorry, I missed this one.
>
> As Yong already replied, cross join always uses only one reducer.
>
> If you want to avoid this, can you try making it a full outer join with
> the on condition (1 = 1), and see if you get your desired result?
>
>
> On Wed, Mar 19, 2014 at 4:05 PM, fab wol <darkwoll...@gmail.com> wrote:
>
>> anyone?
>>
>> I still haven't solved this problem. Any help is appreciated.
>>
>> Cheers
>> Wolli
>>
>>
>> 2014-03-14 10:55 GMT+01:00 fab wol <darkwoll...@gmail.com>:
>>
>>> Hey Nitin,
>>>
>>> import1 has at least 1.2 million rows, with almost the same number of
>>> distinct ids and approximately 40k distinct keywords. et_keywords
>>> contains roughly 2000 keywords. So the result of this cross join will be
>>> ca. 2.4 billion rows which need to be checked (see the INSTR() function).
>>>
>>> Thanks for looking into it ...
>>>
>>> Cheers
>>> Wolli
>>>
>>>
>>> 2014-03-05 15:35 GMT+01:00 Nitin Pawar <nitinpawar...@gmail.com>:
>>>
>>>> Setting the number of reducers will normally not help unless there are
>>>> that many keys for the reducers. Even if it launches that many
>>>> reducers, it may just happen that most of them won't get any data.
>>>>
>>>> Can you share how many different ids there are and what the data sizes
>>>> are in rows?
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 7:57 PM, fab wol <darkwoll...@gmail.com> wrote:
>>>>
>>>>> hey Yong,
>>>>>
>>>>> Even without the group by (pure cross join), the query is only using
>>>>> one reducer.
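For reference, the full-outer-join variant Nitin suggests (and which Wolli reports still ends up on one reducer) would look roughly like this. This is only a sketch reconstructed from the thread, reusing Wolli's table and column names; whether Hive 0.11's parser even accepts a constant join condition like `(1 = 1)` depends on the version, since Hive historically only supports equi-joins:

```sql
-- Sketch of Nitin's suggested rewrite; names taken from the thread.
-- A constant-true join condition is planned much like a cross join,
-- which would be consistent with it also landing on a single reducer.
SELECT id1,
       m.keyword,
       prep_kw.keyword
FROM (SELECT id1, keyword FROM import1) m
FULL OUTER JOIN
     (SELECT keyword FROM et_keywords) prep_kw
ON (1 = 1);
```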
>>>>> Even specifying more reducers doesn't help:
>>>>>
>>>>> set mapred.reduce.tasks=50;
>>>>> SELECT id1,
>>>>>        m.keyword,
>>>>>        prep_kw.keyword
>>>>> FROM (SELECT id1, keyword FROM import1) m
>>>>> CROSS JOIN
>>>>>      (SELECT keyword FROM et_keywords) prep_kw;
>>>>>
>>>>> ...
>>>>>
>>>>> Hadoop job information for Stage-1: number of mappers: 3; number of
>>>>> reducers: 1
>>>>>
>>>>> What could be set up wrong here? Or can this ugly cross join be
>>>>> avoided altogether? I mean, my original problem is actually something
>>>>> else ;-)
>>>>>
>>>>> Cheers
>>>>> Wolli
>>>>>
>>>>>
>>>>> 2014-03-05 15:07 GMT+01:00 java8964 <java8...@hotmail.com>:
>>>>>
>>>>>> Hi, Wolli:
>>>>>>
>>>>>> Cross join doesn't mean Hive has to use one reducer.
>>>>>>
>>>>>> From the query point of view, the following cases will use one
>>>>>> reducer:
>>>>>>
>>>>>> 1) ORDER BY in your query (instead of using SORT BY)
>>>>>> 2) Only one reducer group, which means all the data has to be sent
>>>>>> to one reducer, as there is only one reducer group.
>>>>>>
>>>>>> In your case, the distinct count of id1 will be the reducer group
>>>>>> count. Did you explicitly set the reducer count in your Hive session?
>>>>>>
>>>>>> Yong
>>>>>>
>>>>>> ------------------------------
>>>>>> Date: Wed, 5 Mar 2014 14:17:24 +0100
>>>>>> Subject: Best way to avoid cross join
>>>>>> From: darkwoll...@gmail.com
>>>>>> To: user@hive.apache.org
>>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> Before I write a lot of text, I'll just post something which is
>>>>>> already written:
>>>>>> http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx
>>>>>>
>>>>>> The first post addresses a pretty similar problem to the one I have.
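One avenue the thread does not try (an assumption on my part, not something anyone above proposed): `mapred.reduce.tasks` cannot help because the join itself is planned with a single reducer, but since et_keywords holds only ~2000 keywords, Hive's automatic map-join conversion could broadcast it to every mapper and skip the join's reduce phase entirely. Both settings below exist in Hive 0.11; whether the conversion actually fires for a CROSS JOIN depends on the version and configuration:

```sql
-- Suggestion, not from the thread: let Hive convert the join to a map join,
-- broadcasting the small table to each mapper so no join reducer is needed.
set hive.auto.convert.join=true;
-- The small side (et_keywords, ~2000 rows) must be under this byte threshold:
set hive.mapjoin.smalltable.filesize=25000000;

SELECT id1,
       m.keyword,
       prep_kw.keyword
FROM (SELECT id1, keyword FROM import1) m
CROSS JOIN
     (SELECT keyword FROM et_keywords) prep_kw;
```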
>>>>>> Currently my implementation looks like this:
>>>>>>
>>>>>> SELECT id1,
>>>>>>        MAX(
>>>>>>          CASE
>>>>>>            WHEN m.keyword IS NULL
>>>>>>            THEN 0
>>>>>>            WHEN instr(m.keyword, prep_kw.keyword) > 0
>>>>>>            THEN 1
>>>>>>            ELSE 0
>>>>>>          END) AS flag
>>>>>> FROM (SELECT id1, keyword FROM import1) m
>>>>>> CROSS JOIN
>>>>>>      (SELECT keyword FROM et_keywords) prep_kw
>>>>>> GROUP BY id1;
>>>>>>
>>>>>> Since there is a cross join involved, the execution gets pinned down
>>>>>> to one reducer only, and it takes ages to complete.
>>>>>>
>>>>>> The thread I posted solves this with some SQL Server-specific
>>>>>> tactics. But I was wondering if anybody has already encountered this
>>>>>> problem in Hive and found a better way to solve it.
>>>>>>
>>>>>> I'm using Hive 0.11 on a MapR distribution, if that is somehow
>>>>>> important.
>>>>>>
>>>>>> Cheers
>>>>>> Wolli
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>
>>
>
>
> --
> Nitin Pawar
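A hedged sketch of one possible workaround, not proposed anywhere in the thread: keep the cross join but mark the small side with Hive's `MAPJOIN` hint, so the Cartesian product is produced map-side and the single join reducer disappears; only the GROUP BY on id1 then runs reduce-side, and that has as many reducer groups as there are distinct ids. Names are taken from Wolli's query; whether the hint is honored for a CROSS JOIN in Hive 0.11 on MapR would need to be verified:

```sql
-- Assumed workaround: broadcast et_keywords (~2000 rows) via the MAPJOIN hint.
-- The join then happens in the mappers; reducers only handle GROUP BY id1.
SELECT /*+ MAPJOIN(prep_kw) */
       id1,
       MAX(CASE
             WHEN m.keyword IS NULL THEN 0
             WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1
             ELSE 0
           END) AS flag
FROM (SELECT id1, keyword FROM import1) m
CROSS JOIN
     (SELECT keyword FROM et_keywords) prep_kw
GROUP BY id1;
```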