Re: Best way to avoid cross join

fab wol Wed, 19 Mar 2014 03:36:31 -0700

Anyone?

I still haven't solved this problem. Any help is appreciated.


Cheers
Wolli


2014-03-14 10:55 GMT+01:00 fab wol <darkwoll...@gmail.com>:

> Hey Nitin,
>
> import1 has at least 1.2 million rows, with almost as many distinct ids
> and approximately 40k distinct keywords. et_keywords contains roughly
> 2,000 keywords. So the cross join will produce about 2.4 billion rows,
> each of which needs to be checked (see the INSTR() function).
>
> Thanks for looking into it.
>
> Cheers
> Wolli
>
>
> 2014-03-05 15:35 GMT+01:00 Nitin Pawar <nitinpawar...@gmail.com>:
>
>> Setting the number of reducers normally will not help unless there are
>> that many keys for the reducers. Even if it launches that many reducers,
>> most of them may simply not get any data.
>>
>> Can you share how many different ids there are and what the data sizes
>> are in rows?
>>
>>
>> On Wed, Mar 5, 2014 at 7:57 PM, fab wol <darkwoll...@gmail.com> wrote:
>>
>>> hey Yong,
>>>
>>> Even without the group by (pure cross join) the query is only using one
>>> reducer. Even specifying more reducers doesn't help:
>>>
>>> set mapred.reduce.tasks=50;
>>> SELECT id1,
>>>        m.keyword,
>>>        prep_kw.keyword
>>> FROM (select id1, keyword from import1) m
>>>  CROSS JOIN
>>>   (SELECT keyword FROM et_keywords) prep_kw;
>>>
>>> ...
>>>
>>> Hadoop job information for Stage-1: number of mappers: 3; number of
>>> reducers: 1
>>>
>>> What could be set up wrong here? Or can this ugly cross join be avoided
>>> altogether? I mean, my original problem is actually something else ;-)
>>>
>>> Cheers
>>> Wolli
>>>
>>>
>>> 2014-03-05 15:07 GMT+01:00 java8964 <java8...@hotmail.com>:
>>>
>>>> Hi, Wolli:
>>>>
>>>> Cross join doesn't mean Hive has to use one reducer.
>>>>
>>>> From a query point of view, the following cases will use one reducer:
>>>>
>>>> 1) ORDER BY in your query (instead of SORT BY)
>>>> 2) Only one reducer group, which means all the data has to be sent to
>>>> one reducer.
>>>>
>>>> In your case, the distinct count of id1 will be the reducer group count.
>>>> Did you explicitly set the reducer count in your Hive session?
>>>>
>>>> Yong
>>>>
>>>> ------------------------------
>>>> Date: Wed, 5 Mar 2014 14:17:24 +0100
>>>> Subject: Best way to avoid cross join
>>>> From: darkwoll...@gmail.com
>>>> To: user@hive.apache.org
>>>>
>>>> Hey everyone,
>>>>
>>>> Before I write a lot of text, I'll just post something that is already
>>>> written:
>>>> http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx
>>>>
>>>> The first post addresses a problem pretty similar to mine.
>>>> Currently my implementation looks like this:
>>>>
>>>> SELECT id1,
>>>>   MAX(
>>>>   CASE
>>>>     WHEN m.keyword IS NULL
>>>>     THEN 0
>>>>     WHEN instr(m.keyword, prep_kw.keyword) > 0
>>>>     THEN 1
>>>>     ELSE 0
>>>>   END) AS flag
>>>> FROM (select id1, keyword from import1) m
>>>> CROSS JOIN
>>>>   (SELECT keyword FROM et_keywords) prep_kw
>>>> GROUP BY id1;
>>>>
>>>> Since there is a cross join involved, the execution gets pinned down to
>>>> 1 reducer only and it takes ages to complete.
>>>>
>>>> The thread I posted solves this with some SQL Server-specific
>>>> tactics. But I was wondering if anybody has already encountered the
>>>> problem in Hive and found a better way to solve it.
>>>>
>>>> I'm using Hive 0.11 on a MapR distribution, if that matters.
>>>>
>>>> Cheers
>>>> Wolli
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
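
One possible direction for the single-reducer problem described in the
thread, offered as a sketch and not as a tested fix: since et_keywords
holds only about 2,000 rows, the join could in principle run map-side,
with the small table loaded into memory by each mapper, so that only the
GROUP BY needs reducers. Both settings below are real Hive properties.
Whether Hive 0.11 actually turns this key-less join into a map join is an
assumption and should be checked with EXPLAIN before relying on it.

```sql
-- Let Hive convert joins against small tables into map-side joins.
SET hive.auto.convert.join=true;
-- MAPJOIN hints are ignored by default as of Hive 0.11; re-enable them
-- for this session so the hint below is honored.
SET hive.ignore.mapjoin.hint=false;

SELECT /*+ MAPJOIN(prep_kw) */
       m.id1,
       MAX(CASE
             WHEN m.keyword IS NULL THEN 0
             WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1
             ELSE 0
           END) AS flag
FROM import1 m
CROSS JOIN et_keywords prep_kw
GROUP BY m.id1;
```

If the plan shows a map join operator for the join stage, the reducer
count for the remaining GROUP BY stage should again follow
mapred.reduce.tasks or Hive's own estimate.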

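Another sketch, for the case where the join has to stay reduce-side: give
the join an artificial key so the work spreads over several reducers. Each
import1 row is assigned to one of N buckets, and every keyword is copied
into all N buckets, so each (row, keyword) pair still meets exactly once.
The bucket count of 4 and the column name bkt are illustrative
assumptions, and the query has not been run against the tables from this
thread.

```sql
-- One reducer per bucket at most; 4 is an arbitrary illustrative value.
SET mapred.reduce.tasks=4;

SELECT m.id1,
       -- instr() on a NULL keyword yields NULL, which falls through to ELSE 0,
       -- so the explicit IS NULL branch of the original query is not needed.
       MAX(CASE WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1 ELSE 0 END) AS flag
FROM (
  -- Deterministic bucket per row, in the range 0..3.
  SELECT id1, keyword, pmod(hash(id1), 4) AS bkt
  FROM import1
) m
JOIN (
  -- Replicate every keyword once per bucket.
  SELECT k.keyword, CAST(b.bkt AS INT) AS bkt
  FROM et_keywords k
  LATERAL VIEW explode(split('0,1,2,3', ',')) b AS bkt
) prep_kw
ON (m.bkt = prep_kw.bkt)
GROUP BY m.id1;
```

The total number of compared pairs stays the same (about 2.4 billion with
the row counts given above); the equality key only changes how those pairs
are distributed, so any speed-up depends on the buckets actually landing
on different reducers.
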