
I still haven't solved this problem. Any help is appreciated.
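
To restate what I am after, here is a minimal sketch in plain Python (not Hive) of what the query in my original post further down is meant to compute. The sample rows and the function name flag_rows are made up for illustration; only the table and column names (import1, et_keywords, id1, keyword) come from the query.

```python
# Hypothetical sample data standing in for import1 (id1, keyword) and et_keywords.
import1 = [
    (1, "cheap flights to berlin"),
    (2, "hotel booking"),
    (3, None),
]
et_keywords = ["flight", "train"]


def flag_rows(rows, keywords):
    """Return {id1: flag}; flag is 1 if any keyword occurs inside any text of that id."""
    flags = {}
    for id1, text in rows:
        # Mirrors the CASE expression: a NULL text gives 0, a substring hit gives 1.
        hit = int(text is not None and any(kw in text for kw in keywords))
        # Mirrors MAX(...) ... GROUP BY id1.
        flags[id1] = max(flags.get(id1, 0), hit)
    return flags


flags = flag_rows(import1, et_keywords)

# Worst case, every row is compared against every keyword (the cross join size).
pairs_checked_worst_case = len(import1) * len(et_keywords)
```

Each row is checked on its own against the small keyword list, so I would hope the work can be spread over many tasks. Whether and how Hive 0.11 can be made to do that is exactly my question.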


2014-03-14 10:55 GMT+01:00 fab wol <>:

> Hey Nitin,
> import1 has at least 1.2 million rows, with almost the same number of
> distinct ids and approximately 40k distinct keywords. et_keywords contains
> about 2000 keywords. So the result of this cross join will be ca. 2.4
> billion rows which need to be checked (see the INSTR() function).
> Thx for looking into ...
> Cheers
> Wolli
> 2014-03-05 15:35 GMT+01:00 Nitin Pawar <>:
>> Setting the number of reducers normally will not help unless there are that
>> many keys for the reducers. Even if it launches that many reducers, it may
>> well happen that most of them won't get any data.
>> Can you share how many different ids there are and what the data sizes
>> are in rows?
>> On Wed, Mar 5, 2014 at 7:57 PM, fab wol <> wrote:
>>> hey Yong,
>>> Even without the group by (pure cross join) the query is only using one
>>> reducer. Even specifying more reducers doesn't help:
>>> set mapred.reduce.tasks=50;
>>> SELECT id1,
>>>        m.keyword,
>>>        prep_kw.keyword
>>> FROM (select id1, keyword from import1) m
>>>   CROSS JOIN
>>>   (SELECT keyword FROM et_keywords) prep_kw;
>>> ...
>>> Hadoop job information for Stage-1: number of mappers: 3; number of
>>> reducers: 1
>>> What could be set up wrong here? Or can this ugly cross join be avoided
>>> altogether? I mean, my original problem is actually something else ;-)
>>> Cheers
>>> Wolli
>>> 2014-03-05 15:07 GMT+01:00 java8964 <>:
>>>> Hi, Wolli:
>>>> A cross join doesn't mean Hive has to use one reducer.
>>>> From the query's point of view, the following cases will use one reducer:
>>>> 1) ORDER BY in your query (instead of SORT BY)
>>>> 2) Only one reducer group, which means all the data has to be sent to one
>>>> reducer.
>>>> In your case, the distinct count of id1 will be the reducer group count.
>>>> Did you explicitly set the reducer count in your Hive session?
>>>> Yong
>>>> ------------------------------
>>>> Date: Wed, 5 Mar 2014 14:17:24 +0100
>>>> Subject: Best way to avoid cross join
>>>> From:
>>>> To:
>>>> Hey everyone,
>>>> Before I write a lot of text, I'll just post something which is already
>>>> written:
>>>> The first post addresses a problem pretty similar to the one I have.
>>>> Currently my implementation looks like this:
>>>> SELECT id1,
>>>>   MAX(
>>>>   CASE
>>>>     WHEN m.keyword IS NULL
>>>>     THEN 0
>>>>     WHEN instr(m.keyword, prep_kw.keyword) > 0
>>>>     THEN 1
>>>>     ELSE 0
>>>>   END) AS flag
>>>> FROM (select id1, keyword from import1) m
>>>>   CROSS JOIN
>>>>   (SELECT keyword FROM et_keywords) prep_kw
>>>> GROUP BY id1;
>>>> Since there is a cross join involved, the execution gets pinned down to
>>>> 1 reducer only and it takes ages to complete.
>>>> The thread I posted solves this with some special SQL Server
>>>> tactics. But I was wondering if anybody has already encountered the problem
>>>> in Hive and found a better way to solve it.
>>>> I'm using Hive 0.11 on a MapR distribution, in case that matters.
>>>> Cheers
>>>> Wolli
>> --
>> Nitin Pawar
