Hey Nitin, Yong wrote exactly the opposite in his first sentence:
*Cross join doesn't mean Hive has to use one reduce.*

And this super old thread here also lets me assume that more than one reducer can be used:
http://mail-archives.apache.org/mod_mbox/hive-user/200904.mbox/%3ca132f89f9b9df241b1c3c33e7299beb90486361...@sc-mbxc1.thefacebook.com%3E

Unfortunately, the "full outer join" with "on 1 = 1" also uses one reducer ...

Cheers
Wolli

2014-03-19 11:47 GMT+01:00 Nitin Pawar <nitinpawar...@gmail.com>:

> hey Wolli,
>
> sorry, I missed this one.
>
> As Yong already replied, cross join always uses only one reducer.
>
> If you want to avoid this, can you try making it a full outer join with
> the on condition (1 = 1), and see if you get your desired result?
>
>
> On Wed, Mar 19, 2014 at 4:05 PM, fab wol <darkwoll...@gmail.com> wrote:
>
>> anyone?
>>
>> I still haven't solved this problem. Any help is appreciated.
>>
>> Cheers
>> Wolli
>>
>>
>> 2014-03-14 10:55 GMT+01:00 fab wol <darkwoll...@gmail.com>:
>>
>>> Hey Nitin,
>>>
>>> import1 has at least 1.2 million rows, with almost the same number of
>>> distinct ids and approximately 40k distinct keywords. et_keywords
>>> contains roughly 2000 keywords. So the result of this cross join will be
>>> ca. 2.4 billion rows which need to be checked (see the INSTR() function).
>>>
>>> Thanks for looking into it ...
>>>
>>> Cheers
>>> Wolli
>>>
>>>
>>> 2014-03-05 15:35 GMT+01:00 Nitin Pawar <nitinpawar...@gmail.com>:
>>>
>>>> Setting the number of reducers will normally not help unless there are
>>>> that many keys for the reducers. Even if it launches that many
>>>> reducers, it may just happen that most of them won't get any data.
>>>>
>>>> Can you share how many different ids there are and what the data sizes
>>>> are in rows?
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 7:57 PM, fab wol <darkwoll...@gmail.com> wrote:
>>>>
>>>>> hey Yong,
>>>>>
>>>>> Even without the group by (pure cross join), the query is only using
>>>>> one reducer.
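For reference, the full-outer-join variant Nitin suggests (and which Wolli reports still ends up on one reducer) would look roughly like this. This is only a sketch reconstructed from the thread, reusing Wolli's table and column names; whether Hive 0.11's parser even accepts a constant join condition like `(1 = 1)` depends on the version, since Hive historically only supports equi-joins:

```sql
-- Sketch of Nitin's suggested rewrite; names taken from the thread.
-- A constant-true join condition is planned much like a cross join,
-- which would be consistent with it also landing on a single reducer.
SELECT id1,
       m.keyword,
       prep_kw.keyword
FROM (SELECT id1, keyword FROM import1) m
FULL OUTER JOIN
     (SELECT keyword FROM et_keywords) prep_kw
ON (1 = 1);
```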
>>>>> Even specifying more reducers doesn't help:
>>>>>
>>>>> set mapred.reduce.tasks=50;
>>>>> SELECT id1,
>>>>>        m.keyword,
>>>>>        prep_kw.keyword
>>>>> FROM (SELECT id1, keyword FROM import1) m
>>>>> CROSS JOIN
>>>>>      (SELECT keyword FROM et_keywords) prep_kw;
>>>>>
>>>>> ...
>>>>>
>>>>> Hadoop job information for Stage-1: number of mappers: 3; number of
>>>>> reducers: 1
>>>>>
>>>>> What could be set up wrong here? Or can this ugly cross join be
>>>>> avoided altogether? I mean, my original problem is actually something
>>>>> else ;-)
>>>>>
>>>>> Cheers
>>>>> Wolli
>>>>>
>>>>>
>>>>> 2014-03-05 15:07 GMT+01:00 java8964 <java8...@hotmail.com>:
>>>>>
>>>>>> Hi, Wolli:
>>>>>>
>>>>>> Cross join doesn't mean Hive has to use one reducer.
>>>>>>
>>>>>> From the query point of view, the following cases will use one
>>>>>> reducer:
>>>>>>
>>>>>> 1) ORDER BY in your query (instead of using SORT BY)
>>>>>> 2) Only one reducer group, which means all the data has to be sent
>>>>>> to one reducer, as there is only one reducer group.
>>>>>>
>>>>>> In your case, the distinct count of id1 will be the reducer group
>>>>>> count. Did you explicitly set the reducer count in your Hive session?
>>>>>>
>>>>>> Yong
>>>>>>
>>>>>> ------------------------------
>>>>>> Date: Wed, 5 Mar 2014 14:17:24 +0100
>>>>>> Subject: Best way to avoid cross join
>>>>>> From: darkwoll...@gmail.com
>>>>>> To: user@hive.apache.org
>>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> Before I write a lot of text, I'll just post something which is
>>>>>> already written:
>>>>>> http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx
>>>>>>
>>>>>> The first post addresses a pretty similar problem to the one I have.
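One avenue the thread does not try (an assumption on my part, not something anyone above proposed): `mapred.reduce.tasks` cannot help because the join itself is planned with a single reducer, but since et_keywords holds only ~2000 keywords, Hive's automatic map-join conversion could broadcast it to every mapper and skip the join's reduce phase entirely. Both settings below exist in Hive 0.11; whether the conversion actually fires for a CROSS JOIN depends on the version and configuration:

```sql
-- Suggestion, not from the thread: let Hive convert the join to a map join,
-- broadcasting the small table to each mapper so no join reducer is needed.
set hive.auto.convert.join=true;
-- The small side (et_keywords, ~2000 rows) must be under this byte threshold:
set hive.mapjoin.smalltable.filesize=25000000;

SELECT id1,
       m.keyword,
       prep_kw.keyword
FROM (SELECT id1, keyword FROM import1) m
CROSS JOIN
     (SELECT keyword FROM et_keywords) prep_kw;
```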
>>>>>> Currently my implementation looks like this:
>>>>>>
>>>>>> SELECT id1,
>>>>>>        MAX(
>>>>>>          CASE
>>>>>>            WHEN m.keyword IS NULL
>>>>>>            THEN 0
>>>>>>            WHEN instr(m.keyword, prep_kw.keyword) > 0
>>>>>>            THEN 1
>>>>>>            ELSE 0
>>>>>>          END) AS flag
>>>>>> FROM (SELECT id1, keyword FROM import1) m
>>>>>> CROSS JOIN
>>>>>>      (SELECT keyword FROM et_keywords) prep_kw
>>>>>> GROUP BY id1;
>>>>>>
>>>>>> Since there is a cross join involved, the execution gets pinned down
>>>>>> to one reducer only, and it takes ages to complete.
>>>>>>
>>>>>> The thread I posted solves this with some SQL Server-specific
>>>>>> tactics. But I was wondering if anybody has already encountered this
>>>>>> problem in Hive and found a better way to solve it.
>>>>>>
>>>>>> I'm using Hive 0.11 on a MapR distribution, if that is somehow
>>>>>> important.
>>>>>>
>>>>>> Cheers
>>>>>> Wolli
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>
>>
>
>
> --
> Nitin Pawar
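A hedged sketch of one possible workaround, not proposed anywhere in the thread: keep the cross join but mark the small side with Hive's `MAPJOIN` hint, so the Cartesian product is produced map-side and the single join reducer disappears; only the GROUP BY on id1 then runs reduce-side, and that has as many reducer groups as there are distinct ids. Names are taken from Wolli's query; whether the hint is honored for a CROSS JOIN in Hive 0.11 on MapR would need to be verified:

```sql
-- Assumed workaround: broadcast et_keywords (~2000 rows) via the MAPJOIN hint.
-- The join then happens in the mappers; reducers only handle GROUP BY id1.
SELECT /*+ MAPJOIN(prep_kw) */
       id1,
       MAX(CASE
             WHEN m.keyword IS NULL THEN 0
             WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1
             ELSE 0
           END) AS flag
FROM (SELECT id1, keyword FROM import1) m
CROSS JOIN
     (SELECT keyword FROM et_keywords) prep_kw
GROUP BY id1;
```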