anyone? still haven't solved this problem. Any help is appreciated.
Cheers Wolli 2014-03-14 10:55 GMT+01:00 fab wol <darkwoll...@gmail.com>: > Hey Nitin, > > in import1 are at least 1.2 mio rows, with almost the same amount of > distinct id's and approxametly 40k distinct keywords. et_keywords contains > roundabout 2000 keywords. So the result of this cross join will be ca. 2.4 > bio rows which need to be checked (see INSTR() function). > > Thx for looking into ... > > Cheers > Wolli > > > 2014-03-05 15:35 GMT+01:00 Nitin Pawar <nitinpawar...@gmail.com>: > >> setting number of reducers will not help normally unless there are those >> many keys for reducers. even if it launches those many reducers, it may >> just happen that most of them just wont get any data. >> >> can you share how many different ids are there and whats the data sizes >> in rows? >> >> >> On Wed, Mar 5, 2014 at 7:57 PM, fab wol <darkwoll...@gmail.com> wrote: >> >>> hey Yong, >>> >>> Even without the group by (pure cross join) the query is only using one >>> reducer. Even specifying more reducers doesn't help: >>> >>> set mapred.reduce.tasks=50; >>> SELECT id1, >>> m.keyword, >>> prep_kw.keyword >>> FROM (select id1, keyword from import1) m >>> CROSS JOIN >>> (SELECT keyword FROM et_keywords) prep_kw; >>> >>> ... >>> >>> Hadoop job information for Stage-1: number of mappers: 3; number of >>> reducers: 1 >>> >>> What could be setup wrong here? Or can it be avoided to use this ugly >>> cross join at all? I mean my original problem is actually something else ;-) >>> >>> Cheers >>> Wolli >>> >>> >>> 2014-03-05 15:07 GMT+01:00 java8964 <java8...@hotmail.com>: >>> >>> Hi, Wolli: >>>> >>>> Cross join doesn't mean Hive has to use one reduce. >>>> >>>> From query point of view, the following cases will use one reducer: >>>> >>>> 1) Order by in your query (Instead of using sort by) >>>> 2) Only one reducer group, which means all the data have to send to one >>>> reducer, as there is only one reducer group. >>>> >>>> In your case, distinct count of id1 will be the reducer group count. >>>> Did you explicitly set the reducer count in your hive session? >>>> >>>> Yong >>>> >>>> ------------------------------ >>>> Date: Wed, 5 Mar 2014 14:17:24 +0100 >>>> Subject: Best way to avoid cross join >>>> From: darkwoll...@gmail.com >>>> To: user@hive.apache.org >>>> >>>> Hey everyone, >>>> >>>> before i write a lot of text, i just post something which is already >>>> written: >>>> http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx >>>> >>>> The first posts adresses a pretty similar problem i also have. >>>> Currently my implementation looks like this: >>>> >>>> SELECT id1, >>>> MAX( >>>> CASE >>>> WHEN m.keyword IS NULL >>>> THEN 0 >>>> WHEN instr(m.keyword, prep_kw.keyword) > 0 >>>> THEN 1 >>>> ELSE 0 >>>> END) AS flag >>>> FROM (select id1, keyword from import1) m >>>> CROSS JOIN >>>> (SELECT keyword FROM et_keywords) prep_kw >>>> GROUP BY id1; >>>> >>>> Since there is a cross join involved, the execution gets pinned down to >>>> 1 reducer only and it takes ages to complete. >>>> >>>> The thread i posted is solving this with some special SQLserver >>>> tactics. But I was wondering if anybody has encountered the problem in Hive >>>> already and found a better way to solve this. >>>> >>>> I'm using Hive 0.11 on a MapR Distribution, if this is somehow >>>> important. >>>> >>>> Cheers >>>> Wolli >>>> >>> >>> >> >> >> -- >> Nitin Pawar >> > >