Hi Stephen, My query is actually more complex , hive will generate 2 mapreduces, in the first solution , it runs 17 mappers / 30 reducers and 10 mappers / 30 reducers (reducer num is set manually) in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1 reducers for each partition
I do not know whether they could achieve the same performance if the reducers num is set properly. 2013/6/29 Stephen Sprague <sprag...@gmail.com> > great question. your parallelization seems to trump hadoop's. I guess > i'd ask what are the _total_ number of Mappers and Reducers that run on > your cluster for these two scenarios? I'd be curious if there are the > same. > > > > > On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 <ygnhz...@gmail.com> wrote: > >> Hi all, >> >> Here is the scenario, suppose I have 2 tables A and B, I would like to >> perform a simple join on them, >> >> We can do it like this: >> >> INSERT OVERWRITE TABLE C >> SELECT .... FROM A JOIN B on A.id=B.id >> >> In order to speed up this query since table A and B have lots of data, >> another approach is : >> >> Say I partition table A and B into 10 partitions respectively, and write >> the query like this >> >> INSERT OVERWRITE TABLE C PARTITION(pid=1) >> SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1 >> >> then I run this query 10 times concurrently (pid ranges from 1 to 10) >> >> And my question is that , in my observation of some more complex queries, >> the second solution is about 15% faster than the first solution, >> is it simply because the setting of reducer num is not optimal? >> If the resource is not a limit and it is possible to set the proper >> reducer nums in the first solution , can they achieve the same performance? >> Is there any other fact that can cause performance difference between >> them(non-partition VS partition+concurrent) besides the job parameter >> issues? >> >> Thanks! >> > >