Yeah, my abridged version of query might be a little broken but my point is that when a query has a map join and group by, even in its simplified incarnation, it will launch two jobs. I was just wondering why map join and group by cannot be accomplished in one MR job.
Best, Chen On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <nitinpawar...@gmail.com>wrote: > I think Chen wanted to know why this is two phased query if I understood > it correctly > > When you run a mapside join .. it just performs the join query .. after > that to execute the group by part it launches the second job. > I may be wrong but this is how I saw it whenever I executed group by > queries > > > On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover > <grover.markgro...@gmail.com>wrote: > >> Hi Chen, >> I think we would need some more information. >> >> The query is referring to a table called "d" in the MAPJOIN hint but >> there is not such table in the query. Moreover, Map joins only make >> sense when the right table is the one being "mapped" (in other words, >> being kept in memory) in case of a Left Outer Join, similarly if the >> left table is the one being "mapped" in case of a Right Outer Join. >> Let me know if this is not clear, I'd be happy to offer a better >> explanation. >> >> In your query, the where clause on a column called "hour", at this >> point I am unsure if that's a column of table1 or table2. If it's >> column on table1, that predicate would get pushed up (if you have >> hive.optimize.ppd property set to true), so it could possibly be done >> in 1 MR job (I am not sure if that's presently the case, you will have >> to check the explain plan). If however, the where clause is on a >> column in the right table (table2 in your example), it can't be pushed >> up since a column of the right table can have different values before >> and after the LEFT OUTER JOIN. Therefore, the where clause would need >> to be applied in a separate MR job. >> >> This is just my understanding, the full proof answer would lie in >> checking out the explain plans and the Semantic Analyzer code. >> >> And for completeness, there is a conditional task (starting Hive 0.7) >> that will convert your joins automatically to map joins where >> applicable. This can be enabled by enabling hive.auto.convert.join >> property. >> >> Mark >> >> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <chen.song...@gmail.com> >> wrote: >> > I have a silly question on how Hive interpretes a simple query with >> both map >> > side join and group by. >> > >> > Below query will translate into two jobs, with the 1st one as a map >> only job >> > doing the join and storing the output in a intermediary location, and >> the >> > 2nd one as a map-reduce job taking the output of the 1st job as input >> and >> > doing the group by. >> > >> > SELECT >> > /*+ MAPJOIN(d) */ >> > table.a, sum(table2.b) >> > from table >> > LEFT OUTER JOIN table2 >> > ON table.id = table2.id >> > where hour = '2012-12-11 11' >> > group by table.a >> > >> > Why can't this be done within a single map reduce job? As what I can see >> > from the query plan is that all 2nd job mapper do is taking the 1st >> job's >> > mapper output. >> > >> > -- >> > Chen Song >> > >> > >> > > > > -- > Nitin Pawar > -- Chen Song