Maybe a UDF could solve your problem.

Regards,
Sarfraz Rasheed Ramay (DIT)
Dublin, Ireland.
On Mon, Jun 9, 2014 at 7:30 PM, Mark Desnoyer <desno...@neon-lab.com> wrote:

> Hi Furcy,
>
> Thanks for the reply. I looked at MapJoin, but it won't do what I need
> because all the tables will be large, and explicitly going through the
> entire table in an n^2 fashion is very inefficient.
>
> I have large tables, but the intersection is very small. In the Ad Click
> case, I would have two streams of data: one for loads and one for clicks.
> When a page is rendered, a unique reference id is generated, which gets
> tagged to the load and to any click that might happen. So, if I had two
> tables, one for the loads and one for the clicks, the reference id in each
> table could be used to link a click to its associated load. In other
> words, for a given reference id, there is probably only one load and a
> small number of clicks. So, my join would look something like:
>
> SELECT <some cols> FROM loads LEFT JOIN clicks
>   ON loads.reference_id = clicks.reference_id;
>
> In the MapReduce framework, I could use that reference id as the map
> output key in order to collect the events that are associated with each
> other, and then the reducer would generate the rows by joining the small
> number of rows it received. Technically, you could take this approach for
> any equality join, because you're using the partitioning step to enforce
> the join predicate instead of actually evaluating it. This could be much
> more efficient in some cases, so I was wondering if Hive can do it and,
> if so, how I can trigger it?
>
> Cheers,
> Mark
>
>
> On Mon, Jun 9, 2014 at 10:34 AM, Furcy Pin <furcy....@flaminem.com> wrote:
>
>> Hi Mark,
>>
>> I'm not sure I understand what you're trying to do. Do you know the
>> reference id on which you want to do the join beforehand?
>> Or is one of your tables small?
>> Or are they all big with a small intersection?
>>
>> If you haven't yet, I would suggest you have a look at MapJoin:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>> Mapjoins allow Hive to perform a join on the map side by putting the
>> small table in cache.
>> (Depending on your Hive version, this might or might not be enabled by
>> default.)
>>
>> Depending on your use case, you might have to pre-filter some tables;
>> for that you can use sub-queries like:
>>
>> FROM
>> (
>>   SELECT id FROM T WHERE id = myID
>> ) T1
>> JOIN T2 ON T1.id = T2.id
>> SELECT *
>>
>> Unlike a standard SQL engine, which would simplify the query, I believe
>> this will force Hive to perform the sub-select in a first MapReduce job
>> and apply the join in the second
>> (as a mapjoin if the result of the subquery is small enough).
>>
>> Hope this helps,
>>
>> Furcy
>>
>>
>> 2014-06-09 18:48 GMT+02:00 Mark Desnoyer <desno...@neon-lab.com>:
>>
>>> Hi,
>>>
>>> I was wondering if there is a way to trigger Hive to perform an
>>> efficient equality join on large tables? Specifically, I have two or
>>> more tables where the joined key is relatively rare in each table. A
>>> good example would be an AdClick scenario where you have two tables,
>>> one for ad loads and one for ad clicks, with a reference id connecting
>>> a click to a specific ad load; that reference id is the key to join on.
>>>
>>> In the MapReduce framework, this join could be done efficiently by
>>> using the reference id as the key of the map output; the join would
>>> then happen in the reducer, since there are only a small number of rows
>>> associated with each reference id. However, Hive implements join by
>>> buffering all but the last table in the reducer and then streaming the
>>> last table through. This works great if the first tables are relatively
>>> small, but if the tables are large and the associated rows are rare,
>>> it's very inefficient.
>>>
>>> So, can Hive perform the join the first way I described?
>>> If so, any idea how I trigger it? Or do I have to write my own MR job?
>>>
>>> Cheers,
>>> Mark
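To make the reduce-side ("shuffle") join Mark describes concrete, here is a minimal Python sketch. It is an illustration only, not Hive's implementation: the column names follow the thread (`reference_id`), but the row layout and the `click_time`/`ad` fields are invented for the example. The "map" phase groups rows from both tables by key, and each "reduce" group performs the tiny per-key cross product, keeping unmatched loads to mimic a LEFT JOIN.

```python
from collections import defaultdict

def shuffle_join(loads, clicks):
    """Simulate a reduce-side equality join: the shuffle groups rows from
    both tables by reference_id, then each 'reducer' joins only the
    handful of rows that share a key."""
    # Map/shuffle phase: bucket every row under its reference_id,
    # tagged by which table it came from.
    groups = defaultdict(lambda: ([], []))
    for row in loads:
        groups[row["reference_id"]][0].append(row)
    for row in clicks:
        groups[row["reference_id"]][1].append(row)

    # Reduce phase: per-key cross product; emit unmatched loads with a
    # NULL click side to mimic LEFT JOIN semantics.
    for ref_id, (load_rows, click_rows) in groups.items():
        for load in load_rows:
            if click_rows:
                for click in click_rows:
                    yield {**load, **click}
            else:
                yield {**load, "click_time": None}

loads = [{"reference_id": "r1", "ad": "a"},
         {"reference_id": "r2", "ad": "b"}]
clicks = [{"reference_id": "r1", "click_time": "12:01"}]
rows = sorted(shuffle_join(loads, clicks), key=lambda r: r["reference_id"])
print(rows)
```

Because each key carries only a few rows, no reducer ever buffers a whole table, which is exactly the efficiency Mark is after.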
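For contrast, the MapJoin Furcy mentions can be sketched the same way (again purely illustrative, with the same invented row layout): the small table is loaded into an in-memory hash map, and the big table is streamed through it with no shuffle at all. This is why MapJoin needs one side to be small, and why it doesn't fit Mark's case where both tables are large.

```python
def map_side_join(big_stream, small_table, key="reference_id"):
    """Simulate a map-side (hash/broadcast) join: build a lookup from the
    small table, then stream the big table through it without a shuffle."""
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)
    for row in big_stream:
        # [{}] as the default yields the big-table row unchanged when
        # there is no match, mimicking LEFT JOIN semantics.
        for match in lookup.get(row[key], [{}]):
            yield {**row, **match}

loads = [{"reference_id": "r1", "ad": "a"},
         {"reference_id": "r2", "ad": "b"}]
clicks = [{"reference_id": "r1", "click_time": "12:01"}]
print(list(map_side_join(loads, clicks)))
```

The trade-off is memory: `lookup` must hold the entire small table, whereas the shuffle join only ever holds one key's rows at a time.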