If you are doing a map side join make sure the table members_map is small enough to hold in memory
On 4/24/12, Ruben de Vries <ruben.devr...@hyves.nl> wrote: > Wow thanks everyone for the nice feedback! > > I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right? > > > Cheers, > > Ruben de Vries > > -----Original Message----- > From: Mark Grover [mailto:mgro...@oanda.com] > Sent: Tuesday, April 24, 2012 3:17 AM > To: user@hive.apache.org; bejoy ks > Cc: Ruben de Vries > Subject: Re: When/how to use partitions and buckets usefully? > > Hi Ruben, > Like Bejoy pointed out, members_map is small enough to fit in memory, so > your joins with visit_stats would be much faster with map-side join. > > However, there is still some virtue in bucketing visit_stats. Bucketing can > optimize joins, group by's and potentially other queries in certain > circumstances. > You probably want to keep consistent bucketing columns across all your > tables so they can leveraged in multi-table queries. Most people use some > power of 2 as their number of buckets. To make the best use of the buckets, > each of your buckets should be able to entirely load into memory on the > node. > > I use something close the formula below to calculate the number of buckets: > > #buckets = (x * Average_partition_size) / > JVM_memory_available_to_your_Hadoop_tasknode > > I call x (>1) the "factor of conservatism". Higher x means you are being > more conservative by having larger number of buckets (and bearing the > increased overhead), lower x means the reverse. What x to use would depend > on your use case. This is because the number of buckets in a table is fixed. > If you have a large partition, it would distribute it's data into bulkier > buckets and you would want to make sure these bulkier buckets can still fit > in memory. Moreover, buckets are generated using a hashing function, if you > have a strong bias towards a particular value of bucketing column in your > data, some buckets might be bulkier than others. In that case, you'd want to > make sure that those bulkier buckets can still fit in memory. > > To summarize, it depends on: > * How the actual partition sizes vary from the average partition size (i.e. > the standard deviation of your partition size). More standard deviations > means you should be more conservative in your calculation and vice-versa. > * Distribution of the data in the bucketing columns. "Wider" distribution > means you should be more conservative and vice-versa. > > Long story short, I would say, x of 2 to 4 should suffice in most cases but > feel free to verify that in your case:-) I would love to hear what factors > others have been using when calculating their number of buckets, BTW! > Whatever answer you get for #buckets from above formula, use the closest > power of 2 as the number of buckets in your table (I am not sure if this is > a must, though). > > Good luck! > > Mark > > Mark Grover, Business Intelligence Analyst OANDA Corporation > > www: oanda.com www: fxtrade.com > e: mgro...@oanda.com > > "Best Trading Platform" - World Finance's Forex Awards 2009. > "The One to Watch" - Treasury Today's Adam Smith Awards 2009. > > > ----- Original Message ----- > From: "Bejoy KS" <bejoy...@yahoo.com> > To: "Ruben de Vries" <ruben.devr...@hyves.nl>, user@hive.apache.org > Sent: Monday, April 23, 2012 12:39:17 PM > Subject: Re: When/how to use partitions and buckets usefully? > > If data is in hdfs, then you can bucket it only after loading into a > temp/staging table and then to the final bucketed table. Bucketing needs a > Map reduce job. > > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > > From: Ruben de Vries <ruben.devr...@hyves.nl> > Date: Mon, 23 Apr 2012 18:13:20 +0200 > To: user@hive.apache.org<user@hive.apache.org>; > bejoy...@yahoo.com<bejoy...@yahoo.com> > Subject: RE: When/how to use partitions and buckets usefully? > > > > > Thanks for the help so far guys, > > > > I bucketed the members_map, it’s 330mb in size (11 mil records). > > > > Can you manually bucket stuff? > > Since my initial mapreduce job is still outside of Hive I’m doing a LOAD > DATA to import stuff into the visit_stats tables, replacing that with INSERT > OVERWRITE SELECT slows it down a lot > > > > > > From: Bejoy KS [mailto:bejoy...@yahoo.com] > Sent: Monday, April 23, 2012 6:06 PM > To: user@hive.apache.org > Subject: Re: When/how to use partitions and buckets usefully? > > > > For Bucketed map join, both tables should be bucketed and the number of > buckets of one should be multiple of other. > > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > > > > > From: "Bejoy KS" < bejoy...@yahoo.com > > > > Date: Mon, 23 Apr 2012 16:03:34 +0000 > > > To: < user@hive.apache.org > > > > ReplyTo: bejoy...@yahoo.com > > > Subject: Re: When/how to use partitions and buckets usefully? > > > > > Bucketed map join would be good I guess. What is the total size of the > smaller table and what is its expected size in the next few years? > > The size should be good enough to be put in Distributed Cache, then map side > joins would offer you much performance improvement. > > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > > > > > From: Ruben de Vries < ruben.devr...@hyves.nl > > > > Date: Mon, 23 Apr 2012 17:38:20 +0200 > > > To: user@hive.apache.org<user@hive.apache.org > > > > ReplyTo: user@hive.apache.org > > > Subject: RE: When/how to use partitions and buckets usefully? > > > > > Ok, very clear on the partitions, try to make them match the WHERE clauses, > not so much about group clauses then ;) > > > > The member_map contains 11.636.619 records atm, I think bucketing those > would be good? > > What’s a good number to bucket them by then? > > > > And is there any point in bucketing the visit_stats? > > > > > > From: Tucker, Matt [mailto:matt.tuc...@disney.com] > Sent: Monday, April 23, 2012 5:30 PM > To: user@hive.apache.org > Subject: RE: When/how to use partitions and buckets usefully? > > > > If you’re only interested in a certain window of dates for analysis, a > date-based partition scheme will be helpful, as it will trim partitions that > aren’t needed by the query before execution. > > > > If the member_map table is small, you might consider testing the feasibility > of map-side joins, as it will reduce the number of processing stages. If > member_map is large, bucketing on member_id will avoid having as many rows > from visit_stats compared to each member_id for joins. > > > > > Matt Tucker > > > > > > From: Ruben de Vries [mailto:ruben.devr...@hyves.nl] > Sent: Monday, April 23, 2012 11:19 AM > To: user@hive.apache.org > Subject: When/how to use partitions and buckets usefully? > > > > It seems there’s enough information to be found on how to setup and use > partitions and buckets. > > But I’m more interested in how to figure out when and what columns you > should be partitioning and bucketing to increase performance?! > > > > In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols > which give me info about the visits) and 1 member_map (member_id, gender, > age). > > > > Usually I group by date and then one of the other col so I assume that > partitioning on date is a good start?! > > > > It seems the join of the member_map onto the visit_stats makes the queries a > lot slower, can that be fixed by bucketing both tables? Or just one of them? > > > > > Maybe some ppl have written good blogs on this subject but I can’t really > seem to find them!? > > > > Any help would be appreciated, thanks in advance J > -- Nitin Pawar