If you are doing a map side join make sure the table members_map is
small enough to hold in memory

On 4/24/12, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
> Wow thanks everyone for the nice feedback!
>
> I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?
>
>
> Cheers,
>
> Ruben de Vries
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, April 24, 2012 3:17 AM
> To: user@hive.apache.org; bejoy ks
> Cc: Ruben de Vries
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben,
> Like Bejoy pointed out, members_map is small enough to fit in memory, so
> your joins with visit_stats would be much faster with map-side join.
>
> However, there is still some virtue in bucketing visit_stats. Bucketing can
> optimize joins, group by's and potentially other queries in certain
> circumstances.
> You probably want to keep consistent bucketing columns across all your
> tables so they can leveraged in multi-table queries. Most people use some
> power of 2 as their number of buckets. To make the best use of the buckets,
> each of your buckets should be able to entirely load into memory on the
> node.
>
> I use something close the formula below to calculate the number of buckets:
>
> #buckets = (x * Average_partition_size) /
> JVM_memory_available_to_your_Hadoop_tasknode
>
> I call x (>1) the "factor of conservatism". Higher x means you are being
> more conservative by having larger number of buckets (and bearing the
> increased overhead), lower x means the reverse. What x to use would depend
> on your use case. This is because the number of buckets in a table is fixed.
> If you have a large partition, it would distribute it's data into bulkier
> buckets and you would want to make sure these bulkier buckets can still fit
> in memory. Moreover, buckets are generated using a hashing function, if you
> have a strong bias towards a particular value of bucketing column in your
> data, some buckets might be bulkier than others. In that case, you'd want to
> make sure that those bulkier buckets can still fit in memory.
>
> To summarize, it depends on:
> * How the actual partition sizes vary from the average partition size (i.e.
> the standard deviation of your partition size). More standard deviations
> means you should be more conservative in your calculation and vice-versa.
> * Distribution of the data in the bucketing columns. "Wider" distribution
> means you should be more conservative and vice-versa.
>
> Long story short, I would say, x of 2 to 4 should suffice in most cases but
> feel free to verify that in your case:-) I would love to hear what factors
> others have been using when calculating their number of buckets, BTW!
> Whatever answer you get for #buckets from above formula, use the closest
> power of 2 as the number of buckets in your table (I am not sure if this is
> a must, though).
>
> Good luck!
>
> Mark
>
> Mark Grover, Business Intelligence Analyst OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Bejoy KS" <bejoy...@yahoo.com>
> To: "Ruben de Vries" <ruben.devr...@hyves.nl>, user@hive.apache.org
> Sent: Monday, April 23, 2012 12:39:17 PM
> Subject: Re: When/how to use partitions and buckets usefully?
>
> If data is in hdfs, then you can bucket it only after loading into a
> temp/staging table and then to the final bucketed table. Bucketing needs a
> Map reduce job.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> From: Ruben de Vries <ruben.devr...@hyves.nl>
> Date: Mon, 23 Apr 2012 18:13:20 +0200
> To: user@hive.apache.org<user@hive.apache.org>;
> bejoy...@yahoo.com<bejoy...@yahoo.com>
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
>
> Thanks for the help so far guys,
>
>
>
> I bucketed the members_map, it’s 330mb in size (11 mil records).
>
>
>
> Can you manually bucket stuff?
>
> Since my initial mapreduce job is still outside of Hive I’m doing a LOAD
> DATA to import stuff into the visit_stats tables, replacing that with INSERT
> OVERWRITE SELECT slows it down a lot
>
>
>
>
>
> From: Bejoy KS [mailto:bejoy...@yahoo.com]
> Sent: Monday, April 23, 2012 6:06 PM
> To: user@hive.apache.org
> Subject: Re: When/how to use partitions and buckets usefully?
>
>
>
> For Bucketed map join, both tables should be bucketed and the number of
> buckets of one should be multiple of other.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
>
>
>
> From: "Bejoy KS" < bejoy...@yahoo.com >
>
>
> Date: Mon, 23 Apr 2012 16:03:34 +0000
>
>
> To: < user@hive.apache.org >
>
>
> ReplyTo: bejoy...@yahoo.com
>
>
> Subject: Re: When/how to use partitions and buckets usefully?
>
>
>
>
> Bucketed map join would be good I guess. What is the total size of the
> smaller table and what is its expected size in the next few years?
>
> The size should be good enough to be put in Distributed Cache, then map side
> joins would offer you much performance improvement.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
>
>
>
> From: Ruben de Vries < ruben.devr...@hyves.nl >
>
>
> Date: Mon, 23 Apr 2012 17:38:20 +0200
>
>
> To: user@hive.apache.org<user@hive.apache.org >
>
>
> ReplyTo: user@hive.apache.org
>
>
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
>
> Ok, very clear on the partitions, try to make them match the WHERE clauses,
> not so much about group clauses then ;)
>
>
>
> The member_map contains 11.636.619 records atm, I think bucketing those
> would be good?
>
> What’s a good number to bucket them by then?
>
>
>
> And is there any point in bucketing the visit_stats?
>
>
>
>
>
> From: Tucker, Matt [mailto:matt.tuc...@disney.com]
> Sent: Monday, April 23, 2012 5:30 PM
> To: user@hive.apache.org
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
> If you’re only interested in a certain window of dates for analysis, a
> date-based partition scheme will be helpful, as it will trim partitions that
> aren’t needed by the query before execution.
>
>
>
> If the member_map table is small, you might consider testing the feasibility
> of map-side joins, as it will reduce the number of processing stages. If
> member_map is large, bucketing on member_id will avoid having as many rows
> from visit_stats compared to each member_id for joins.
>
>
>
>
> Matt Tucker
>
>
>
>
>
> From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
> Sent: Monday, April 23, 2012 11:19 AM
> To: user@hive.apache.org
> Subject: When/how to use partitions and buckets usefully?
>
>
>
> It seems there’s enough information to be found on how to setup and use
> partitions and buckets.
>
> But I’m more interested in how to figure out when and what columns you
> should be partitioning and bucketing to increase performance?!
>
>
>
> In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols
> which give me info about the visits) and 1 member_map (member_id, gender,
> age).
>
>
>
> Usually I group by date and then one of the other col so I assume that
> partitioning on date is a good start?!
>
>
>
> It seems the join of the member_map onto the visit_stats makes the queries a
> lot slower, can that be fixed by bucketing both tables? Or just one of them?
>
>
>
>
> Maybe some ppl have written good blogs on this subject but I can’t really
> seem to find them!?
>
>
>
> Any help would be appreciated, thanks in advance J
>


-- 
Nitin Pawar

Reply via email to