Hi Saurabh.

 

Bucketing in Hive refers to hash partitioning where a hashing function is
applied. Likewise an RDBMS, Hive will apply a linear hashing algorithm to
prevent data from clustering within specific partitions. Hashing is very
effective if the column selected for bucketing has very high selectivity
like an ID column where selectivity (select
count(distinct(column))/count(column) ) = 1.  In this case, the created
partitions/ files will be as evenly sized as possible. In a nutshell
bucketing is a method to get data evenly distributed over many
partitions/files.  One should define the number of buckets by a power of two
-- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing will
help concurrency in Hive. It may even allow a partition wise join i.e. a
join between two tables that are bucketed on the same column with the same
number of buckets (anyone has tried this?).

 

You need to work out the selectivity of column you are using for bucketing
using the above formulae or something similar then decide on the number of
buckets say clustered by (object_id) into 256 buckets.

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 

-----Original Message-----
From: Db-Blog [mailto:mpp.databa...@gmail.com] 
Sent: 06 September 2015 21:22
To: user@hive.apache.org
Subject: Re: Bucketing- Identify Number of Buckets

 

Details of Hive Version:

I am using Hive -14.0 with Tez as execution engine. 

 

Thanks,

Saurabh

 

Sent from my iPhone, please avoid typos.

 

> On 07-Sep-2015, at 1:51 am, Db-Blog < <mailto:mpp.databa...@gmail.com>
mpp.databa...@gmail.com> wrote:

> 

> Hi, 

> 

> I need to join two big tables in hive. The join key is the grain of both
these tables, hence clustering and sorting on the same will provide
significant performance optimisation while joining.  

> 

> However, i am not sure how to calculate the exact number of buckets while
creating these tables. Can someone please share any pointers on the same? 

> 

> Planning to keep these Clustered and Sorted tables as parquet/orc- for
columnar storage and better compression. 

> 

> Thanks,

> Saurabh

> 

> Sent from my iPhone, please avoid typos.

Reply via email to