In our case we’ve chosen 128 buckets, but that’s just an arbitrary figure we
picked to get a good, even distribution.
To fix the issue we were having with the small file, we just updated the setting
hive.exec.max.dynamic.partitions.pernode to 1. That way, if we do (very rarely)
run a tiny file which only allocates one reducer, we can be sure we don’t run
into this issue again.
With thanks,
Daniel Harper
Software Engineer, OTG ANT
BC5 A5
From: Mich Talebzadeh <m...@peridale.co.uk>
Reply-To: user@hive.apache.org
Date: Friday, 17 April 2015 10:18
To: user@hive.apache.org
Subject: RE: [Hive 0.13.1] - Explanation/confusion over Fatal error occurred
when node tried to create too many dynamic partitions on small dataset with
dynamic partitions
Hi Lefty,
I took a look at the documentation link and I noticed that it can be improved.
For example the paragraph below:
“How does Hive distribute the rows across the buckets? In general, the bucket
number is determined by the expression hash_function(bucketing_column) mod
num_buckets. (There's a '0x7FFF in there too, but that's not that
important). The hash_function depends on the type of the bucketing column. For
an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and
there were 10 buckets, we would expect all user_id's that end in 0 to be in
bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other
datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the
same as the BIGINT. And the hash of a string or a complex datatype will be some
number that's derived from the value, but not anything humanly-recognizable.
For example, if user_id were a STRING, then the user_id's in bucket 1 would
probably not end in 0. In general, distributing rows based on the hash will
give you a even distribution in the buckets.
So, what can go wrong? As long as you set hive.enforce.bucketing = true, and
use the syntax above, the tables should be populated properly. Things can go
wrong if the bucketing column type is different during the insert and on read,
or if you manually cluster by a value that's different from the table
definition.”
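To make the quoted rule concrete, here is a minimal Python sketch of the integer case only. It is an illustration, not Hive's code: the function names are hypothetical, and the mask value is an assumption (taken here to be the largest 32-bit signed integer, since the quoted text only mentions it in passing). Bucket indexes are counted from 0 in the sketch, so index 0 corresponds to what the quoted paragraph calls "bucket 1".

```python
# Illustrative sketch only; names are hypothetical and this is not Hive's implementation.
# ASSUMPTION: the mask is the largest 32-bit signed integer, which keeps the hash non-negative.
MASK = 0x7FFFFFFF


def hash_int(i: int) -> int:
    """For an int column the quoted text says hash_int(i) == i."""
    return i


def bucket_index(user_id: int, num_buckets: int) -> int:
    """Zero-based bucket index: hash_function(bucketing_column) mod num_buckets."""
    return (hash_int(user_id) & MASK) % num_buckets


# With 10 buckets, ids ending in 0 share one bucket, ids ending in 1 the next, and so on.
ids = [10, 21, 32, 120, 1001]
assignment = {uid: bucket_index(uid, 10) for uid in ids}
print(assignment)  # {10: 0, 21: 1, 32: 2, 120: 0, 1001: 1}
```

The same function with num_buckets set to 128 would give the layout Daniel describes, although for BIGINT, STRING and complex types the real hash is not the identity, so the pattern would not be readable by eye.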
So in a nutshell, num_buckets determines the granularity of hashing and the
number of files: the table will eventually have number_partitions x num_buckets
files in total. The example (not shown above) mentions 256 buckets, but that is
just a number.
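As a rough illustration of that arithmetic, the sketch below multiplies the two figures out. It assumes every bucket in every partition is populated and produces exactly one file; the partition count of 30 is a made-up figure, not one taken from this thread.

```python
# Illustrative arithmetic only.
# ASSUMPTION: each bucket of each partition ends up as exactly one file.
def total_files(number_partitions: int, num_buckets: int) -> int:
    """Total files for a bucketed, partitioned table: number_partitions x num_buckets."""
    return number_partitions * num_buckets


# Hypothetical table with 30 partitions, at the bucket counts mentioned in this thread.
for buckets in (32, 128, 256):
    print(buckets, "buckets ->", total_files(30, buckets), "files")
# 32 buckets -> 960 files
# 128 buckets -> 3840 files
# 256 buckets -> 7680 files
```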
It also states “For example, …and there were 10 buckets”. This is not standard.
Bucketing is a method of getting data “evenly distributed” over many files.
Thus, one should define num_buckets as a power of two (2^n, such as 2, 4, 8,
16, etc.) to achieve the best results and the best clustering.
I will try to see the upper limits on the number of buckets within a partition
and will get back on that.
HTH
Mich Talebzadeh
http://talebzadehmich.wordpress.com
From: Lefty Leverenz <leftylever...@gmail.com>
Sent: 17 April 2015 00:06
To: user@hive.apache.org
Subject: Re: [Hive 0.13.1] - Explanation/confusion over Fatal error occurred
when node tried to create too many dynamic partitions on small dataset with
dynamic partitions
If the number of buckets in a partitioned table has a limit, we need to
document it in the wiki. Currently the example
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables)
shows 256 buckets.
-- Lefty
On Thu, Apr 16, 2015 at 4:35 AM, Daniel Harper
<daniel.har...@bbc.co.uk> wrote:
As in, you can only have 32 buckets (rather than 128, in our case)?
With thanks,
Daniel Harper
Software Engineer, OTG ANT
BC5 A5
From: Mich Talebzadeh <m...@peridale.co.uk>
Reply-To: user@hive.apache.org
Date