There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition 
for every timestamp.

CLUSTERED BY in the DDL does not behave like a query-level CLUSTER BY, which 
it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY next to it to 
produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when 
> the partition is created there are always 64 bucket files, even if some are 0 
> length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If your bucketing implementation is producing empty files - i.e. hashing 20k 
distinct values % 64 yields 0 rows for some buckets - the skew is unnatural.
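
To put a number on "unnatural": if the hash were anywhere near uniform, the 
odds of even one empty bucket out of 64 after 20k distinct keys are 
effectively zero (a quick union-bound estimate, nothing Hive-specific):

# chance that at least one of 64 buckets sees none of 20,000 uniformly
# hashed distinct keys (union bound - an overestimate, and still ~1e-135)
p_some_empty = 64 * (63.0 / 64.0) ** 20000
print(p_some_empty)

A zero-length bucket therefore means the keys - or the hash applied to them - 
have structure in them.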

>  However, when the Tez engine is used, only buckets that contain data are 
> created, and Presto fails since the table properties in the metastore (64 
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3-byte files, which cost real money 
to store - storage is usually allocated at the 4 KB block level, so each of 
those files burns over 1000x the space it actually needs.

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table 
(in the >64 case, at least some files have a _Copy suffix, but the bucket 
prefix is consistent).

Presto just bails out when it finds something strange, because they do not 
trust the Hive bucketing implementation - there is a really good reason why 
there are 2 bucket pruning configs in Tez for the same feature 
(hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the r_crs_id in production can 
> contain 20,000 values.

With 20k values and 64 buckets, if you are ending up with zero-sized buckets 
you might want to run your hash values through this:

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6
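
If you'd rather not click through, a rough local version of that check looks 
like this (a sketch only: it assumes Hive's bucketing for an int key reduces 
to (hash & Integer.MAX_VALUE) % numBuckets, and the key values below are 
made up):

from collections import Counter

def bucket_histogram(keys, num_buckets):
    # simplified int-key model: (value & Integer.MAX_VALUE) % num_buckets
    return Counter((k & 0x7FFFFFFF) % num_buckets for k in keys)

# hypothetical r_crs_id values - 20k ids that all happen to be multiples of 4
keys = [i * 4 for i in range(20000)]

for n in (64, 61, 67):
    hist = bucket_histogram(keys, n)
    print(n, "buckets:", n - len(hist), "empty, max fill", max(hist.values()))

Feed it the real distinct r_crs_id values and it will tell you how many 
buckets actually get data for a given bucket count.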

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is 
> always set to "true", but I can't seem to find a description of its intent.  
> Anyway, if it's supposed to force the creation of buckets it's not working 
> with the Tez engine.

Previously you could set "hive.enforce.bucketing=false" so that inserts would 
not use the fixed number of buckets and could play other tricks with the data 
layout. That's not the config you're looking for.

There is a very sharp knife with no handle in the Hive toolkit called 
"hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth 
explaining right now - enough to say that it can be used, just not very 
easily.

> configuration so that r_crs_id was at least row optimized and sorted within 
> the ORC files. 

SORTED BY in the table DDL should do the trick - I like to use a 
multi-dimensional sort in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, look at what happens when % 64 is 
applied to even numbers.

len(set([(x*2) % 64 for x in range(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets 
have 0x data.

len(set([(x*2) % 61 for x in range(1000)]))

fills all 61 buckets - producing better bucketing & no zero-sized files.

FYI any prime number other than 31 works nicer than a 2^n - 
https://issues.apache.org/jira/browse/HIVE-7074
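
The 31 caveat comes from the multiplier in Java's String.hashCode. A quick 
simulation (my sketch - made-up short string keys, and a simplified model of 
how string columns get bucketed) shows how hash % 31 can collapse to just the 
last character of the key:

def java_string_hash(s):
    # Java's String.hashCode is h = 31*h + ch; these keys are short enough
    # that 32-bit overflow never kicks in, which keeps the effect clean
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

# hypothetical 4-digit string keys
keys = ["%04d" % i for i in range(10000)]

for n in (31, 61, 67):
    filled = len({(java_string_hash(k) & 0x7FFFFFFF) % n for k in keys})
    print(n, "buckets ->", filled, "filled")

With these keys 31 buckets fill only 10 ways (one per possible final digit), 
while 61 or 67 spread the same keys across every bucket.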

Cheers,
Gopal    

