Gopal,

Thanks for this.  Great information and something to look at more closely to 
better understand the internals.

Rick

----- Original Message -----
From: "Gopal Vijayaraghavan" <gop...@apache.org>
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>    * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  
> My understanding was that this was related to the number of buckets, but you 
> are relating it to ORC stripes.  It is odd that no examples that I've seen 
> include the SORTED BY statement other than in relation to ORC indexes (that I 
> understand).  So the question is; regardless of whether efficient ORC stripes 
> are created (wouldn't I have to also specify 'orc.create.index'='true' for 
> this to have much of an effect)

ORC + bucketing is something I've spent a lot of time on - a lot of this has to 
do with secondary characteristics of the data (e.g. the same device shows 
natural progressions in its metrics), which, when combined with a columnar 
format & ordering within files, produces better storage and runtimes together 
(usually those two are a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not 
order rows in any specific way - the partition key for the shuffle is the 
modulus, while the sort key is 0 bytes long, so it sorts by (modulus,) only; a 
quick-sort is not stable, so even the input order into the shuffle is lost & 
each bucket file is produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP 
customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2), producing a more 
predictable order during the shuffle.
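To make that concrete, here is a minimal sketch (illustrative names, not 
Hive's actual shuffle code) of how sorting on the composite (bucket modulus, 
sort key) leaves each bucket file internally ordered:

```java
import java.util.*;

// Hypothetical sketch: with SORTED BY, rows shuffle on the composite key
// (bucket modulus, sort key), so each bucket file is written in sorted order.
public class SortedBucketSketch {
    static final int NUM_BUCKETS = 4;

    // Hive-style bucket id for an int key: (hash & Integer.MAX_VALUE) % buckets,
    // where the hash of an int is the value itself.
    static int bucketId(int key) {
        return (key & Integer.MAX_VALUE) % NUM_BUCKETS;
    }

    // Group rows (each {bucketKey, sortKey}) into bucket files,
    // ordering within each bucket by the sort key.
    static Map<Integer, List<int[]>> shuffle(List<int[]> rows) {
        Map<Integer, List<int[]>> buckets = new TreeMap<>();
        for (int[] row : rows)
            buckets.computeIfAbsent(bucketId(row[0]), b -> new ArrayList<>()).add(row);
        // the shuffle sort key is (modulus, key1): sort each bucket by key1
        for (List<int[]> file : buckets.values())
            file.sort(Comparator.comparingInt(r -> r[1]));
        return buckets;
    }
}
```

Drop the inner sort and each bucket's file comes back in whatever order the 
rows happened to arrive - the unsorted case described above.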

Then key1 can be RLE encoded, so that the ORC vectorized reader can pass it on 
as key1 x 1024 repetitions & do ~1000x fewer comparisons when filtering 
integer rows.
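A rough sketch of why that helps - a sorted low-cardinality column collapses 
into (value, run-length) pairs, so an equality filter costs one comparison per 
run rather than one per row (illustrative code, not ORC's actual reader):

```java
import java.util.*;

// Sketch of the RLE win on a sorted column. Names are illustrative.
public class RleSketch {
    static class Run {
        final long value;
        final int length;
        Run(long value, int length) { this.value = value; this.length = length; }
    }

    // Collapse consecutive equal values into (value, runLength) pairs.
    static List<Run> encode(long[] column) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) j++;
            runs.add(new Run(column[i], j - i));
            i = j;
        }
        return runs;
    }

    // Count rows matching value == target: one comparison per run.
    static long filterCount(List<Run> runs, long target) {
        long n = 0;
        for (Run r : runs)
            if (r.value == target) n += r.length;
        return n;
    }
}
```

On a column where one key repeats 1024 times, the filter does 1 comparison for 
those 1024 rows instead of 1024.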

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve 
performance problems, but have ended up with bucketing as their main problem.

Most of what I have written above was discussed a few years back, and in 
general, bucketing on a high-cardinality column + sorting on a low-cardinality 
column together has given good results for my customers.

>    I hadn't thought of the even number issue, not having looked at the 
> function; I had assumed that it was a hash, not a modulus; shame on me.  
> Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java 
hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42
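Since the hash is the identity for integers, Hive's bucket id for an int 
column effectively reduces to a plain modulus - which is the even-number issue 
above: with an even bucket count, even keys can only ever land in even buckets 
(with 2 buckets, every even key lands in bucket 0). A sketch of the formula, 
in the spirit of Hive's ObjectInspectorUtils-style bucket assignment:

```java
// The Java hash of a boxed Integer is the identity function, so the
// bucket id for an int column reduces to a modulus of the value itself.
public class BucketHash {
    static int bucketId(int value, int numBuckets) {
        int hash = Integer.valueOf(value).hashCode(); // identity: == value
        // mask the sign bit so negative values still map to a valid bucket
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }
}
```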

> Finally, I'm not sure that I got a specific answer to my original question, 
> which is can I force Tez to create all bucket files so Presto queries can 
> succeed?  Anyway, I will be testing today and the solution will either be to 
> forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 
3.0 tables (Update/Delete support).

Presto could fix their fail-safe bucketing implementation to actually trust 
the Hive bucketing spec & get you out of this mess - the bucketing contract 
for Hive maps the actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile).
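Under that contract the bucket id is recoverable from the file name alone - 
bucket files are named like 000002_0, with the leading digits as the bucket 
id. A hypothetical parser in the spirit of Utilities::getBucketIdFromFile:

```java
import java.util.regex.*;

// Illustrative sketch of the naming contract: the leading digits of a
// bucket file name (e.g. 000002_0) are the bucket id, so a reader can map
// file -> bucket directly instead of counting files in the directory.
public class BucketFileName {
    private static final Pattern BUCKET = Pattern.compile("^(\\d+)_\\d+.*");

    // Returns the bucket id, or -1 if the name does not follow the contract.
    static int bucketIdFromFile(String name) {
        Matcher m = BUCKET.matcher(name);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }
}
```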

The file-count is a very flaky way to check whether the table is bucketed 
correctly - either you trust the user to have properly bucketed the table, or 
you don't use bucketing at all. Failing hard on valid tables looks pretty bad 
compared to a soft fallback.

I wrote a few UDFs which were used to validate suspect tables and fix them for 
customers who had bad historical data, loaded with 
"enforce.bucketing=false" or during the short hive-0.13 period affected by 
HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it 
(LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract: an 
uncovered bucket with no file present is assumed to have 0 rows, instead of 
erroring out.
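In sketch form (illustrative names, not Hive's actual classes), pruning under 
the contract looks like: compute the predicate value's bucket, read only that 
bucket's file, and treat a missing file as an empty bucket:

```java
import java.util.*;

// Illustrative sketch of bucket pruning under the Hive contract:
// an uncovered bucket (no file written) simply contributes 0 rows.
public class BucketPruneSketch {
    static int bucketId(int value, int numBuckets) {
        return (value & Integer.MAX_VALUE) % numBuckets;
    }

    // files: bucket id -> rows on disk; an absent key means no file exists.
    static List<Integer> scanEquals(Map<Integer, List<Integer>> files,
                                    int target, int numBuckets) {
        List<Integer> bucket = files.get(bucketId(target, numBuckets));
        if (bucket == null)
            return Collections.emptyList(); // uncovered bucket: 0 rows, no error
        List<Integer> out = new ArrayList<>();
        for (int v : bucket)
            if (v == target) out.add(v);
        return out;
    }
}
```

A strict file-count check would reject the same table outright; this version 
just returns an empty result for the missing bucket.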

If you do have the ability to redeploy Hive, the change you are looking for is 
a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal

