Gopal,

Thanks for this. Great information and something to look at more closely to better understand the internals.
Rick

----- Original Message -----
From: "Gopal Vijayaraghavan" <gop...@apache.org>
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> * I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.
> My understanding was that this was related to the number of buckets, but you
> are relating it to ORC stripes. It is odd that no examples that I've seen
> include the SORTED BY statement other than in relation to ORC indexes (that I
> understand). So the question is; regardless of whether efficient ORC stripes
> are created (wouldn't I have to also specify 'orc.create.index'='true' for
> this to have much of an effect)

ORC + bucketing is something I've spent a lot of time with - a lot of this has to do with secondary characteristics of data (i.e. the same device has natural progressions for its metrics), which, when combined with a columnar format and ordering within files, produces better storage and runtimes together (which is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not order in any specific way - the partition key for the shuffle is the modulus, while the order key is 0 bytes long, so it sorts by (modulus,), which for a quick-sort also loses the input order into the shuffle, and each bucket file is produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP customers who have performance problems with ORC. This turns the shuffle key into (modulus, key1, key2), producing a more predictable order during the shuffle. Then key1 can be RLE encoded, so the ORC vector impl will pass it on as key1 x 1024 repetitions and do ~1000x fewer comparisons when filtering rows for integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5 was written as a warning to customers who use bucketing to try and solve performance problems, but have ended up with bucketing as their main problem.
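The shuffle behavior described above can be sketched in plain Java. This is an illustration only, not Hive's actual shuffle code: rows are partitioned by (hash % numBuckets), and with an explicit SORTED BY each bucket's rows are additionally ordered by the sort key, which is what makes equal keys contiguous and RLE-friendly inside each ORC bucket file. The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BucketSortSketch {
    // Bucket assignment as described in the thread: hash % numBuckets.
    // Java's Integer.hashCode is the identity, so for int keys this is
    // effectively just a modulus.
    static int bucketFor(int key, int numBuckets) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        int[] keys = {7, 4, 11, 4, 7, 10, 4, 11};

        // Shuffle: route each row to its bucket (partition key = modulus).
        List<List<Integer>> buckets = new ArrayList<>();
        for (int b = 0; b < numBuckets; b++) buckets.add(new ArrayList<>());
        for (int k : keys) buckets.get(bucketFor(k, numBuckets)).add(k);

        // With SORTED BY, each bucket file is written in key order, so equal
        // keys form contiguous runs that a columnar format can RLE-encode.
        for (List<Integer> bucket : buckets) Collections.sort(bucket);

        System.out.println(buckets); // e.g. [[4, 4, 4], [], [10], [7, 7, 11, 11]]
    }
}
```

Without the per-bucket sort, the same rows land in the same files but in effectively random order within each file, which is the "sorts by (modulus,)" case above.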
Most of what I have written above was discussed a few years back, and in general, bucketing on a high-cardinality column + sorting on a low-cardinality column together has given good results for my customers.

> I hadn't thought of the even number issue, not having looked at the
> function; I had assumed that it was a hash, not a modulus; shame on me.
> Reading the docs I see that hash is only used on string columns

Actually, a hash is used in theory, but I entirely blame Java for it - the Java hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question,
> which is can I force Tez to create all bucket files so Presto queries can
> succeed? Anyway, I will be testing today and the solution will either be to
> forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today, and Presto is already incompatible with Hive 3.0 tables (Update/Delete support).

Presto could fix their fail-safe bucketing implementation to actually trust the Hive bucketing spec and get you out of this mess - the bucketing contract for Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile). The file-count is a very flaky way to check whether the table is bucketed correctly - either you trust the user to have properly bucketed the table, or you don't use bucketing at all. Failing on valid tables looks pretty bad, compared to a soft fallback.

I wrote a few UDFs which were used to validate suspect tables and fix them for customers who had bad historical data, loaded with "enforce.bucketing=false" or during the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).
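Both points above - the identity hash and the "even number issue" - can be checked in plain Java. Since hash(key) == key for boxed integers, bucketing a column whose values are all even into an even number of buckets leaves half the bucket files empty. A minimal sketch (the class name is hypothetical):

```java
public class EvenBucketSkew {
    public static void main(String[] args) {
        // Java's hash for boxed integers/longs is the identity function.
        System.out.println(Integer.valueOf(42).hashCode()); // prints 42
        System.out.println(Long.valueOf(42L).hashCode());   // prints 42

        // Because hash(key) == key, an all-even key column with an even
        // bucket count only ever lands in even-numbered buckets.
        int numBuckets = 8;
        boolean[] used = new boolean[numBuckets];
        for (int key = 0; key < 1000; key += 2) { // even keys: 0, 2, 4, ...
            used[(key & Integer.MAX_VALUE) % numBuckets] = true;
        }
        int populated = 0;
        for (boolean u : used) if (u) populated++;
        System.out.println(populated + " of " + numBuckets + " buckets populated");
        // prints "4 of 8 buckets populated"
    }
}
```

This is exactly why the advice above is to bucket on a high-cardinality column: a skewed or correlated key with a modulus-style hash produces empty or lopsided bucket files.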
Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236

Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually follows the Hive bucketing contract: an uncovered bucket with no file present is assumed to have 0 rows, instead of erroring out.

If you do have the ability to redeploy Hive, the change you are looking for is a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal
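To make that contract concrete, here is a simplified sketch - not Hive's actual Utilities::getBucketIdFromFile, which handles more file-name variants - of the two halves of it: the bucket id is parsed from the leading digits of the file name (e.g. 000002_0), and a bucket with no file on disk is treated as containing 0 rows rather than as an error. Class and method names here are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;

public class BucketPruneSketch {
    // Simplified version of the file-name half of the contract:
    // bucket files are named "<bucketId>_<attempt>", e.g. "000002_0".
    static int bucketIdFromFile(String name) {
        return Integer.parseInt(name.substring(0, name.indexOf('_')));
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        // Only two of the four bucket files exist on disk.
        List<String> files = Arrays.asList("000001_0", "000003_0");

        // Predicate: key = 42; its bucket is hash % numBuckets = 2.
        int wanted = (Integer.hashCode(42) & Integer.MAX_VALUE) % numBuckets;

        boolean covered = files.stream()
                               .anyMatch(f -> bucketIdFromFile(f) == wanted);
        // Per the contract, a missing bucket file means 0 rows, not an error.
        System.out.println(covered
                ? "scan bucket file " + wanted
                : "bucket " + wanted + " uncovered: 0 rows");
    }
}
```

A reader that instead counts files and demands exactly numBuckets of them (the flaky check criticized above) would reject this perfectly valid table.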