There's more here than Bucketing or Tez. > PARTITIONED BY(daydate STRING, epoch BIGINT) > CLUSTERED BY(r_crs_id) INTO 64 BUCKETS
I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp. CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here. > If I perform the inserts with the mr engine, which is now deprecated, when > the partition is created there are always 64 bucket files, even if some are 0 > length. In this case I can query with Presto just fine. The files should be at least 3 bytes long containing the 3 letters 'O','R','C'. If you have a bucketing impl which is producing empty files when hash(20k) values % 64 is producing 0 rows for some buckets, the skew is unnatural. > However, when the Tez engine is used, only buckets that contain data are > created, and Presto fails since the table properties in the metastore (64 > buckets) are a mismatch with the actual bucket count. That was done to prevent creating these 3 byte files which actually cost real money to store (because storage is usually at the 4k block level, this uses 1000x more storage). You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent). Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat). > we'd really like to use buckets, since the the r_crs_id in production can > contain 20,000 values. With 20k values and 64 buckets, if you have zero sized buckets - you might want to run your hash values through this. https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6 > "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is > always set to "true", but I can't seem to find a description of its intent. > Anyway, if it's supposed to force the creation of buckets it's not working > with the Tez engine. Previously you could disable "enforce.bucketing=false" and the inserts of data won't use the fixed number of buckets & can play other tricks with the data layout. That's not the config you're looking for. There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (but just that it is possible to use it, but not very easily). > configuration so that r_crs_id was at least row optimized and sorted within > the ORC files. SORTED BY in the Table DDL should do the trick - I like to use a multi-dim sort, in some of these scenarios. CLUSTERED BY(r_crs_id) SORTED BY(r_crs_id, id) INTO 67 BUCKETS If you want to know why I like primes, when % 64 is done on even numbers. len(set([(x*2) % 64 for x in xrange(1000)])) fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data. len(set([(x*2) % 61 for x in xrange(1000)])) fills all 61 buckets - producing better bucketing & no 0 sized files. FYI any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074 Cheers, Gopal