Hello there, 

I've done a ton of reading and testing across Hive, Presto, Tez, etc. In our 
test environment I have 8 tables being loaded with Hive, and that works fine 
under certain circumstances. 

Most of our data comes from JSON files in S3: we create a temporary external 
table over them and then perform an INSERT OVERWRITE to transform the data to 
ORC format in Hive managed tables. 
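
For context, a minimal sketch of the staging side of that pattern (the column 
list mirrors the managed table below, the S3 path is made up, and the SerDe 
line assumes hive-hcatalog-core is on the classpath):

CREATE EXTERNAL TABLE IF NOT EXISTS raw_measurements_external (
    id STRING,
    session STRING,
    tt_ts INT,
    r_crs_id INT,
    user_agent STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://raw-json/measurements/';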

A simple example of the CREATE statement for the internal table looks like this 
(I've left out a number of irrelevant columns): 

CREATE TABLE IF NOT EXISTS measurements (
    id STRING,
    session STRING,
    tt_ts INT,
    r_crs_id INT,
    user_agent STRING
)
PARTITIONED BY(daydate STRING, epoch BIGINT)
CLUSTERED BY(r_crs_id) INTO 64 BUCKETS
STORED AS ORC
LOCATION 's3://warehouse/'
TBLPROPERTIES ("orc.compress"="ZLIB");


If I perform the inserts with the MR engine (which is now deprecated), the 
partition is always created with all 64 bucket files, even if some are zero 
length.  In that case I can query with Presto just fine.

However, when the Tez engine is used, only the buckets that contain data are 
created, and Presto fails because the table properties in the metastore (64 
buckets) don't match the actual number of bucket files.

I've seen some other postings on Stack Exchange and via Google where others 
have run into this.  We really need to use Presto to query Hive, and we'd 
really like to use buckets, since r_crs_id in production can contain 20,000 
values.

So a few questions (thanks for any help):

The easiest fix for us would be a way to force Tez to create all buckets the 
way MR does, so the behavior doesn't change. Is that possible? 
"hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is 
always set to "true", but I can't seem to find a description of its intent.  
In any case, if it's supposed to force the creation of buckets, it's not 
working with the Tez engine.

Alternately, I was thinking that maybe someone could help me with the ORC file 
configuration so that r_crs_id is at least covered by the ORC row index and 
sorted within the ORC files.  I've been reading up on ORC and the Hive 
statements related to ORC files, but I seem to have hit my limit of 
understanding.  It appears that I should at least have the rows sorted by 
r_crs_id when written, but I'm having problems figuring out how to apply that 
with the INSERT statement.

It appears that the ORC table should be created with "orc.create.index"="true" 
for a start.
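
Putting that together with the existing DDL, I'm imagining something like the 
following (the SORTED BY clause and the bloom filter property are my guesses, 
and I gather "orc.create.index" already defaults to true on recent versions):

CREATE TABLE IF NOT EXISTS measurements (
    id STRING,
    session STRING,
    tt_ts INT,
    r_crs_id INT,
    user_agent STRING
)
PARTITIONED BY(daydate STRING, epoch BIGINT)
CLUSTERED BY(r_crs_id) SORTED BY(r_crs_id) INTO 64 BUCKETS
STORED AS ORC
LOCATION 's3://warehouse/'
TBLPROPERTIES (
    "orc.compress"="ZLIB",
    "orc.create.index"="true",
    "orc.bloom.filter.columns"="r_crs_id"
);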

Currently our INSERT looks like this (we never UPDATE or DELETE):

FROM raw_measurements_external INSERT OVERWRITE TABLE raw_measurements 
PARTITION(daydate='{2}', epoch={3}) SELECT *;

Something like this?

FROM raw_measurements_external INSERT OVERWRITE TABLE 
raw_measurements PARTITION(daydate='{2}', epoch={3}) SELECT * SORT BY r_crs_id;
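
If it helps anyone answer, I was planning to sanity-check the result with 
orcfiledump, which I believe prints per-row-group min/max statistics and so 
should show whether the sort actually took effect (the partition values and 
file name here are just an example, and I'm assuming the CLI can read the S3 
path):

# path is hypothetical
hive --orcfiledump s3://warehouse/daydate=2017-01-01/epoch=1483228800/000000_0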

Thanks for any help from more experienced Hive developers (most of you).
