On 19 July 2016 at 18:31, Gopal Vijayaraghavan wrote:

> > What was the type (Parquet, text, ORC etc.) and row count for each of the
> > three tables above?
> I always use ORC for flat columnar data.
> ORC is designed to be ideal if you have measures/dimensions normalized into
> tables - most SQL workloads don't start with an indefinite-depth tree.
>
> hive> select count(1) from store_sales;
> OK
> 2879987999
> Time taken: 2.603 seconds, Fetched: 1 row(s)
> hive> select count(1) from store;
> OK
> 1002
> Time taken: 0.213 seconds, Fetched: 1 row(s)
> hive> select count(1) from date_dim;
> OK
> 73049
> Time taken: 0.186 seconds, Fetched: 1 row(s)
> hive>
>
> The DPP semi-join for date_dim is very fast, so out of the ~2.8 billion
> records only 93 million are read into the cache.
>
> Standard TPC-DS data-set at 1000 scale - same layout you can get from
> hive-testbench && ./ 1000;
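
To put the DPP figures quoted above in perspective, here is a minimal Python sketch of the idea, not Hive's actual implementation. It assumes the fact table is partitioned by the date key, which the thread does not state outright. The function name `dpp_scan` and the tiny data set are hypothetical; only the 93 million and 2,879,987,999 figures come from the quoted message.

```python
# Toy model of dynamic partition pruning (DPP); NOT how Hive implements it.
# Assumption: the fact table is partitioned by the dimension's join key.
# All names and sample data below are hypothetical illustrations.

def dpp_scan(fact_partitions, dim_rows, dim_predicate):
    """Return (rows_read, rows_total) for a fact table partitioned by join key."""
    # Step 1: evaluate the filter on the small dimension table only.
    keep_keys = {key for key, attrs in dim_rows.items() if dim_predicate(attrs)}
    # Step 2: read only the fact partitions whose key survived the filter.
    rows_read = sum(n for key, n in fact_partitions.items() if key in keep_keys)
    rows_total = sum(fact_partitions.values())
    return rows_read, rows_total

# Hypothetical dimension table: date key -> attributes.
date_dim = {1: {"year": 2000}, 2: {"year": 2000}, 3: {"year": 2001}, 4: {"year": 2002}}
# Hypothetical fact table: partition key (date key) -> row count in that partition.
store_sales = {1: 100, 2: 150, 3: 400, 4: 350}

read, total = dpp_scan(store_sales, date_dim, lambda d: d["year"] == 2000)

# Figures quoted in the thread: 93 million rows read out of 2,879,987,999.
thread_fraction = 93_000_000 / 2_879_987_999
```

With the quoted figures, `thread_fraction` comes out at a little over 3%, so roughly 97% of store_sales is never read for that query.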
