> I only need to query 3 columns,
...
> The source table is about 1PB.
Format of this table is extremely critical.
A columnar data format like ORC is recommended to avoid reading any other
columns when reading 3 out of 1000.
> Will it be advised to do a subquery first, and then send it to the
>aggregation of group by, so that we have smaller files sending to
>groupby? Not sure it Hive automatically takes care of this.
Hive does column projection after the first scan, so this should not be
necessary - if you do explain logical <query>, you will see
hive> explain logical select l_shipmode, l_shipdate, sum(l_quantity) from
lineitem group by l_shipmode, l_shipdate;
LOGICAL PLAN:
lineitem
TableScan (TS_0)
alias: lineitem
Select Operator (SEL_1)
expressions: l_shipdate (type: string), l_shipmode (type: string),
l_quantity (type: double)
outputColumnNames: l_shipdate, l_shipmode, l_quantity
Group By Operator (GBY_2)
aggregations: sum(l_quantity)
keys: l_shipdate (type: string), l_shipmode (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2
The SEL_1 showing the projection of the 3 columns out of all cols in
lineitem.
Cheers,
Gopal