Thanks for the tip Gopal.
I tried what you suggested (on Tez) but I'm getting a middle stage with 1
reducer (which is awful for performance).

This is my query:
insert into upstreamparam_org partition(day_ts, cmtsid) select * from
upstreamparam_20151013 order by datats,macaddress;

I've attached the query plan in case it might help understand why.

Thank you.
Daniel.




On Fri, Oct 16, 2015 at 7:19 PM, Gopal Vijayaraghavan <[email protected]>
wrote:

>
> > Is there a more efficient way to have Hive merge small files on the
> >files without running with two passes?
>
> Not entirely an efficient way, but adding a shuffle stage usually works
> much better as it gives you the ability to layout the files for better
> vectorization.
>
> Like for TPC-H, doing ETL with
>
> create table lineitem as select * from lineitem sort by l_shipdate,
> l_suppkey;
>
> will produce fewer files (exactly as many as your reducer #) & compresses
> harder due to the natural order of transactions (saves ~20Gb or so at 1000
> scale).
>
> Caveat: that is not more efficient in MRv2, only in Tez/Spark which can
> run MRR pipelines as-is.
>
> Cheers,
> Gopal
>
>
>
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-3
   Stats-Aggr Operator
      Stage-0
         Move Operator
            partition:{}
            table:{"name:":"default.upstreamparam_org","input 
format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output 
format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
            Stage-2
               Dependency Collection{}
                  Stage-1
                     Reducer 2
                     File Output Operator [FS_5]
                        compressed:false
                        Statistics:Num rows: 8707462208 Data size: 
1767614828224 Basic stats: COMPLETE Column stats: NONE
                        table:{"name:":"default.upstreamparam_org","input 
format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output 
format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
                        Select Operator [SEL_3]
                        |  
outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17","_col18","_col19","_col20"]
                        |  Statistics:Num rows: 8707462208 Data size: 
1767614828224 Basic stats: COMPLETE Column stats: NONE
                        |<-Map 1 [SIMPLE_EDGE]
                           Reduce Output Operator [RS_7]
                              key expressions:_col1 (type: bigint), _col0 
(type: bigint)
                              sort order:++
                              Statistics:Num rows: 8707462208 Data size: 
1767614828224 Basic stats: COMPLETE Column stats: NONE
                              value expressions:_col2 (type: bigint), _col3 
(type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: float), 
_col7 (type: float), _col8 (type: float), _col9 (type: float), _col10 (type: 
float), _col11 (type: float), _col12 (type: float), _col13 (type: float), 
_col14 (type: float), _col15 (type: float), _col16 (type: bigint), _col17 
(type: bigint), _col18 (type: bigint), _col19 (type: bigint), _col20 (type: 
string)
                              Select Operator [OP_6]
                                 
outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17","_col18","_col19","_col20"]
                                 Statistics:Num rows: 8707462208 Data size: 
1767614828224 Basic stats: COMPLETE Column stats: NONE
                                 TableScan [TS_0]
                                    alias:upstreamparam_20151013
                                    Statistics:Num rows: 8707462208 Data size: 
1767614828224 Basic stats: COMPLETE Column stats: NONE

Reply via email to