[ https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192512#comment-14192512 ]
Prasanth J commented on HIVE-8671: ---------------------------------- The issue is related to ReduceSink's stats annotation rule. It does not rename the column statistics properly. The group by key column names gets overwritten with column stats of aggregate column since both of the have same suffix but different prefix (KEY, VALUE).. This is the reason for the overflow. I am still working on a fix. Will put up a patch shortly. > Overflow in estimate row count and data size with fetch column stats > -------------------------------------------------------------------- > > Key: HIVE-8671 > URL: https://issues.apache.org/jira/browse/HIVE-8671 > Project: Hive > Issue Type: Bug > Components: Physical Optimizer > Affects Versions: 0.14.0 > Reporter: Mostafa Mokhtar > Assignee: Prasanth J > Priority: Critical > Fix For: 0.14.0 > > Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch > > > Overflow in row counts and data size for several TPC-DS queries. > Interestingly the operators which have overflow end up running with a small > parallelism. > For instance Reducer 2 has an overflow but it only runs with parallelism of 2. > {code} > Reducer 2 > Reduce Operator Tree: > Group By Operator > aggregations: sum(VALUE._col0) > keys: KEY._col0 (type: string), KEY._col1 (type: string), > KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) > mode: mergepartial > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 > Statistics: Num rows: 9223372036854775807 Data size: > 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col3 (type: string), _col3 (type: string) > sort order: ++ > Map-reduce partition columns: _col3 (type: string) > Statistics: Num rows: 9223372036854775807 Data size: > 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col0 (type: string), _col1 (type: > string), _col2 (type: string), _col3 (type: string), _col4 (type: float), > _col5 (type: double) > Execution mode: vectorized > {code} > {code} > VERTEX TOTAL_TASKS DURATION_SECONDS CPU_TIME_MILLIS > INPUT_RECORDS OUTPUT_RECORDS > Map 1 62 26.41 1,779,510 > 211,978,502 60,628,390 > Map 5 1 4.28 6,950 > 138,098 138,098 > Map 6 1 2.44 3,910 > 31 31 > Reducer 2 2 22.69 61,320 > 60,628,390 69,182 > Reducer 3 1 2.63 3,910 > 69,182 100 > Reducer 4 1 1.01 1,180 > 100 100 > {code} > Query > {code} > explain > select i_item_desc > ,i_category > ,i_class > ,i_current_price > ,i_item_id > ,sum(ws_ext_sales_price) as itemrevenue > ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over > (partition by i_class) as revenueratio > from > web_sales > ,item > ,date_dim > where > web_sales.ws_item_sk = item.i_item_sk > and item.i_category in ('Jewelry', 'Sports', 'Books') > and web_sales.ws_sold_date_sk = date_dim.d_date_sk > and date_dim.d_date between '2001-01-12' and '2001-02-11' > group by > i_item_id > ,i_item_desc > ,i_category > ,i_class > ,i_current_price > order by > i_category > ,i_class > ,i_item_id > ,i_item_desc > ,revenueratio > limit 100 > {code} > Explain > {code} > STAGE PLANS: > Stage: Stage-1 > Tez > Edges: > Map 1 <- Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE) > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (SIMPLE_EDGE) > Reducer 4 <- Reducer 3 (SIMPLE_EDGE) > DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: web_sales > filterExpr: ws_item_sk is not null (type: boolean) > Statistics: Num rows: 21594638446 Data size: 2850189889652 > Basic stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ws_item_sk is not null (type: boolean) > Statistics: Num rows: 21594638446 Data size: 172746300152 > Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: ws_item_sk (type: int), ws_ext_sales_price > (type: float), ws_sold_date_sk (type: int) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 21594638446 Data size: > 172746300152 Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {_col0} {_col1} > 1 > keys: > 0 _col2 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col0, _col1 > input vertices: > 1 Map 6 > Statistics: Num rows: 24145061366 Data size: > 193160490928 Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {_col1} > 1 {_col1} {_col2} {_col3} {_col4} {_col5} > keys: > 0 _col0 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col1, _col6, _col7, _col8, > _col9, _col10 > input vertices: > 1 Map 5 > Statistics: Num rows: 25381041158 Data size: > 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col6 (type: string), _col7 (type: > string), _col10 (type: string), _col9 (type: string), _col8 (type: float), > _col1 (type: float) > outputColumnNames: _col0, _col1, _col2, _col3, > _col4, _col5 > Statistics: Num rows: 25381041158 Data size: > 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE > Group By Operator > aggregations: sum(_col5) > keys: _col0 (type: string), _col1 (type: > string), _col2 (type: string), _col3 (type: string), _col4 (type: float) > mode: hash > outputColumnNames: _col0, _col1, _col2, _col3, > _col4, _col5 > Statistics: Num rows: 119291 Data size: 954328 > Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: string), _col1 > (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: > float) > sort order: +++++ > Map-reduce partition columns: _col0 (type: > string), _col1 (type: string), _col2 (type: string), _col3 (type: string), > _col4 (type: float) > Statistics: Num rows: 119291 Data size: > 954328 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col5 (type: double) > Execution mode: vectorized > Map 5 > Map Operator Tree: > TableScan > alias: item > filterExpr: ((i_category) IN ('Jewelry', 'Sports', 'Books') > and i_item_sk is not null) (type: boolean) > Statistics: Num rows: 462000 Data size: 663862160 Basic > stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ((i_category) IN ('Jewelry', 'Sports', > 'Books') and i_item_sk is not null) (type: boolean) > Statistics: Num rows: 231000 Data size: 109491664 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: i_item_sk (type: int), i_item_id (type: > string), i_item_desc (type: string), i_current_price (type: float), i_class > (type: string), i_category (type: string) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, > _col5 > Statistics: Num rows: 231000 Data size: 109491664 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 231000 Data size: 109491664 > Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col1 (type: string), _col2 (type: > string), _col3 (type: float), _col4 (type: string), _col5 (type: string) > Execution mode: vectorized > Map 6 > Map Operator Tree: > TableScan > alias: date_dim > filterExpr: (d_date BETWEEN '2001-01-12' AND '2001-02-11' > and d_date_sk is not null) (type: boolean) > Statistics: Num rows: 73049 Data size: 81741831 Basic > stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (d_date BETWEEN '2001-01-12' AND '2001-02-11' > and d_date_sk is not null) (type: boolean) > Statistics: Num rows: 36524 Data size: 3579352 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: d_date_sk (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 146096 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 36524 Data size: 146096 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col0 (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 0 Basic stats: > PARTIAL Column stats: COMPLETE > Group By Operator > keys: _col0 (type: int) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 0 Basic > stats: PARTIAL Column stats: COMPLETE > Dynamic Partitioning Event Operator > Target Input: web_sales > Partition key expr: ws_sold_date_sk > Statistics: Num rows: 36524 Data size: 0 Basic > stats: PARTIAL Column stats: COMPLETE > Target column: ws_sold_date_sk > Target Vertex: Map 1 > Execution mode: vectorized > Reducer 2 > Reduce Operator Tree: > Group By Operator > aggregations: sum(VALUE._col0) > keys: KEY._col0 (type: string), KEY._col1 (type: string), > KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) > mode: mergepartial > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 > Statistics: Num rows: 119291 Data size: 1908656 Basic stats: > COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col3 (type: string), _col3 (type: string) > sort order: ++ > Map-reduce partition columns: _col3 (type: string) > Statistics: Num rows: 119291 Data size: 1908656 Basic > stats: COMPLETE Column stats: COMPLETE > value expressions: _col0 (type: string), _col1 (type: > string), _col2 (type: string), _col3 (type: string), _col4 (type: float), > _col5 (type: double) > Execution mode: vectorized > Reducer 3 > Reduce Operator Tree: > Extract > Statistics: Num rows: 119291 Data size: 1908656 Basic stats: > COMPLETE Column stats: COMPLETE > PTF Operator > Statistics: Num rows: 119291 Data size: 1908656 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col1 (type: string), _col2 (type: string), > _col3 (type: string), _col4 (type: float), _col0 (type: string), _col5 (type: > double), ((_col5 * 100.0) / _wcol0) (type: double) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, > _col5, _col6 > Statistics: Num rows: 119291 Data size: 954328 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col1 (type: string), _col2 (type: > string), _col4 (type: string), _col0 (type: string), _col6 (type: double) > sort order: +++++ > Statistics: Num rows: 119291 Data size: 954328 Basic > stats: COMPLETE Column stats: COMPLETE > TopN Hash Memory Usage: 0.04 > value expressions: _col3 (type: float), _col5 (type: > double) > Reducer 4 > Reduce Operator Tree: > Select Operator > expressions: KEY.reducesinkkey3 (type: string), > KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string), > VALUE._col0 (type: float), KEY.reducesinkkey2 (type: string), VALUE._col1 > (type: double), KEY.reducesinkkey4 (type: double) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, > _col6 > Statistics: Num rows: 119291 Data size: 954328 Basic stats: > COMPLETE Column stats: COMPLETE > Limit > Number of rows: 100 > Statistics: Num rows: 100 Data size: 800 Basic stats: > COMPLETE Column stats: COMPLETE > File Output Operator > compressed: false > Statistics: Num rows: 100 Data size: 800 Basic stats: > COMPLETE Column stats: COMPLETE > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Execution mode: vectorized > Stage: Stage-0 > Fetch Operator > limit: 100 > Processor Tree: > ListSink > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)