[ https://issues.apache.org/jira/browse/DRILL-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Venki Korukanti resolved DRILL-3884.
------------------------------------
    Resolution: Fixed

> Hive native scan has lower parallelization leading to performance degradation
> -----------------------------------------------------------------------------
>
>                 Key: DRILL-3884
>                 URL: https://issues.apache.org/jira/browse/DRILL-3884
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization, Storage - Hive
>    Affects Versions: 1.2.0
>            Reporter: Venki Korukanti
>            Assignee: Venki Korukanti
>            Priority: Critical
>             Fix For: 1.2.0
>
>
> Currently {{HiveDrillNativeParquetScan.getScanStats()}} divides the rowCount
> obtained from {{HiveScan}} by a factor and returns that as the cost. The
> problem is that all cost calculations and parallelization depend on the
> rowCount. The {{cpuCost}} value is not taken into consideration in the current
> cost calculations in {{ScanPrel}}. In order for the planner to choose
> {{HiveDrillNativeParquetScan}} over {{HiveScan}}, rowCount has to be lowered
> for the former, but this leads to lower parallelization and performance
> degradation.
> Temporary fix for Drill 1.2, until DRILL-3856 fully resolves the handling of
> CPU cost in the cost model:
> 1. Change {{ScanPrel}} to consider the CPU cost in the given Stats from GroupScan
> 2. Have a higher CPU cost for {{HiveScan}} (SerDe route)
> 3. Have a lower CPU cost for {{HiveDrillNativeParquetScan}}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
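
The trade-off described in the issue can be illustrated with a small, self-contained sketch. This is not Drill's planner code: the class `ScanCostSketch`, its `Stats` type, the `SLICE_TARGET` threshold of 100,000 rows, the divisor of 10, the per-row CPU weights, and the additive cost formula are all hypothetical values chosen for illustration. It only shows why shrinking rowCount to win the cost comparison also shrinks the parallelization width, and how a separate CPU cost could avoid that.

```java
// Illustrative sketch only; none of these names are Drill's real planner API.
final class ScanCostSketch {

    /** Stand-in for the stats a scan reports to the planner. */
    static final class Stats {
        final double rowCount;
        final double cpuCost;

        Stats(double rowCount, double cpuCost) {
            this.rowCount = rowCount;
            this.cpuCost = cpuCost;
        }
    }

    // Assumed number of rows per fragment used to derive the width.
    static final long SLICE_TARGET = 100_000;

    /** Width is derived from rowCount, so a lower rowCount means fewer fragments. */
    static int width(Stats s) {
        return (int) Math.max(1, Math.ceil(s.rowCount / SLICE_TARGET));
    }

    /** Cost comparison that looks at rowCount only (the behavior the issue describes). */
    static double rowOnlyCost(Stats s) {
        return s.rowCount;
    }

    /** Cost comparison that also counts CPU cost (assumed additive formula). */
    static double costWithCpu(Stats s) {
        return s.rowCount + s.cpuCost;
    }

    public static void main(String[] args) {
        double rows = 1_000_000;

        // Before: the native scan reports rowCount / 10 to look cheaper.
        Stats hive = new Stats(rows, 2 * rows);      // SerDe route, higher CPU weight
        Stats nativeOld = new Stats(rows / 10, 0);   // lowered rowCount
        // After: same rowCount, lower CPU cost instead.
        Stats nativeNew = new Stats(rows, 1 * rows);

        System.out.println("old native cheaper: " + (rowOnlyCost(nativeOld) < rowOnlyCost(hive)));
        System.out.println("old native width:   " + width(nativeOld));
        System.out.println("new native cheaper: " + (costWithCpu(nativeNew) < costWithCpu(hive)));
        System.out.println("new native width:   " + width(nativeNew));
    }
}
```

In this sketch both approaches make the native scan the cheaper plan, but only the CPU-cost variant keeps its width equal to that of the SerDe-based scan.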