[jira] [Updated] (HIVE-16290) Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
[ https://issues.apache.org/jira/browse/HIVE-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated HIVE-16290:
    Fix Version/s: (was: 2.3.0)
                   3.0.0

> Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
>
> Key: HIVE-16290
> URL: https://issues.apache.org/jira/browse/HIVE-16290
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Minor
> Fix For: 3.0.0
> Attachments: HIVE-16290.1.patch, HIVE-16290.2.patch
>
> Issue:
> In {{StatsRulesProcFactory::evaluateComparator}}, when {{minValue}} is >= the filtered {{value}}, the estimate should be all rows. Currently, it returns {{numRows/3}}. This causes fewer reducers to be spun up in queries, for example Q79 in TPC-DS.
>
> E.g.: TPC-DS store table stats:
> {noformat}
> hive --orcfiledump hdfs://nn:8020/apps/hive/warehouse/tpcds_bin_partitioned_orc_1000.db/store/00_0
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 1002 hasNull: false
>     Column 1: count: 1002 hasNull: false min: 1 max: 1002 sum: 502503
>     Column 2: count: 1002 hasNull: false min: AABA max: PPBA sum: 16032
>     Column 3: count: 1002 hasNull: false min: max: 2001-03-13 sum: 9950
>     Column 4: count: 1002 hasNull: false min: max: 2001-03-12 sum: 5010
>     Column 5: count: 273 hasNull: true min: 2450820 max: 2451313 sum: 669141525
>     Column 6: count: 1002 hasNull: false min: max: pri sum: 3916
>     Column 7: count: 994 hasNull: true min: 200 max: 300 sum: 249970
>     Column 8: count: 996 hasNull: true min: 5002549 max: 9997773 sum: 7382689071
>     Column 9: count: 1002 hasNull: false min: max: 8AM-8AM sum: 7088
>
> select compute_stats(s_employee_count, 16) from store;
> {"columntype":"Long","min":200,"max":300,"countnulls":8,"numdistinctvalues":63,"ndvbitvector":"{0, 1, 2, 3, 4, 5, 11, 12}{0, 1, 2, 3, 6}{0, 1, 2, 3, 4, 5, 7, 11}{0, 1, 2, 3, 4, 5, 7}{0, 1, 2, 3, 4, 5, 6}{0, 1, 2, 3, 4, 5, 8}{0, 1, 2, 3, 4}{0, 1, 2, 3, 4, 5, 7, 9}{0, 1, 2, 3, 4}{0}{0, 1, 2, 3, 4, 5, 7}{0, 1, 2, 3, 4, 5, 6, 7}{0, 1, 2, 3, 4, 8, 9, 14}{0, 1, 2, 3, 5}{0, 1, 2, 3, 4, 5, 6, 7}{0, 1, 2, 3, 4, 5, 6, 8}"}
> {noformat}
>
> {noformat}
> explain select count(s_store_sk) from store where s_number_employees > 200 and s_number_employees < 295;
> {noformat}
>
> The above query first estimates 1002 / 3 = 334 rows for {{s_number_employees > 200}} and then 334 / 3 = 111 rows for {{s_number_employees < 295}}. Ideally, the estimate for the filter {{s_number_employees > 200}} should be all 1002 rows.
> In TPC-DS Q79, this causes too few reduce tasks to be spun up, causing runtime delays.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
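The arithmetic in the description can be reproduced with a small Java sketch. This is only an illustration under stated assumptions: the class `ComparatorEstimateSketch` and its methods are hypothetical names, not the code in `StatsRulesProcFactory` and not the contents of the attached patches. It contrasts the flat `numRows/3` fallback with a range-aware estimate that uses the column's min and max.

```java
// Illustrative sketch only; hypothetical names, not Hive's actual implementation.
final class ComparatorEstimateSketch {

    /** Fallback described in the issue: assume one third of the input rows pass. */
    public static long oneThird(long numRows) {
        return numRows / 3;
    }

    /** Range-aware estimate for "col > value" using column min/max (assumed logic). */
    public static long estimateGreaterThan(long numRows, long min, long max, long value) {
        if (value >= max) {
            return 0;            // no value in [min, max] is greater than value
        }
        if (min >= value) {
            return numRows;      // per the issue: treat as all rows (an upper bound when min == value)
        }
        return oneThird(numRows);
    }

    /** Range-aware estimate for "col < value" using column min/max (assumed logic). */
    public static long estimateLessThan(long numRows, long min, long max, long value) {
        if (value <= min) {
            return 0;            // no value in [min, max] is less than value
        }
        if (value > max) {
            return numRows;      // every value in [min, max] is less than value
        }
        return oneThird(numRows);
    }

    public static void main(String[] args) {
        long numRows = 1002, min = 200, max = 300;   // stats quoted in the issue

        // Behaviour reported in the issue: each comparison divides by three.
        long reported = oneThird(oneThird(numRows));

        // Range-aware sketch: "> 200" keeps all rows, "< 295" still falls back.
        long afterFirst = estimateGreaterThan(numRows, min, max, 200);
        long afterBoth = estimateLessThan(afterFirst, min, max, 295);

        System.out.println("reported estimate:    " + reported);
        System.out.println("range-aware estimate: " + afterBoth);
    }
}
```

With the statistics quoted above (1002 rows, min 200, max 300), the flat fallback gives 111 rows after both predicates, while this sketch gives 334, because only `s_number_employees < 295` still falls back to one third.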
[jira] [Updated] (HIVE-16290) Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
[ https://issues.apache.org/jira/browse/HIVE-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-16290:
    Resolution: Fixed
    Hadoop Flags: Reviewed
    Fix Version/s: 2.3.0
    Status: Resolved (was: Patch Available)

Thanks [~gopalv]. Committed to master.

> Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
>
> Key: HIVE-16290
> URL: https://issues.apache.org/jira/browse/HIVE-16290
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Minor
> Fix For: 2.3.0
> Attachments: HIVE-16290.1.patch, HIVE-16290.2.patch
[jira] [Updated] (HIVE-16290) Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
[ https://issues.apache.org/jira/browse/HIVE-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-16290:
    Attachment: HIVE-16290.2.patch

> Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
>
> Key: HIVE-16290
> URL: https://issues.apache.org/jira/browse/HIVE-16290
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Reporter: Rajesh Balamohan
> Assignee: Gopal V
> Priority: Minor
> Attachments: HIVE-16290.1.patch, HIVE-16290.2.patch
[jira] [Updated] (HIVE-16290) Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
[ https://issues.apache.org/jira/browse/HIVE-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-16290:
    Status: Patch Available (was: Open)

> Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
>
> Key: HIVE-16290
> URL: https://issues.apache.org/jira/browse/HIVE-16290
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Minor
> Attachments: HIVE-16290.1.patch
[jira] [Updated] (HIVE-16290) Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
[ https://issues.apache.org/jira/browse/HIVE-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-16290:
    Description:
[...] Currently, it returns {{numRows/3}}. This causes lesser number of reducers to be spun up in queries. E.g Q79 in TPC-DS. [...]

  was:
[...] Currently, it returns {{numRows/3}}. This lesser number of reducers to be spun up in queries. E.g Q79 in TPC-DS. [...]

> Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
>
> Key: HIVE-16290
> URL: https://issues.apache.org/jira/browse/HIVE-16290
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Reporter: Rajesh Balamohan
> Priority: Minor
> Attachments: HIVE-16290.1.patch
[jira] [Updated] (HIVE-16290) Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
[ https://issues.apache.org/jira/browse/HIVE-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-16290:
    Attachment: HIVE-16290.1.patch

> Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue
>
> Key: HIVE-16290
> URL: https://issues.apache.org/jira/browse/HIVE-16290
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Reporter: Rajesh Balamohan
> Priority: Minor
> Attachments: HIVE-16290.1.patch