Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 )
Change subject: IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate ...................................................................... Patch Set 46: (13 comments) Left some comments, but the code LGTM overall. Though I think I'll need to go over it a few more times. http://gerrit.cloudera.org:8080/#/c/16720/46//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16720/46//COMMIT_MSG@54 PS46, Line 54: 2 nit: 3 http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/exec/parquet/hdfs-parquet-scanner.h@591 PS46, Line 591: & nit: for output values we usually use pointer types. http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/runtime/date-value.cc File be/src/runtime/date-value.cc: http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/runtime/date-value.cc@370 PS46, Line 370: days_since_epoch_ - days nit: this expression is used 3 times in this function. Probably worth to extract it to a variable. http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/runtime/timestamp-value.cc@226 PS46, Line 226: date_ + boost::gregorian::date_duration(1), This doesn't work well if time duration is greater than a day. http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/util/min-max-filter-ir.cc File be/src/util/min-max-filter-ir.cc: http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/util/min-max-filter-ir.cc@32 PS46, Line 32: else { I think we shouldn't have this 'else', because if we only insert a single element into the min/max filter then we want to set both min_ and max_. http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/util/min-max-filter.h File be/src/util/min-max-filter.h: http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/util/min-max-filter.h@118 PS46, Line 118: max(filter_min, data_min) - min(filter_max, data_max) + 1 It's reversed, it should be min(filter_max, data_max) - max(filter_min, data_min) + 1 http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/util/min-max-filter.cc File be/src/util/min-max-filter.cc: http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/util/min-max-filter.cc@722 PS46, Line 722: /* If the filter completely covers the data range, return 1.0*/ : if (filter_min <= data_min && data_max <= filter_max) { : return 1.0; : } : double overlap_min = std::max(filter_min, data_min); : double overlap_max = std::min(filter_max, data_max); : return (float)((overlap_max - overlap_min + 1) / (data_max - data_min + 1)); This is the same for all the 3 branches, so it could be moved after the switch stmt. http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/util/min-max-filter.cc@735 PS46, Line 735: min4_ min8_? http://gerrit.cloudera.org:8080/#/c/16720/46/be/src/util/min-max-filter.cc@749 PS46, Line 749: min4_ min16_? http://gerrit.cloudera.org:8080/#/c/16720/46/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/16720/46/common/thrift/ImpalaService.thrift@623 PS46, Line 623: greater than 0 to enable You could mention that we can only set it to a value between 0.0 and 1.0, and 1.0 means that the filter is always evaluated. http://gerrit.cloudera.org:8080/#/c/16720/46/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16720/46/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@686 PS46, Line 686: overla nit: overlap http://gerrit.cloudera.org:8080/#/c/16720/46/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@732 PS46, Line 732: difficult nit: word 'difficult' is redundant http://gerrit.cloudera.org:8080/#/c/16720/46/testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test File testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test: http://gerrit.cloudera.org:8080/#/c/16720/46/testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test@a42 PS46, Line 42: Why are the distributed plans being removed? -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 46 Gerrit-Owner: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 12 Jan 2021 16:14:22 +0000 Gerrit-HasComments: Yes