lidavidm commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626783124



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -592,18 +625,34 @@ Result<std::vector<int>> 
ParquetFileFragment::FilterRowGroups(
     }
   }
 
-  std::vector<int> row_groups;
+  std::vector<compute::Expression> row_groups(row_groups_->size());
   for (size_t i = 0; i < row_groups_->size(); ++i) {
     ARROW_ASSIGN_OR_RAISE(auto row_group_predicate,
                           SimplifyWithGuarantee(predicate, 
statistics_expressions_[i]));
-    if (row_group_predicate.IsSatisfiable()) {
-      row_groups.push_back(row_groups_->at(i));
-    }
+    row_groups[i] = std::move(row_group_predicate);
   }
-
   return row_groups;
 }
 
+Result<util::optional<int64_t>> ParquetFileFragment::TryCountRows(
+    compute::Expression predicate) {
+  DCHECK_NE(metadata_, nullptr);
+  if (ExpressionHasFieldRefs(predicate)) {
+    ARROW_ASSIGN_OR_RAISE(auto expressions, 
TestRowGroups(std::move(predicate)));
+    int64_t rows = 0;
+    for (size_t i = 0; i < row_groups_->size(); i++) {
+      // Unless the row group is entirely included, bail out of fast path
+      if (expressions[i] == compute::literal(false)) continue;

Review comment:
       Got it, thanks. I added this case and the case above. However, you still 
can't count rows with a predicate like `is_null(i64)` because there's no 
information about columns being non-null. (And even if it's added, the 
expression simplification can't handle it because it extracts a table of what 
fields _are_ what values, but not what values fields _are not_.)
   
   I also fixed a small bug: in column statistics, the total number of rows is 
`num_values() + null_count()` (so emit `is_null` if `null_count() > 0 && 
num_values == 0`).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to