[GitHub] [arrow] bkietz commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

GitBox Wed, 05 May 2021 10:07:47 -0700


bkietz commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626746016




##########
File path: cpp/src/arrow/dataset/file_parquet_test.cc
##########
@@ -220,6 +220,41 @@ TEST_F(TestParquetFileFormat, WriteRecordBatchReaderCustomOptions) {
                     *actual_schema);
 }
 
+TEST_F(TestParquetFileFormat, CountRows) { TestCountRows(); }
+
+TEST_F(TestParquetFileFormat, CountRowsPredicatePushdown) {
+  constexpr int64_t kNumRowGroups = 16;
+  constexpr int64_t kTotalNumRows = kNumRowGroups * (kNumRowGroups + 1) / 2;
+
+  // See PredicatePushdown test below for a description of the generated data
+  auto reader = ArithmeticDatasetFixture::GetRecordBatchReader(kNumRowGroups);
+  auto source = GetFileSource(reader.get());
+  auto options = std::make_shared<ScanOptions>();
+
+  auto fragment = MakeFragment(*source);
+
+  ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(kTotalNumRows),
+                            fragment->CountRows(literal(true), options));
+
+  for (int i = 1; i <= kNumRowGroups; i++) {
+    SCOPED_TRACE(i);
+    // The row group for which all values in column i64 == i has i rows
+    auto predicate = less_equal(field_ref("i64"), literal(i));
+    ASSERT_OK_AND_ASSIGN(predicate, predicate.Bind(*reader->schema()));
+    auto expected = i * (i + 1) / 2;
+    ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(expected),
+                              fragment->CountRows(predicate, options));
+
+    // N.B. SimplifyWithGuarantee can't handle simplifying (i64 == 1) against (i64 <= 1 &
+    // i64 >= 1) right now, but this works

Review comment:
       Instead of special-casing a degenerate case like this, it seems more reasonable to detect `min->Equals(max)` in `ColumnChunkStatisticsAsExpression` and emit `equal(field_expr, min)` in place of `and_(greater_equal(field_expr, min), less_equal(field_expr, max))`.
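
To illustrate the suggestion, here is a minimal, self-contained sketch. It is a toy model rather than Arrow code: `ChunkStats` and `StatisticsAsGuarantee` are hypothetical names, the statistics are plain `int64_t` values, and the guarantee is rendered as a string instead of a `compute::Expression`. The only point is the branch on `min == max`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy stand-in for column chunk statistics (hypothetical, not an Arrow type).
struct ChunkStats {
  int64_t min;
  int64_t max;
};

// Hypothetical helper: renders the guarantee implied by the statistics as text.
// When min == max every value in the chunk is that single value, so an equality
// is emitted instead of a pair of inequalities.
std::string StatisticsAsGuarantee(const std::string& field, const ChunkStats& stats) {
  assert(stats.min <= stats.max);
  if (stats.min == stats.max) {
    return "(" + field + " == " + std::to_string(stats.min) + ")";
  }
  return "((" + field + " >= " + std::to_string(stats.min) + ") and (" + field +
         " <= " + std::to_string(stats.max) + "))";
}
```

With an equality guarantee, a predicate such as `i64 == 1` can be matched against the guarantee directly, which is the simplification the N.B. comment in the test says is not handled for the range form.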

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -592,18 +625,34 @@ Result<std::vector<int>> ParquetFileFragment::FilterRowGroups(
     }
   }
 
-  std::vector<int> row_groups;
+  std::vector<compute::Expression> row_groups(row_groups_->size());
   for (size_t i = 0; i < row_groups_->size(); ++i) {
     ARROW_ASSIGN_OR_RAISE(auto row_group_predicate,
                           SimplifyWithGuarantee(predicate, statistics_expressions_[i]));
-    if (row_group_predicate.IsSatisfiable()) {
-      row_groups.push_back(row_groups_->at(i));
-    }
+    row_groups[i] = std::move(row_group_predicate);
   }
-
   return row_groups;
 }
 
+Result<util::optional<int64_t>> ParquetFileFragment::TryCountRows(
+    compute::Expression predicate) {
+  DCHECK_NE(metadata_, nullptr);
+  if (ExpressionHasFieldRefs(predicate)) {
+    ARROW_ASSIGN_OR_RAISE(auto expressions, TestRowGroups(std::move(predicate)));
+    int64_t rows = 0;
+    for (size_t i = 0; i < row_groups_->size(); i++) {
+      // Unless the row group is entirely included, bail out of fast path
+      if (expressions[i] == compute::literal(false)) continue;

Review comment:
       `expressions[i]` *could* be simplified to a literal null, for example if `i64` happened to be null throughout a row group. That case would make a good unit test too.
   ```suggestion
         // If a row group is entirely excluded, exclude its rows from the count
         if (!expressions[i].IsSatisfiable()) continue;
         // Unless the row group is entirely included, bail out of fast path
   ```
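
The difference between comparing against `literal(false)` and checking satisfiability can be sketched with a toy model. Everything below is an assumption made for illustration: `Simplified`, `RowGroup`, and this free-standing `TryCountRows` are hypothetical stand-ins (C++17, standard library only), and `IsSatisfiable` here only mimics the behavior described above, where a predicate that simplifies to literal false or literal null selects no rows.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Toy model of what the predicate simplified to for one row group.
enum class Simplified { kTrue, kFalse, kNull, kOther };

struct RowGroup {
  int64_t num_rows;
  Simplified predicate;
};

// Literal false and literal null both select no rows (modeled behavior).
bool IsSatisfiable(Simplified s) {
  return s != Simplified::kFalse && s != Simplified::kNull;
}

// Returns the row count when metadata alone decides it, otherwise nullopt.
std::optional<int64_t> TryCountRows(const std::vector<RowGroup>& row_groups) {
  int64_t rows = 0;
  for (const RowGroup& rg : row_groups) {
    // If a row group is entirely excluded, exclude its rows from the count.
    if (!IsSatisfiable(rg.predicate)) continue;
    // Unless the row group is entirely included, bail out of the fast path.
    if (rg.predicate != Simplified::kTrue) return std::nullopt;
    rows += rg.num_rows;
  }
  return rows;
}
```

In this model, a check against `kFalse` alone would send the all-null row group to the second test and abandon the fast path, even though that row group contributes zero rows.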



