xudong963 commented on code in PR #22024:
URL: https://github.com/apache/datafusion/pull/22024#discussion_r3224055862


##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -882,11 +887,33 @@ impl FiltersPreparedParquetOpen {
 
         // Determine which row groups to actually read. The idea is to skip
         // as many row groups as possible based on the metadata and query
-        let mut row_groups = RowGroupAccessPlanFilter::new(create_initial_plan(
+        let mut initial_plan = create_initial_plan(
             &prepared.file_name,
-            &prepared.extensions,
+            prepared.extensions.clone(),
             rg_metadata.len(),
-        )?);
+        )?;
+
+        // Apply optional row-group and row-range sampling now that we
+        // know the actual row-group count. Both calls are no-ops when
+        // their respective fraction is `None`. Selection is
+        // deterministic per `(partition_index, row_group_index,
+        // fraction, cluster_size)` so re-runs match. The execution
+        // `partition_index` is the stable per-file id we plumb in:
+        // it makes sampling reproducible across environments without
+        // depending on object-store paths, and decorrelates files
+        // assigned to different partitions.
+        prepared.sampling.apply_row_group_sampling(

Review Comment:
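   The code comment describes `partition_index` as a stable per-file id, but
the current plumbing passes `prepared.partition_index`, which is shared by
every file in that partition. Multiple files in the same partition therefore
collide: they get the same sampling key.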
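
   To make the collision concrete, here is a minimal sketch. It assumes the
selection is a hash of the key and the row-group index compared against the
fraction. `keep_row_group`, `selected`, and `file_index` are hypothetical names
for illustration, not the PR's actual code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the sampling decision (not the PR's code):
/// hash `(key, row_group_index)` and keep the row group when the hash,
/// mapped to [0, 1), falls below `fraction`.
fn keep_row_group<K: Hash>(key: &K, row_group_index: usize, fraction: f64) -> bool {
    let mut hasher = DefaultHasher::new();
    (key, row_group_index).hash(&mut hasher);
    // Use the top 53 bits so the value converts to f64 exactly and stays < 1.0.
    let unit = (hasher.finish() >> 11) as f64 / (1u64 << 53) as f64;
    unit < fraction
}

/// Row-group indices kept for one file under the given key.
fn selected<K: Hash>(key: &K, num_row_groups: usize, fraction: f64) -> Vec<usize> {
    (0..num_row_groups)
        .filter(|&rg| keep_row_group(key, rg, fraction))
        .collect()
}

fn main() {
    let partition_index = 3u64;

    // Two distinct files scanned by the same partition: the key is the same,
    // so both files keep exactly the same row-group indices.
    let file_a = selected(&partition_index, 16, 0.25);
    let file_b = selected(&partition_index, 16, 0.25);
    assert_eq!(file_a, file_b);

    // Assumption: a per-file component (hypothetical `file_index`) is mixed
    // into the key, so each file gets its own key.
    let file_a_keyed = selected(&(partition_index, 0u64), 16, 0.25);
    let file_b_keyed = selected(&(partition_index, 1u64), 16, 0.25);
    println!("shared key:   {file_a:?} / {file_b:?}");
    println!("per-file key: {file_a_keyed:?} / {file_b_keyed:?}");
}
```

   With the shared key, the two files always select identical row-group
indices. Whether to derive the per-file component from a file counter or some
other stable id is left open here.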



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

