avaskys opened a new issue, #15692:
URL: https://github.com/apache/iceberg/issues/15692

   ### Feature Request / Improvement
   
   Spark batch reads from Iceberg tables push down their filter expressions, 
enabling manifest-level pruning (via partition range summaries), file-level 
pruning (via column min/max statistics), and partition elimination. Spark 
Structured Streaming reads currently get none of these benefits. It would be 
valuable to support filter pushdown in the `MicroBatchStream` path as well.
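
For readers unfamiliar with statistics-based pruning, here is a minimal, self-contained sketch of the idea in Python. It is a toy model, not Iceberg code: `FileStats`, `may_contain_greater_than`, and `prune` are hypothetical names, and the predicate is assumed to be a simple `col > threshold` comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for one data file's min/max stats on a column."""
    path: str
    min_value: int
    max_value: int


def may_contain_greater_than(stats: FileStats, threshold: int) -> bool:
    # A file can hold a row with value > threshold only if its max exceeds it.
    return stats.max_value > threshold


def prune(files, threshold):
    # Keep only the files that might contain matching rows; skip the rest
    # without opening them.
    return [f for f in files if may_contain_greater_than(f, threshold)]


files = [
    FileStats("a.parquet", 0, 50),
    FileStats("b.parquet", 40, 120),
    FileStats("c.parquet", 200, 300),
]
kept = prune(files, 100)
print([f.path for f in kept])
```

In this model, `a.parquet` is skipped at planning time because its maximum value cannot satisfy the predicate. Pruning is conservative: `b.parquet` is kept even though only some of its rows may match.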
   
   Today, a streaming query like 
`.readStream.format("iceberg").load("t").filter("partition_col = 'foo'")` 
plans Spark tasks for, and reads, files across all partitions. Spark applies 
the filter only as a post-read record filter. For streaming reads with 
partition filters, this can cause significant unnecessary I/O, task overhead, 
and compute cost.
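
The cost difference can be illustrated with a small toy model in Python. This is a sketch under stated assumptions, not connector code: `DataFile`, `read_then_filter`, and `prune_then_read` are hypothetical names, each file is assumed to belong to exactly one partition, and "opening" a file stands in for scheduling a task and reading it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical data file: one partition value plus its rows."""
    path: str
    partition: str
    rows: tuple  # each row is (partition_col, value)


def read_then_filter(files, wanted):
    # Models the current streaming behavior: every file is opened and the
    # filter is applied to each record after it has been read.
    opened = 0
    out = []
    for f in files:
        opened += 1
        out.extend(r for r in f.rows if r[0] == wanted)
    return out, opened


def prune_then_read(files, wanted):
    # Models filter pushdown: files in non-matching partitions are dropped
    # during planning and never opened.
    selected = [f for f in files if f.partition == wanted]
    out = [r for f in selected for r in f.rows]
    return out, len(selected)


files = [
    DataFile("f1", "foo", (("foo", 1), ("foo", 2))),
    DataFile("f2", "bar", (("bar", 3),)),
    DataFile("f3", "baz", (("baz", 4),)),
]
rows_a, opened_a = read_then_filter(files, "foo")
rows_b, opened_b = prune_then_read(files, "foo")
print(opened_a, opened_b)
```

Both paths return the same rows, but the post-read filter opens all three files while the pruned plan opens one. With many partitions the gap grows in proportion to the number of non-matching files.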
   
   The core API already supports this. `IncrementalAppendScan` inherits 
`filter(Expression)` from the `Scan` interface, and `BaseIncrementalAppendScan` 
correctly threads it to `ManifestGroup.filterData()` for the full pruning 
pipeline. The gap is in the Spark connector: `SparkScan.toMicroBatchStream()` 
does not pass filter expressions to `SparkMicroBatchStream`, so they are never 
applied.
   
   Closing this gap would bring streaming reads to parity with batch reads for 
filter pushdown, benefiting both partition-based and column statistics-based 
pruning.
   
   Affects all maintained Spark connector versions: v3.4, v3.5, v4.0, v4.1.
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [x] I can contribute this improvement/feature independently
   - [x] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

