Github user sadikovi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19810#discussion_r154815012
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -193,38 +195,68 @@ case class InMemoryTableScanExec(
     
       private val inMemoryPartitionPruningEnabled = sqlContext.conf.inMemoryPartitionPruning
     
    +  private def doFilterCachedBatches(
    +      rdd: RDD[CachedBatch],
    +      partitionStatsSchema: Seq[AttributeReference]): RDD[CachedBatch] = {
    +    val schemaIndex = partitionStatsSchema.zipWithIndex
    +    rdd.mapPartitionsWithIndex {
    +      case (partitionIndex, cachedBatches) =>
    +        if (inMemoryPartitionPruningEnabled) {
    +          cachedBatches.filter { cachedBatch =>
    +            val partitionFilter = newPredicate(
    +              partitionFilters.reduceOption(And).getOrElse(Literal(true)),
    +              partitionStatsSchema)
    +            partitionFilter.initialize(partitionIndex)
    +            if (!partitionFilter.eval(cachedBatch.stats)) {
    --- End diff ---
    
    Are there any issues with discarding a partition based on statistics that could be partially computed (e.g. when the total size in bytes of a partition iterator is larger than the configurable batch size), as per https://github.com/apache/spark/pull/19810/files#diff-5fc188468d3066580ea9a766114b8f1dR74?
    
    Would it be beneficial to record such a situation by logging it and still include the partition when statistics are partially computed and the filters evaluate to false, or to discard all statistics when some of the partitions hit this situation? Thanks!
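
    To make the scenario concrete, below is a minimal, self-contained sketch. The types `Batch` and `BatchStats` and the completeness flag are hypothetical simplifications for illustration only, not the actual Spark classes:

        // Hypothetical simplified types: the stats carry a flag recording
        // whether they were computed over every row of the batch.
        case class BatchStats(min: Int, max: Int, complete: Boolean)
        case class Batch(rows: Seq[Int], stats: BatchStats)

        object PartialStatsPruning extends App {
          // Suppose the stats were finalized after the first two rows (e.g.
          // a size threshold was hit), so max = 20 does not cover row 99.
          val batch = Batch(
            rows = Seq(10, 20, 99),
            stats = BatchStats(min = 10, max = 20, complete = false))

          // Pruning with the predicate "value > 50" against the stale max
          // would discard the batch even though a matching row exists.
          val prunedAway = batch.stats.max <= 50
          println(s"pruned away: $prunedAway; " +
            s"matching rows present: ${batch.rows.exists(_ > 50)}")

          // The conservative option raised above: only prune when the stats
          // are known to be complete, and log the partial case instead.
          if (!batch.stats.complete) println("WARN: partial stats, keeping batch")
          val safeToPrune = batch.stats.complete && batch.stats.max <= 50
          println(s"safe to prune: $safeToPrune")
        }

    Under those assumptions the batch would be wrongly discarded unless the completeness of the statistics is taken into account.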

