AngersZhuuuu commented on pull request #28805:
URL: https://github.com/apache/spark/pull/28805#issuecomment-646445716


   > The CNF process should break down `dt = 20190626 and id in (1,2,3)` to `Seq((dt = 20190626), (id in (1,2,3)))`, and then these two sub-predicates will be processed in `groupExpressionsByQualifier`. What is the problem here?
   
   In the current partition pruning, `ScanOperation` gets its predicates from `splitConjunctivePredicates`.
   A predicate such as ```(dt = 1 or (dt = 2 and id = 3))``` is not separated, because it has no top-level `and`. Since its references contain both `id` and `dt`, it is not pushed down as a partition predicate, so the scan reads all the data in the partitioned table.
   ```
   object HiveTableScans extends Strategy {
     def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
       case ScanOperation(projectList, predicates, relation: HiveTableRelation) =>
         // Filter out all predicates that only deal with partition keys, these are given to the
         // hive table scan operator to be used for partition pruning.
         val partitionKeyIds = AttributeSet(relation.partitionCols)
         val (pruningPredicates, otherPredicates) = predicates.partition { predicate =>
           !predicate.references.isEmpty &&
           predicate.references.subsetOf(partitionKeyIds)
         }

         pruneFilterProject(
           projectList,
           otherPredicates,
           identity[Seq[Expression]],
           HiveTableScanExec(_, relation, pruningPredicates)(sparkSession)) :: Nil
       case _ =>
         Nil
     }
   }
   ```
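
To make the failure concrete, here is a minimal sketch that does not depend on Spark. `Expr`, `EqualTo`, `And`, `Or`, `splitConjunctive` and `partitionOnly` are hypothetical stand-ins for the Catalyst classes and helpers, written only to mirror the check in `HiveTableScans` above; they are not Spark's actual implementation.

```scala
object PruningSketch {
  // Hypothetical stand-ins for Catalyst expressions (assumption, not Spark's classes).
  sealed trait Expr { def references: Set[String] }
  case class EqualTo(col: String, value: Int) extends Expr {
    def references: Set[String] = Set(col)
  }
  case class And(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }
  case class Or(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }

  // Split only on top-level ANDs, in the spirit of splitConjunctivePredicates.
  def splitConjunctive(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjunctive(l) ++ splitConjunctive(r)
    case other     => Seq(other)
  }

  // Same test as in the strategy above: non-empty references, all partition columns.
  def partitionOnly(e: Expr, partitionCols: Set[String]): Boolean =
    e.references.nonEmpty && e.references.subsetOf(partitionCols)

  val partitionCols: Set[String] = Set("dt")

  // (dt = 1 or (dt = 2 and id = 3))
  val filter: Expr = Or(EqualTo("dt", 1), And(EqualTo("dt", 2), EqualTo("id", 3)))

  // The OR is the root, so the split returns the filter unchanged, and it
  // references both dt and id, so it lands in `other`.
  val (pruning, other) =
    splitConjunctive(filter).partition(partitionOnly(_, partitionCols))
}
```

In this sketch `pruning` is empty and `other` holds the whole filter, which corresponds to the case where no partition predicate reaches the scan.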
   
   After conversion to CNF, ```(dt = 1 or (dt = 2 and id = 3))``` becomes ```(dt = 1 or dt = 2) and (dt = 1 or id = 3)```. `splitConjunctivePredicates` can then split it into two expressions, ```(dt = 1 or dt = 2)``` and ```(dt = 1 or id = 3)```, and ```(dt = 1 or dt = 2)``` can be pushed down as a partition pruning predicate.
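
The conversion can be sketched as follows, again without Spark. All names here (`CnfSketch`, `toCnf`, `distribute`, and the small expression types) are hypothetical and only illustrate the textbook rule of distributing `or` over `and`; the sketch ignores `not` and the size blow-up that CNF can cause, and it is not the code in this PR.

```scala
object CnfSketch {
  // Hypothetical stand-ins for Catalyst expressions (assumption, not Spark's classes).
  sealed trait Expr { def references: Set[String] }
  case class EqualTo(col: String, value: Int) extends Expr {
    def references: Set[String] = Set(col)
  }
  case class And(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }
  case class Or(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }

  // Combine two CNF operands under OR: a or (b and c) => (a or b) and (a or c).
  private def distribute(l: Expr, r: Expr): Expr = (l, r) match {
    case (And(a, b), _) => And(distribute(a, r), distribute(b, r))
    case (_, And(a, b)) => And(distribute(l, a), distribute(l, b))
    case _              => Or(l, r)
  }

  // Convert bottom-up so that both operands are already in CNF when distributed.
  def toCnf(e: Expr): Expr = e match {
    case And(l, r) => And(toCnf(l), toCnf(r))
    case Or(l, r)  => distribute(toCnf(l), toCnf(r))
    case leaf      => leaf
  }

  // Split only on top-level ANDs, in the spirit of splitConjunctivePredicates.
  def splitConjunctive(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjunctive(l) ++ splitConjunctive(r)
    case other     => Seq(other)
  }

  val partitionCols: Set[String] = Set("dt")

  // (dt = 1 or (dt = 2 and id = 3))
  val filter: Expr = Or(EqualTo("dt", 1), And(EqualTo("dt", 2), EqualTo("id", 3)))

  // After CNF the root is an AND, so the split yields two conjuncts.
  val conjuncts: Seq[Expr] = splitConjunctive(toCnf(filter))

  // Only the conjunct that references dt alone qualifies for partition pruning.
  val (pruning, other) = conjuncts.partition { p =>
    p.references.nonEmpty && p.references.subsetOf(partitionCols)
  }
}
```

Here `pruning` contains only `(dt = 1 or dt = 2)`, while `(dt = 1 or id = 3)` stays in `other` and still has to be evaluated as a regular filter.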


