ulysses-you commented on pull request #35789:
URL: https://github.com/apache/spark/pull/35789#issuecomment-1063664600


   > I have a question: why do we need Semi-Join if we have Bloom Filter?
   
   I guess it is a trade-off between benefits and costs. A Bloom filter has 
false positives, and the false positive rate gets worse when the data set is 
large. So if the creation side (from the design docs) is small enough to be 
broadcast, we can use a semi-join to get more benefit at less cost, since it is 
exact. This is similar to what DPP (dynamic partition pruning) does.
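
To make the false-positive point concrete, here is a rough sketch using the standard textbook approximation for a Bloom filter's false positive probability, p ≈ (1 - e^(-kn/m))^k. The bit budget `M_BITS`, the hash count `K_HASHES`, and the helper `bloom_fpp` are hypothetical values and names chosen for illustration; they are not taken from this PR or from Spark's actual sizing logic.

```python
import math


def bloom_fpp(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Textbook approximation of a Bloom filter's false positive probability."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Hypothetical fixed filter size and hash count (assumptions, not Spark defaults).
M_BITS = 8 * 1024 * 1024
K_HASHES = 5

# With the filter size held fixed, more inserted keys means more false positives.
rates = {n: bloom_fpp(n, M_BITS, K_HASHES) for n in (100_000, 1_000_000, 10_000_000)}
for n, p in rates.items():
    print(f"{n:>10} items -> approximate false positive rate {p:.6f}")
```

Under these assumptions the rate climbs steeply once the number of keys approaches the bit budget, whereas a semi-join against a broadcast build side filters exactly regardless of size.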
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
