ulysses-you commented on pull request #35789: URL: https://github.com/apache/spark/pull/35789#issuecomment-1063664600
> I have a question: why do we need Semi-Join if we have Bloom Filter? I guess it is a trade-off between benifits and costs. BloomFilter has false positives issue and it get worse if the data set is large. So if the creation side (from the design docs) is small enough which can be broadcast, we can use semi-join to get more benifits with less cost since it is accuracy. And It is something like dpp did. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org