alamb commented on issue #17718: URL: https://github.com/apache/datafusion/issues/17718#issuecomment-3338050011
More feedback from @zhangfengcdt in discord which I think is also really interesting:

Really nice discussion! When we implemented the KNN join in SedonaDB, we had similar concerns and did some experiments, and found that the marker function + optimizer rule pattern works well for our case.

There are two main challenges in our case:

(1) asymmetric KNN execution, meaning we need to build a spatial index on the build side and, for each left-side geometry, find the k nearest neighbors
(2) join order control, meaning we need to control the predicate evaluation order

For the first challenge, we register a stub scalar UDF, and in the query planner we detect ST_KNN predicates in join filters and transform them into a specialized SpatialJoinExec physical plan with KNN semantics.

For the second challenge, we add a barrier function that serves as an optimization barrier to prevent filter pushdown and control predicate evaluation order. This is critical for maintaining semantic correctness (KNN then filter vs. filter then KNN).

Both work well for the purpose. The marker function + optimizer rule pattern is indeed the most practical approach for adding custom join strategies to DataFusion/Arrow-based systems. It's more robust than it might seem because:

- The optimizer rule has full control over when to apply the transformation
- The stub function provides type checking and documentation
- It integrates naturally with SQL without parser modifications

We use a similar approach in Apache SedonaSpark as well. For reference, I would recommend these steps for custom joins:

1. Use marker functions for SQL integration
2. Implement robust pattern matching in optimizer rules
3. Provide optimization barriers when semantics are order-dependent
4. Document the transformation clearly for users
5. Consider providing both SQL and DataFrame APIs

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
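For readers unfamiliar with the marker function + optimizer rule pattern described in the comment, here is a minimal toy sketch of the idea. It deliberately does not use DataFusion or SedonaDB APIs: the node types (`Call`, `Scan`, `Join`, `Filter`, `KnnJoin`) and the functions `rewrite_knn` and `push_down_filter` are hypothetical names invented for illustration, and the lowercase `st_knn` and `barrier` strings only stand in for the marker and barrier functions mentioned above. The real implementation rewrites to a SpatialJoinExec physical plan; this sketch only shows the shape of the rewrite and why a barrier keeps "KNN then filter" from becoming "filter then KNN".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Call:
    """A function call in a predicate, e.g. st_knn(q.geom, pts.geom, 3)."""
    name: str
    args: Tuple[object, ...]


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    filter: Optional[Call]


@dataclass(frozen=True)
class Filter:
    input: object
    predicate: Call


@dataclass(frozen=True)
class KnnJoin:
    """Specialized join: for each probe row, find the k nearest build rows."""
    probe: object
    build: object
    k: int


def rewrite_knn(plan):
    """Toy planner rule: turn Join(filter=st_knn(l, r, k)) into KnnJoin."""
    if isinstance(plan, Filter):
        return Filter(rewrite_knn(plan.input), plan.predicate)
    if isinstance(plan, Join):
        left, right = rewrite_knn(plan.left), rewrite_knn(plan.right)
        f = plan.filter
        # The marker is matched by name and arity only; it is never evaluated.
        if isinstance(f, Call) and f.name == "st_knn" and len(f.args) == 3:
            return KnnJoin(probe=left, build=right, k=f.args[2])
        return Join(left, right, f)
    return plan


def push_down_filter(plan):
    """Toy pushdown: move a Filter onto the probe side of a KnnJoin,
    unless its predicate is wrapped in barrier(...)."""
    if isinstance(plan, Filter) and isinstance(plan.input, KnnJoin):
        if plan.predicate.name == "barrier":
            return plan  # keep "KNN, then filter"
        knn = plan.input
        # Without a barrier the order silently becomes "filter, then KNN".
        return KnnJoin(Filter(knn.probe, plan.predicate), knn.build, knn.k)
    return plan


knn_marker = Call("st_knn", ("q.geom", "pts.geom", 3))
join = Join(Scan("q"), Scan("pts"), knn_marker)
guarded = Filter(join, Call("barrier", ("q.kind = 'cafe'",)))
plain = Filter(join, Call("eq", ("q.kind", "cafe")))

print(push_down_filter(rewrite_knn(guarded)))
print(push_down_filter(rewrite_knn(plain)))
```

In this toy model the guarded plan keeps the `Filter` above the `KnnJoin`, while the unguarded one has its filter moved below the join onto the probe side, which is the reordering the barrier function is meant to prevent.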
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
