waynexia commented on pull request #792: URL: https://github.com/apache/arrow-datafusion/pull/792#issuecomment-920182797
Hi @houqp, here are some updates. The reproducer using the Rust DataFrame API is [here](https://github.com/apache/arrow-datafusion/pull/792/commits/64330bdbe794b91789cdf45cd2127fe5b418a1a7#diff-a119e5d1231fc6f2551e39bf9427ed1499e18905054632e441895684e372c7afR2162).

IMO the problem we are facing is that the `Filter` plan doesn't have its own schema. It also isn't required to come right after the plan it is going to filter (a bit weird to me; I haven't inspected how other systems behave), and that is what caused this problem. Taking the reproducer as an example, the optimizer uses the schema of the "input plan" to look up the expr's type and gets nothing. The `Filter` is actually applied to the table scan plan, and that's where the schema has to come from. Later, another optimizer, `FilterPushDown`, moves the `Filter` plan to right after the table scan plan.

I changed how the schema is obtained in https://github.com/apache/arrow-datafusion/pull/792/commits/64330bdbe794b91789cdf45cd2127fe5b418a1a7. It now collects all the schemas under the `Filter` plan and merges them into one for the lookup. This passes the tests (locally), but I'm wondering whether there are other approaches to achieve this. One that comes to mind is to place this optimizer after `FilterPushDown`, which may solve the problem since the `Filter` would then be in the "right" place.
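To make the schema-merging idea concrete, here is a minimal, self-contained sketch. It does not use DataFusion's real types: `Plan`, `DataType`, `Schema`, and every function name below are hypothetical stand-ins. The plan shape (a `Filter` above a `Projection` that drops the filtered column) is an assumption chosen to mimic a lookup that "gets nothing", and letting the first schema win on a name clash is also just an assumption.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins; NOT DataFusion's real API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

type Schema = Vec<(String, DataType)>;

#[derive(Debug)]
enum Plan {
    TableScan { schema: Schema },
    Projection { schema: Schema, input: Box<Plan> },
    // The filter carries no schema of its own, only the column it tests.
    Filter { column: String, input: Box<Plan> },
}

// Schema a plan exposes to its parent; a filter just forwards its input's.
fn output_schema(plan: &Plan) -> &Schema {
    match plan {
        Plan::TableScan { schema } | Plan::Projection { schema, .. } => schema,
        Plan::Filter { input, .. } => output_schema(input),
    }
}

// Copy fields into the merged map; the first occurrence of a name wins (assumption).
fn insert_all(schema: &Schema, merged: &mut HashMap<String, DataType>) {
    for (name, data_type) in schema {
        merged.entry(name.clone()).or_insert_with(|| data_type.clone());
    }
}

// Walk every plan under the given node and merge all schemas into one map.
fn merge_schemas(plan: &Plan, merged: &mut HashMap<String, DataType>) {
    match plan {
        Plan::TableScan { schema } => insert_all(schema, merged),
        Plan::Projection { schema, input } => {
            insert_all(schema, merged);
            merge_schemas(input, merged);
        }
        Plan::Filter { input, .. } => merge_schemas(input, merged),
    }
}

// Old behaviour: consult only the direct input's schema.
fn type_from_input(filter: &Plan) -> Option<DataType> {
    match filter {
        Plan::Filter { column, input } => output_schema(input)
            .iter()
            .find(|(name, _)| name == column)
            .map(|(_, data_type)| data_type.clone()),
        _ => None,
    }
}

// New behaviour: consult the merged schema of everything under the filter.
fn type_from_merged(filter: &Plan) -> Option<DataType> {
    match filter {
        Plan::Filter { column, input } => {
            let mut merged = HashMap::new();
            merge_schemas(input, &mut merged);
            merged.get(column).cloned()
        }
        _ => None,
    }
}

// Filter on "a" sitting above a projection that only keeps "b".
fn example_plan() -> Plan {
    let scan = Plan::TableScan {
        schema: vec![
            ("a".to_string(), DataType::Int64),
            ("b".to_string(), DataType::Utf8),
        ],
    };
    let projection = Plan::Projection {
        schema: vec![("b".to_string(), DataType::Utf8)],
        input: Box::new(scan),
    };
    Plan::Filter {
        column: "a".to_string(),
        input: Box::new(projection),
    }
}
```

With `example_plan()`, `type_from_input` finds nothing because the projection hides `a`, while `type_from_merged` reaches the table scan and resolves the type. The trade-off of merging is that name clashes between levels need some resolution rule, which is one reason reordering the optimizer might end up being the cleaner option.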
