GitHub user robo-todd edited a discussion: Join problems with custom TableProviders
I'm making an assessment if we can use DataFusion for our application which would fall into the "Custom Database" category use case. We will be creating lots of custom TableProviders for this work. (Thanks for the brilliantly flexible ways we can extend DataFusion!). My problem is that when I start to create joins with our custom TableProviders I get crippling performance problems that basically eliminate DataFusion for our use case. What is happening (from printing out what I get from running joins, etc.) is that the Inner (and all others I have tested, RightSemi, LeftSemi, etc.) returns a large number (18 or more) tiny record batches (like 1 row each) of results from joins pulling from our custom TableProviders. All the extra overhead and allocation is really killing performance. I have written essentially the same test using built-in MemTable providers and these tests perform a minimum of 10x faster for the same data sizes (in our case small batches < 64 at the moment), basically same data layouts, etc. and I've compared the constraints (each of our tables has a primary index in column 0 in all cases). When I run the same join logical plan structure in the MemTable test it returns single RecordBatches of a reasonable number of rows (for the small number of rows in this test). Whereas, the same join (as far as I can tell) on our custom TableProviders returns a giant pile of ~ 20 record batches of row size = 1 or maybe sometimes 2. When I compare just running logical scans pulling columns from our TableProviders they are comparable to MemTable performance. Any ideas on how to figure out what I'm doing wrong here? GitHub link: https://github.com/apache/datafusion/discussions/16981 ---- This is an automatically sent email for github@datafusion.apache.org. To unsubscribe, please send an email to: github-unsubscr...@datafusion.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org