PavloPolovyi commented on issue #4271: URL: https://github.com/apache/arrow-adbc/issues/4271#issuecomment-4333382552
> If the dictionary encoding is improving the performance that much, can you confirm where the bottleneck is and if it's the ADBC interaction or somewhere else?

The ADBC interaction is not the bottleneck. The warehouse fetch takes about the same on both paths (~3 s). After the fetch, the limiting factor is the total volume of data moving through our pipeline. The JDBC path converts the data into our in-house columnar format on the way out and dictionary-encodes strings as part of that conversion, so its payload ends up roughly 3× smaller (~40 MB vs ~120 MB on this query). Less data through the rest of the pipeline means faster end-to-end times.

On "JDBC has to convert Arrow → rows → back to columnar": true on paper, and it is exactly why ADBC wins on every workload where the wire representation is the same on both paths (though only by 20–30%). The string-heavy case is special because the JDBC path's column rebuild gives us dictionary encoding essentially for free. The ADBC path correctly skips the rebuild, but that also means dictionary encoding doesn't fall out of the pipeline naturally; we had to add a dedicated pass for it. On the JDBC path, the column-build step happens to be a natural place to categorize strings, and its cost is small relative to the size win.

Server-side dictionary encoding from Snowflake/Databricks would skip our user-side cast entirely and shrink the fetch payload too. Really appreciate you offering to push on that.
