zeroshade commented on issue #4271: URL: https://github.com/apache/arrow-adbc/issues/4271#issuecomment-4328798876
> 1M rows, mixed types. Snowflake is roughly tied with JDBC. Databricks ADBC is clearly slower - most of the time is spent waiting on the ADBC reader. Our best guess is that Databricks's CloudFetch path (results delivered through S3) is the bottleneck when running across regions, while JDBC streams directly via Thrift. But that's a separate conversation.

This is definitely something we should look into, but yeah, it's a separate conversation. CC @lidavidm

> 1M rows, strings - the topic of this discussion. On the JDBC side, our Java client dict-encodes strings before sending, so the wire bytes are all 2–3× smaller. With plain UTF-8 from ADBC, each of those steps does more work. Our own workaround that checks each column and casts it to a Dictionary gets Snowflake to ~1.6× JDBC on this test (down from ~3.2× without it) - much better, but still not even.

It's curious that JDBC would be faster in this case, since for Snowflake the data would have to be converted from Arrow -> rows for JDBC before you do your dict-encoding, which would then convert back to Arrow. Though I guess it would depend on how the dictionary casting/conversion is implemented (I'm not familiar enough with the Java implementation to know how good it is there).

If dictionary encoding is improving performance that much, can you confirm where the bottleneck is, and whether it's the ADBC interaction or somewhere else?

It would definitely be an improvement if we could convince Snowflake/Databricks to implement dictionary encoding on the server side, though. I'll see if I can do anything there.
