zeroshade commented on issue #4271:
URL: https://github.com/apache/arrow-adbc/issues/4271#issuecomment-4328798876

   > 1M rows, mixed types. Snowflake is roughly tied with JDBC. Databricks ADBC 
is clearly slower - most of the time is spent waiting on the ADBC reader. Our 
best guess is that Databricks's CloudFetch path (results delivered through S3) 
is the bottleneck when running across regions, while JDBC streams directly via 
Thrift. But that's a separate conversation.
   
   This is definitely something we should look into, but yeah, it's a separate 
conversation. CC @lidavidm
   
   > 1M rows, strings - the topic of this discussion. On the JDBC side, our 
Java client dictionary-encodes strings before sending, so the wire payload is 
2–3× smaller. With plain UTF-8 from ADBC, each of those downstream steps does 
more work. Our own workaround, which checks each column and casts it to a 
Dictionary, gets Snowflake to ~1.6× JDBC on this test (down from ~3.2× without 
it) - much better, but still not at parity.
   
   It's curious that JDBC would be faster in this case: for Snowflake, JDBC has 
to convert from Arrow to rows before your dictionary encoding runs, which then 
converts back to Arrow. Though I guess it would depend on how the dictionary 
casting/conversion is implemented (I'm not familiar enough with the Java 
implementation to know how efficient it is there). If dictionary encoding is 
improving performance that much, can you confirm where the bottleneck is - 
whether it's the ADBC interaction or somewhere else?
   
   It would definitely be an improvement if we could convince 
Snowflake/Databricks to implement dictionary encoding on the server side 
though. I'll see if I can do anything there.

