PavloPolovyi commented on issue #4271: URL: https://github.com/apache/arrow-adbc/issues/4271#issuecomment-4333382552
> If the dictionary encoding is improving the performance that much, can you confirm where the bottleneck is and if it's the ADBC interaction or somewhere else?

The ADBC interaction is not the bottleneck. The warehouse fetch takes about the same on both paths (~3 s). After the fetch, the limiting factor is the total volume of data moving through our pipeline. The JDBC path converts the data into our in-house columnar format on the way out and dictionary-encodes strings as part of that conversion, so its payload ends up roughly 3× smaller (~40 MB vs ~120 MB on this query). Less data through the rest of the pipeline means faster end-to-end times.

On "JDBC has to convert Arrow → rows → back to columnar": true on paper, and it is exactly why ADBC wins on every workload where the wire representation is the same on both paths (though only by 20–30%). The string-heavy case is special because the JDBC path's column rebuild gives us dictionary encoding essentially for free. The ADBC path correctly skips the rebuild, but that also means dictionary encoding doesn't fall out of the pipeline naturally; we had to add a dedicated pass for it. On the JDBC path, the column-build step happens to be a natural place to categorize strings, and its cost is small relative to the size win.

Server-side dictionary encoding from Snowflake/Databricks would skip our user-side cast entirely and shrink the fetch payload too. Really appreciate you offering to push on that.
