CurtHagenlocher commented on PR #3140:
URL: https://github.com/apache/arrow-adbc/pull/3140#issuecomment-3098031698

   I would say that there are three different ways of thinking about this, 
depending on who the user is. For a new user who wants to consume Databricks 
data into their new C# application, I imagine they would want the loaded data 
to represent Spark's view of the data as closely as possible. This would 
include preserving type information and values with as much fidelity as the 
Arrow format allows. For a user who is currently consuming data into a .NET 
application* via ODBC, I suspect they would -- at least initially -- want the 
loaded data to be as similar as possible to what the ODBC driver is returning. 
(This doesn't line up perfectly for reasons I'll get into.) Finally, for use 
specifically inside Power BI the user would want to get the same results 
whether they're using the connector with ODBC or with ADBC. The latter two are 
(at least at first blush) pretty well-aligned**, because in both cases there's 
some client code that's switching from ODBC to ADBC for which we want to minimize the transition costs.
   
   I'm not currently in a position to test the older server version via ODBC, as I've had to decommission the Databricks instance I was using (it wasn't compliant with internal security restrictions), but I would be extremely surprised if it were returning decimal data as a string. At a minimum, it
would need to report the type of the result column as being decimal in order to 
let the client application know what type it is. But the difference is that 
ODBC is able to report the type as decimal while still retaining an internal 
representation of the data as a string. That's because fetching the data with 
ODBC specifically requires that you say what format you want it returned as. So 
even if the internal buffer contains a string, the client application would see 
that the column type is SQL_DECIMAL and it would say "I want this data 
formatted as SQL_C_DECIMAL" and the driver would need to perform any necessary 
conversion. This possibility doesn't exist with ADBC because there is no similar distinction. The declared type has to be consistent with the type of the returned data buffer.
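
   To make the ODBC contract concrete, here's a minimal sketch using System.Data.Odbc (the DSN, table, and column are hypothetical): the driver can buffer the value however it likes internally, because the conversion happens when the client asks for a specific C type at fetch time.

```csharp
using System;
using System.Data.Odbc;

class OdbcDecimalFetch
{
    static void Main()
    {
        // Hypothetical DSN and query; the point is the type contract,
        // not the connection details.
        using var conn = new OdbcConnection("DSN=Databricks");
        conn.Open();

        using var cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT amount FROM sales"; // 'amount' is DECIMAL on the server

        using var reader = cmd.ExecuteReader();
        while (reader.Read())
        {
            // The column is reported as SQL_DECIMAL; asking for a decimal here is
            // effectively requesting SQL_C_NUMERIC, and the driver must convert
            // from whatever its internal representation is (possibly a string).
            decimal amount = reader.GetDecimal(0);
            Console.WriteLine(amount);
        }
    }
}
```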
   
   Earlier I had mentioned that the Power BI connector is doing some data/type 
translation. The context for this is that we ordinarily compute the expected 
type of the result set and then if the actual type doesn't match, the ADBC code 
in Power BI will inject a transformation. However, this only works when the 
user references tables in the catalog. In the scenario where the user supplies 
a native SQL query and we run it to discover the output schema, returning a decimal as a string will mean the original type is lost. This will make the data harder to work with in Power BI and would break backwards compatibility.
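
   In Arrow terms, that catalog-path check boils down to comparing the expected field type against what actually came back; a rough, hypothetical sketch of the idea (this is not the actual Power BI code):

```csharp
using Apache.Arrow;
using Apache.Arrow.Types;

static class SchemaReconciliation
{
    // Hypothetical sketch of the catalog path described above: the expected
    // schema comes from catalog metadata, and any field whose actual type
    // differs gets flagged so a conversion can be injected downstream. With a
    // native SQL query there is no expected schema, so nothing to compare.
    public static bool NeedsTransformation(Schema expected, Schema actual, string fieldName)
    {
        IArrowType want = expected.GetFieldByName(fieldName).DataType;
        IArrowType got = actual.GetFieldByName(fieldName).DataType;
        return want.TypeId != got.TypeId;
    }
}
```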
   
   Tl;dr: I'm afraid the driver will need to translate the data.
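
   For concreteness, a minimal sketch of what that translation might look like with the Arrow C# library, assuming the server hands decimals back as strings (the method name and parse path are my invention, not the actual driver code):

```csharp
using System.Globalization;
using Apache.Arrow;
using Apache.Arrow.Types;

static class DecimalTranslation
{
    // Re-materialize string-encoded decimals as a Decimal128Array so the
    // declared Arrow type and the returned data buffer agree.
    public static Decimal128Array TranslateToDecimal(StringArray source, int precision, int scale)
    {
        var builder = new Decimal128Array.Builder(new Decimal128Type(precision, scale));
        for (int i = 0; i < source.Length; i++)
        {
            if (source.IsNull(i))
            {
                builder.AppendNull();
            }
            else
            {
                // Note: System.Decimal only covers 28-29 significant digits, so a
                // production driver would need a wider parse path for DECIMAL(38).
                builder.Append(decimal.Parse(source.GetString(i), CultureInfo.InvariantCulture));
            }
        }
        return builder.Build();
    }
}
```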
   
   *Note that we still intend to make this driver work for non-.NET consumers 
by adding AOT compilation support. The main gap is some missing functionality 
in the core Arrow C# library for which there's a prototype implementation.
   
   **That said, I think we'd love to be able to represent nested lists, records 
or tables in Power BI as their native types, because the conversion of all 
structured data into JSON is both lossy and limiting in terms of the kinds of 
querying we can do against the data source.

