CurtHagenlocher commented on PR #3140: URL: https://github.com/apache/arrow-adbc/pull/3140#issuecomment-3098031698
I would say that there are three different ways of thinking about this, depending on who the user is. For a new user who wants to consume Databricks data into their new C# application, I imagine they would want the loaded data to represent Spark's view of the data as closely as possible. This would include preserving type information and values with as much fidelity as the Arrow format allows. For a user who is currently consuming data into a .NET application* via ODBC, I suspect they would -- at least initially -- want the loaded data to be as similar as possible to what the ODBC driver is returning. (This doesn't line up perfectly, for reasons I'll get into.) Finally, for use specifically inside Power BI, the user would want to get the same results whether they're using the connector with ODBC or with ADBC. The latter two are (at least at first blush) pretty well-aligned**, because in both cases there's some client code that's switching from ODBC to ADBC for which we want to minimize the transition costs.

I'm not currently in a position to test the older server version via ODBC, as I've had to decommission the Databricks instance I was using for testing because it wasn't compliant with internal security restrictions, but I would be extremely surprised if it were returning decimal data as a string. And at a minimum, it would need to report the type of the result column as being decimal in order to let the client application know what type it is.

But the difference is that ODBC is able to report the type as decimal while still retaining an internal representation of the data as a string. That's because fetching the data with ODBC specifically requires that you say what format you want it returned as. So even if the internal buffer contains a string, the client application would see that the column type is SQL_DECIMAL, it would say "I want this data formatted as SQL_C_DECIMAL", and the driver would need to perform any necessary conversion. (The first sketch at the end of this comment shows what this looks like from the consumer's side.) This possibility doesn't exist with ADBC, because there is no similar distinction: the declared type has to be consistent with the type of the returned data buffer.

Earlier I had mentioned that the Power BI connector is doing some data/type translation. The context for this is that we ordinarily compute the expected type of the result set, and if the actual type doesn't match, the ADBC code in Power BI will inject a transformation. However, this only works when the user references tables in the catalog. In the scenario where the user supplies a native SQL query and we run it to discover the output schema, returning a decimal as a string will mean that the original type is lost. This will make the data harder to work with in Power BI and would break backwards compatibility.

Tl;dr: I'm afraid the driver will need to translate the data (see the second sketch below).

*Note that we still intend to make this driver work for non-.NET consumers by adding AOT compilation support. The main gap is some missing functionality in the core Arrow C# library, for which there's a prototype implementation.

**That said, I think we'd love to be able to represent nested lists, records, or tables in Power BI as their native types, because the conversion of all structured data into JSON is both lossy and limiting in terms of the kinds of querying we can do against the data source.
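To make the ODBC half of that concrete, here's a rough sketch from the consumer's side using System.Data.Odbc. The connection string, table, and column are invented for illustration; the point is that the fetch itself names the desired representation:

```csharp
using System;
using System.Data.Odbc;

// Placeholder connection string, for illustration only.
var connectionString = "DSN=Databricks";

using var connection = new OdbcConnection(connectionString);
connection.Open();

// Hypothetical query whose result column is reported as SQL_DECIMAL.
using var command = new OdbcCommand("SELECT price FROM products", connection);
using var reader = command.ExecuteReader();
while (reader.Read())
{
    // Morally equivalent to binding the column as SQL_C_NUMERIC: even if the
    // driver's internal buffer holds the value as text, it has to convert it
    // here, because this is the representation the consumer asked for.
    decimal price = reader.GetDecimal(0);
    Console.WriteLine(price);
}
```

With ADBC there's no such step; the consumer simply receives whatever Arrow buffers the driver produced, so any conversion has to happen inside the driver.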
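And here's the second sketch: a minimal illustration, using the Arrow C# library, of the kind of translation I mean. The helper name and the precision/scale plumbing are invented; a real implementation would take precision and scale from the server's result metadata, and since System.Decimal tops out around 28-29 significant digits, full Decimal128(38, s) fidelity would need a wider-range parse:

```csharp
using System.Globalization;
using Apache.Arrow;
using Apache.Arrow.Types;

internal static class DecimalTranslation
{
    // Hypothetical helper: rewrite a string-encoded decimal column (as an
    // older server might return it) into a genuine Decimal128 column, so the
    // declared Arrow type stays consistent with the returned data buffer.
    public static Decimal128Array ConvertStringsToDecimal128(
        StringArray source, int precision, int scale)
    {
        var builder = new Decimal128Array.Builder(new Decimal128Type(precision, scale));
        for (int i = 0; i < source.Length; i++)
        {
            if (source.IsNull(i))
            {
                builder.AppendNull();
            }
            else
            {
                // Fine for values within System.Decimal's range; beyond that,
                // a custom parse into the 128-bit representation is needed.
                builder.Append(decimal.Parse(source.GetString(i),
                    CultureInfo.InvariantCulture));
            }
        }
        return builder.Build();
    }
}
```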