Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
lidavidm commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3095887811 We have a path for handling GetFlightInfo not returning a schema, so hopefully it's a small-ish change to opt to always ignore it: https://github.com/apache/arrow-adbc/blob/ebb2fd09adf9bc8d5a56846f1b53be15490abc25/go/adbc/driver/flightsql/record_reader.go#L86-L98 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
vandop commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3095833403 Yep, agreed; we are fixing it in Dremio as well. Given that GetFlightInfo may not return a schema (it is optional), enforcing it as the source of truth may be a stretch, especially if the subsequent DoGet calls can get the "real" schema and ensure it is consistent across the many calls. I'll try to take a stab at the change and check whether it is small enough, after we have the fix. If it becomes cumbersome I'll leave it as is, so that it doesn't add vendor-specific complexity. Thanks!
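To make the proposed option concrete, here is a minimal sketch of the decision being discussed: either treat the GetFlightInfo schema as authoritative, or take the schema from the first DoGet stream and only require that all streams agree with each other. This is not the arrow-adbc implementation. Schemas are modelled as plain tuples, and `resolve_schema`, `prefer_stream_schema` and `SchemaMismatch` are hypothetical names invented for the illustration.

```python
from typing import Optional, Sequence, Tuple

# Stand-in for an Arrow schema: a tuple of (name, type, nullable) triples.
Field = Tuple[str, str, bool]
Schema = Tuple[Field, ...]


class SchemaMismatch(Exception):
    """Raised when a DoGet stream disagrees with the expected schema."""


def resolve_schema(
    info_schema: Optional[Schema],
    stream_schemas: Sequence[Schema],
    prefer_stream_schema: bool = False,  # hypothetical option name
) -> Schema:
    """Pick the schema to report and check every stream against it."""
    if info_schema is None or prefer_stream_schema:
        # GetFlightInfo gave no schema, or we were asked to ignore it:
        # wait for the first DoGet stream and use its schema.
        if not stream_schemas:
            raise ValueError("no stream available to read a schema from")
        expected = stream_schemas[0]
    else:
        # Strict behaviour: the GetFlightInfo schema is the source of truth.
        expected = info_schema

    # In both modes, all DoGet streams must agree with the chosen schema.
    for index, schema in enumerate(stream_schemas):
        if schema != expected:
            raise SchemaMismatch(f"stream {index} does not match")
    return expected
```

With an inferred schema that declares a column non-nullable and streams that report it as nullable, the strict mode raises `SchemaMismatch`, while `prefer_stream_schema=True` returns the stream schema. Inconsistent streams are rejected in both modes.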
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
lidavidm commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3095269006 I think a PR to add an option to wait for the DoGet to get the schema might be OK then, depending on how complex that change is. Anything more complicated and I think we would want to fork a Dremio-specific driver. But I would still prefer that Dremio fix this on their end :)
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
vandop commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3069411747
> To be clear: does this mean that Dremio is at least consistent between different workers for DoGet calls?
Yes. The problem is only between GetFlightInfo, where Dremio tries to infer the schema without triggering execution of the query, and DoGet, where the actual execution happens and the real schema is returned.
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
lidavidm commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3069063622 Because I think a lot of the concern here is that different workers could return different schemas. If it's just the case that the DoGet schema is always consistent and correct, well, I'd still question how this arises in Dremio (and whether it's actually guaranteed to be consistent...) but that seems like a much smaller tweak to make.
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
lidavidm commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3069057252
> But answering the question about the rest of the drivers, no other driver JDBC AF/ODBC AF we have does enforce Schemas to be consistent across GetFlightInfo and DoGet calls.
To be clear: does this mean that Dremio is at least consistent between different workers for DoGet calls?
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
vandop commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3069007912 Thanks for the input. I've done a bit more digging; it seems that internally there are still some edge cases, though we may be able to avoid most of them. But answering the question about the rest of the drivers, no other driver JDBC AF/ODBC AF we have does enforce Schemas to be consistent across GetFlightInfo and DoGet calls. In the meantime, if this strictness is by design, then it makes sense that we would fork the driver or fix the server. I was mostly surprised that different drivers seem to behave differently with respect to this strictness. But maybe the fix is to work on the other drivers instead :)
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
zeroshade commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3063654670
> It may still be possible as an option (most of this functionality should be available in arrow-go already) but I'd like to see what @zeroshade thinks (should we move this to the mailing list to put in front of the other Flight SQL users?).
Personally, my view is similar to @lidavidm in that I'm curious why there isn't a consistent schema in the first place that Dremio workers could cast to before output. Similar to @CurtHagenlocher I would like to know how the Dremio ODBC driver avoids this issue also, given that as far as I'm aware it is also using Arrow Flight/FlightSQL under the hood and thus would run into the same problem. The issue with Snowflake that was brought up is that internally, they do not maintain a min/max over the entire result set to know what the final Arrow schema should be up front, and instead just use the smallest precision for a given chunk. In Dremio's case you are already using Arrow internally throughout the entire system, so everything should in theory map easily to Arrow types enabling the planner and workers to easily know what to cast to. If this can't be done on the server side, I'd prefer we simply create a *dremio* ADBC driver which performs any necessary casting / handling of inconsistencies rather than pushing this onto the generic Flight SQL situation. As David said, Flight SQL shouldn't *require* a client to perform casting and transformations, it should just require handling the Arrow memory format.
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
CurtHagenlocher commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3062332255 I'm curious about how this works for the Dremio ODBC driver (either with or without Flight); does it wait until it's read all the chunks before it reports the schema of the result set? The only somewhat-analogous thing I've seen is with Snowflake `NUMBER(N, 0)` columns where an Arrow-formatted data chunk might be represented as a number of lower precision. But the reported schema always contains the maximum precision and so the driver just needs to cast columns in individual chunks in order to assemble the final output. (And there's no FlightSQL in this picture; just Arrow-formatted data.) There's also some overlap with schema evolution in formats like Delta, where columns in the individual Parquet files don't have to match exactly the declared schema of the table. But they do have to be compatible with it, so a table with an `int` column can't have a Parquet file where that column is an `int64`.
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
lidavidm commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3062144737 As for decimal precision: there is no real solution, unless the client tries to cast everything to a common type. It may not always be able to determine a common type either, except perhaps by trying to get the schema from every worker first (and even then there may not be a common type possible depending on the precision/scale requirements), which would potentially tie up a lot of server resources and require some consideration of backpressure management. If you are talking about a Flight SQL client having to potentially dynamically unify _any_ parameterized/complex type then it gets more complicated, if we have to unify structs, timestamp units, and so on. (And if we're unifying structs, does that mean we synthesize null columns on the fly, if there's different field names for say list children, which field name gets picked, etc.) It may still be possible as an option (most of this functionality should be available in arrow-go already) but I'd like to see what @zeroshade thinks (should we move this to the mailing list to put in front of the other Flight SQL users?). Again, Dremio designed Flight SQL so I'm surprised they (you?) are running into these problems...I'm curious why Dremio apparently doesn't have a consistent schema for a single query/why the Dremio workers can't perform the required casting. I always saw Flight SQL as moving towards a relatively thin client (with most of the "smarts" focusing on retries etc.), and this pushes it back towards requiring a thick client (and at the very least, requiring an Arrow implementation capable of various casts and data transforms and not just a memory format implementation).
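The point that a common decimal type may not exist can be illustrated with a short sketch. It assumes the usual widening rule (keep the larger scale and the larger number of integer digits) and a 38-digit precision limit as for 128-bit decimals. The function names are hypothetical, and this is neither the arrow-go nor the ADBC API.

```python
from functools import reduce
from typing import Iterable, Optional, Tuple

# Stand-in for a decimal type: (precision, scale).
Decimal = Tuple[int, int]


def common_decimal(
    a: Decimal, b: Decimal, max_precision: int = 38
) -> Optional[Decimal]:
    """Return the narrowest decimal that holds both inputs, or None."""
    scale = max(a[1], b[1])
    # Digits needed to the left of the decimal point.
    integer_digits = max(a[0] - a[1], b[0] - b[1])
    precision = integer_digits + scale
    if precision > max_precision:
        # No single type can represent both without losing digits.
        return None
    return (precision, scale)


def common_decimal_all(
    types: Iterable[Decimal], max_precision: int = 38
) -> Optional[Decimal]:
    """Fold common_decimal over the types reported by every worker."""

    def step(acc: Optional[Decimal], t: Decimal) -> Optional[Decimal]:
        return None if acc is None else common_decimal(acc, t, max_precision)

    items = list(types)
    if not items:
        return None
    return reduce(step, items[1:], items[0])
```

For example, `common_decimal((10, 2), (12, 4))` gives `(12, 4)`, whereas `common_decimal((38, 0), (38, 10))` gives `None`, because 38 integer digits plus 10 fractional digits would need a precision of 48. Note that computing this up front requires knowing every worker's schema, which is the resource concern raised above.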
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
vandop commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3062049410 Nope, it is not only nullability; it is also precision and potentially other complex types. My concern is that the schema inferred by GetFlightInfo and the real schema come from two different paths, which can eventually surface these inconsistencies. While in Dremio we may be able to "make it better", I was wondering whether this is a Dremio-only thing. Also, as much as I can work around it, whether from custom clients or on the Dremio side, there are production implications for environments we don't control. PowerBI uses the Go ADBC driver, for instance, so the impact there will be a bit harder to work around (unless they/we fork the driver, given we can only forward-fix the server).
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
lidavidm commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3061734485 You could work around this by using ExecutePartitions and then directly invoking a Flight client to read each individual stream yourself
Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]
lidavidm commented on issue #3134: URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3061733405 Is this only nullability mismatches? Or is it other things, too? For nullability, I guess I wonder why Dremio doesn't just always return all fields and data as "nullable" up front to avoid this issue (and Dremio designed Flight SQL!) It would be rather concerning to simply ignore nullability, too. If we were to add a flag, it would be (IMO) a statement option that casts all fields (and child fields?) to `nullable`, and not something that just ignores nullability (otherwise you may end up with nulls in a schema that is declared not null) (but this may get wonky in the general case: for some types, like maps, some child fields must not be nullable, and possibly changing the nullability of an extension field is also dangerous).
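The "cast everything to nullable" idea, and why maps complicate it, can be sketched as follows. This is a rough illustration only: `Field` is a hypothetical stand-in for an Arrow field, the map layout (a single non-nullable `entries` struct holding a non-nullable key and a value) follows the Arrow format, and extension types are deliberately not handled.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for an Arrow field."""

    name: str
    type: str
    nullable: bool = True
    children: Tuple["Field", ...] = ()


def relax_nullability(field: Field) -> Field:
    """Mark a field and its children nullable where the format allows it."""
    if field.type == "map":
        # A map has one child, the entries struct, with a key and a value.
        (entries,) = field.children
        key, value = entries.children
        # The entries struct and the key must stay non-nullable, so only
        # the map itself and the value side are relaxed.
        new_entries = replace(
            entries, children=(key, relax_nullability(value))
        )
        return replace(field, nullable=True, children=(new_entries,))
    return replace(
        field,
        nullable=True,
        children=tuple(relax_nullability(c) for c in field.children),
    )
```

Applied to a non-nullable struct, every child becomes nullable. Applied to a map, the `entries` struct and the key keep their original flags. A real option would also need a policy for extension fields, which this sketch leaves open.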
