Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-21 Thread via GitHub


lidavidm commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3095887811

   We have a path for handling GetFlightInfo not returning a schema, so 
hopefully it's a small-ish change to opt to always ignore it:
   
   
https://github.com/apache/arrow-adbc/blob/ebb2fd09adf9bc8d5a56846f1b53be15490abc25/go/adbc/driver/flightsql/record_reader.go#L86-L98
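
   A minimal sketch of that choice, in Python with plain tuples standing in 
for Arrow schemas. It assumes the fallback path takes the schema from the first 
DoGet stream when GetFlightInfo carries none; `resolve_schema`, 
`open_first_stream` and the `ignore_info_schema` flag are hypothetical names for 
illustration, not the driver's API:

```python
from typing import Callable, Optional, Tuple

# A schema is modelled as a tuple of (name, type, nullable) triples.
Schema = Tuple[Tuple[str, str, bool], ...]


def resolve_schema(
    info_schema: Optional[Schema],
    open_first_stream: Callable[[], Schema],
    ignore_info_schema: bool = False,
) -> Schema:
    """Pick the schema that the result reader will report.

    info_schema:        schema from GetFlightInfo, or None if the server omitted it.
    open_first_stream:  stand-in for issuing DoGet on the first endpoint and
                        reading the schema at the head of that stream.
    ignore_info_schema: hypothetical option; when True the DoGet schema always wins.
    """
    if info_schema is not None and not ignore_info_schema:
        return info_schema
    return open_first_stream()


inferred: Schema = (("id", "int64", False),)  # what GetFlightInfo guessed
actual: Schema = (("id", "int64", True),)     # what DoGet really sends

print(resolve_schema(inferred, lambda: actual))
print(resolve_schema(None, lambda: actual))
print(resolve_schema(inferred, lambda: actual, ignore_info_schema=True))
```

   With the flag off, the inferred schema is still reported whenever the server 
supplies one, so existing behavior would be unchanged by default.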


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-21 Thread via GitHub


vandop commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3095833403

   Yep, agreed; we are fixing it on the Dremio side as well.
   
   Given that GetFlightInfo may not return a schema (it is optional), enforcing 
it as the source of truth may be a stretch, especially if the subsequent DoGet 
calls can get the "real" schema and ensure it is consistent across the many 
calls.
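
   To illustrate treating DoGet as the source of truth, here is a rough 
sketch in Python (toy tuples, not Arrow types): the first stream's schema 
becomes the reference and every later stream must match it. The names 
`check_streams_agree` and `SchemaMismatch` are made up for this example, and 
the zero-endpoint case is simply rejected here, whereas a real driver would 
need some other answer for it:

```python
from typing import Iterable, Optional, Tuple

Field = Tuple[str, str, bool]  # (name, type, nullable)
Schema = Tuple[Field, ...]


class SchemaMismatch(Exception):
    """Raised when two DoGet streams disagree about the schema."""


def check_streams_agree(stream_schemas: Iterable[Schema]) -> Schema:
    """Return the schema shared by all streams, or raise SchemaMismatch."""
    reference: Optional[Schema] = None
    for index, schema in enumerate(stream_schemas):
        if reference is None:
            reference = schema  # first DoGet stream defines the result schema
        elif schema != reference:
            raise SchemaMismatch(
                f"endpoint {index} returned {schema}, expected {reference}"
            )
    if reference is None:
        raise SchemaMismatch("no endpoints, so no DoGet schema is available")
    return reference


worker_a: Schema = (("id", "int64", True), ("name", "utf8", True))
worker_b: Schema = (("id", "int64", True), ("name", "utf8", True))
print(check_streams_agree([worker_a, worker_b]))
```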
   
   I'll take a stab at the change and check whether it is small enough once we 
have the fix. If it becomes cumbersome, I'll leave it as is so that it doesn't 
add vendor-specific complexity.
   
   Thanks!





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-20 Thread via GitHub


lidavidm commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3095269006

   I think a PR to add an option to wait for the DoGet to get the schema might 
be OK then, depending on how complex that change is. Anything more complicated 
and I think we would want to fork a Dremio-specific driver. But I would still 
prefer that Dremio fix this on their end :) 





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-14 Thread via GitHub


vandop commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3069411747

   > To be clear: does this mean that Dremio is at least consistent between 
different workers for DoGet calls?
   
   Yes.
   
   The mismatch is only between GetFlightInfo, where Dremio tries to infer the 
schema without executing the query, and DoGet, where the query actually 
executes and the real schema is returned.





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-14 Thread via GitHub


lidavidm commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3069063622

   Because I think a lot of the concern here is that different workers could 
return different schemas. If it's just the case that the DoGet schema is always 
consistent and correct, well, I'd still question how this arises in Dremio (and 
whether it's actually guaranteed to be consistent...) but that seems like a 
much smaller tweak to make.





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-14 Thread via GitHub


lidavidm commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3069057252

   > But answering the question about the rest of the drivers, no other driver 
JDBC AF/ODBC AF we have does enforce Schemas to be consistent across 
GetFlightInfo and DoGet calls.
   
   To be clear: does this mean that Dremio is at least consistent between 
different workers for DoGet calls? 





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-14 Thread via GitHub


vandop commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3069007912

   Thanks for the input. 
   I've done a bit more digging; it seems there are still some edge cases 
internally, though we may be able to avoid the mismatch for most of them. 
   
   But to answer the question about the rest of the drivers: none of the other 
drivers we have (JDBC AF/ODBC AF) enforces schema consistency across 
GetFlightInfo and DoGet calls. 
   
   In the meantime, if this strictness is by design, then it makes sense that we 
would fork the driver or fix the server. I was mostly surprised that different 
drivers seem to differ in how strict they are.
   
   But maybe the fix is to work on the other drivers instead :) 





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-11 Thread via GitHub


zeroshade commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3063654670

   > It may still be possible as an option (most of this functionality should 
be available in arrow-go already) but I'd like to see what @zeroshade thinks 
(should we move this to the mailing list to put in front of the other Flight 
SQL users?).
   
   Personally, my view is similar to @lidavidm's in that I'm curious why there 
isn't a consistent schema in the first place that Dremio workers could cast to 
before output. Like @CurtHagenlocher, I would also like to know how the Dremio 
ODBC driver avoids this issue, given that as far as I'm aware it also uses 
Arrow Flight/Flight SQL under the hood and thus would run into the same 
problem.
   
   The issue with Snowflake that was brought up is that internally, they do not 
maintain a min/max over the entire result set to know what the final Arrow 
schema should be up front, and instead just use the smallest precision for a 
given chunk. In Dremio's case you are already using Arrow internally throughout 
the entire system, so everything should in theory map easily to Arrow types 
enabling the planner and workers to easily know what to cast to.
   
   If this can't be done on the server side, I'd prefer we simply create a 
*dremio* ADBC driver which performs any necessary casting / handling of 
inconsistencies rather than pushing this onto the generic Flight SQL driver. 
As David said, Flight SQL shouldn't *require* a client to perform casting and 
transformations; it should just require handling the Arrow memory format.





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-11 Thread via GitHub


CurtHagenlocher commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3062332255

   I'm curious about how this works for the Dremio ODBC driver (either with or 
without Flight); does it wait until it's read all the chunks before it reports 
the schema of the result set?
   
   The only somewhat-analogous thing I've seen is with Snowflake `NUMBER(N, 0)` 
columns where an Arrow-formatted data chunk might be represented as a number of 
lower precision. But the reported schema always contains the maximum precision 
and so the driver just needs to cast columns in individual chunks in order to 
assemble the final output. (And there's no FlightSQL in this picture; just 
Arrow-formatted data.)
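
   As a rough illustration of that per-chunk cast (a toy model in Python, not 
the Snowflake driver's code): each chunk carries its own precision, the 
reported schema carries the maximum, and the driver relabels every chunk to 
the declared precision before concatenating. `Chunk`, `widen_chunk` and 
`assemble` are hypothetical names, and plain integers stand in for scale-0 
decimals:

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Chunk:
    precision: int                # decimal precision this chunk was encoded with
    values: List[Optional[int]]   # scale is 0, so plain integers are enough


def widen_chunk(chunk: Chunk, declared_precision: int) -> Chunk:
    """Relabel one chunk with the declared precision; the values are unchanged."""
    if chunk.precision > declared_precision:
        raise ValueError(
            f"chunk precision {chunk.precision} exceeds declared {declared_precision}"
        )
    return Chunk(declared_precision, list(chunk.values))


def assemble(chunks: Sequence[Chunk], declared_precision: int) -> Chunk:
    """Concatenate chunks after widening each to the declared precision."""
    values: List[Optional[int]] = []
    for chunk in chunks:
        values.extend(widen_chunk(chunk, declared_precision).values)
    return Chunk(declared_precision, values)


chunks = [Chunk(2, [1, 99]), Chunk(5, [12345, None])]
result = assemble(chunks, declared_precision=10)
print(result.precision, result.values)
```

   This only works because the declared precision is an upper bound known up 
front; widening never loses data, so no chunk has to be inspected before the 
schema is reported.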
   
   There's also some overlap with schema evolution in formats like Delta, where 
columns in the individual Parquet files don't have to match exactly the 
declared schema of the table. But they do have to be compatible with it, so a 
table with an `int` column can't have a Parquet file where that column is an 
`int64`.
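
   One plausible reading of "compatible" here, sketched in Python: a file 
column is acceptable when its integer type is no wider than the declared one, 
so it can be widened without loss. This is an assumption made for illustration 
and not Delta's actual rule set; `file_column_is_compatible` is a made-up 
helper, and `int` in the text above is taken to mean a 32-bit integer:

```python
# Bit widths of the signed integer types considered in this sketch.
INT_WIDTH = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}


def file_column_is_compatible(declared: str, in_file: str) -> bool:
    """True if values stored as `in_file` can be widened losslessly to `declared`."""
    return INT_WIDTH[in_file] <= INT_WIDTH[declared]


print(file_column_is_compatible("int32", "int16"))  # narrower file column
print(file_column_is_compatible("int32", "int64"))  # wider file column
```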





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-11 Thread via GitHub


lidavidm commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3062144737

   As for decimal precision: there is no real solution, unless the client tries 
to cast everything to a common type. It may not always be able to determine a 
common type either, except perhaps by trying to get the schema from every 
worker first (and even then there may not be a common type possible depending 
on the precision/scale requirements), which would potentially tie up a lot of 
server resources and require some consideration of backpressure management. 
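
   To make the "no common type" case concrete, here is a small Python sketch 
of one way to unify decimal types: keep the largest scale and the largest 
number of integer digits, then check the result against the maximum precision 
(38 for Arrow's decimal128, 76 for decimal256). `common_decimal` is a 
hypothetical helper, not an existing API, and a client would still need every 
worker's schema before it could call it:

```python
from typing import Iterable, Optional, Tuple

DecimalType = Tuple[int, int]  # (precision, scale)

DECIMAL128_MAX_PRECISION = 38


def common_decimal(
    types: Iterable[DecimalType],
    max_precision: int = DECIMAL128_MAX_PRECISION,
) -> Optional[DecimalType]:
    """Return a decimal type able to hold every input type, or None if none fits."""
    types = list(types)
    if not types:
        return None
    scale = max(s for _, s in types)                # keep every fractional digit
    integer_digits = max(p - s for p, s in types)   # keep every integer digit
    precision = integer_digits + scale
    if precision > max_precision:
        return None  # no single type of this width can represent all inputs
    return (precision, scale)


print(common_decimal([(10, 2), (12, 4)]))
print(common_decimal([(38, 0), (38, 10)]))
print(common_decimal([(38, 0), (38, 10)], max_precision=76))
```

   The second call shows the failure mode: decimal(38, 0) and decimal(38, 10) 
together need 48 digits, which does not fit in decimal128.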
   
   If you are talking about a Flight SQL client potentially having to 
dynamically unify _any_ parameterized/complex type, then it gets more 
complicated, since we would have to unify structs, timestamp units, and so on. 
(If we're unifying structs, does that mean we synthesize null columns on the 
fly? If there are different field names for, say, list children, which field 
name gets picked? And so on.) It may still be possible as an option (most of 
this functionality 
should be available in arrow-go already) but I'd like to see what @zeroshade 
thinks (should we move this to the mailing list to put in front of the other 
Flight SQL users?).
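
   For a sense of what struct unification would involve, here is a toy Python 
sketch of one possible policy (an assumption for illustration, not something 
settled in this thread): take the union of field names, mark a field nullable 
if it is missing on either side so that nulls can be synthesized for it, and 
refuse to unify when the types differ. `unify_structs` is a made-up name:

```python
from typing import Dict, Tuple

# field name -> (type name, nullable)
Struct = Dict[str, Tuple[str, bool]]


def unify_structs(left: Struct, right: Struct) -> Struct:
    """Union two struct layouts by field name, or raise TypeError on a conflict."""
    unified: Struct = {}
    names = list(left) + [name for name in right if name not in left]
    for name in names:
        if name in left and name in right:
            (left_type, left_null), (right_type, right_null) = left[name], right[name]
            if left_type != right_type:
                raise TypeError(f"field {name!r}: {left_type} vs {right_type}")
            unified[name] = (left_type, left_null or right_null)
        else:
            # Present on one side only: the other side must be filled with
            # nulls, so the unified field has to be nullable.
            type_name, _ = left[name] if name in left else right[name]
            unified[name] = (type_name, True)
    return unified


a: Struct = {"id": ("int64", False), "name": ("utf8", False)}
b: Struct = {"id": ("int64", True), "score": ("float64", False)}
print(unify_structs(a, b))
```

   Even this small policy leaves the questions above open: it matches fields 
by name only, and it gives up on any type difference rather than casting.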
   
   Again, Dremio designed Flight SQL so I'm surprised they (you?) are running 
into these problems...I'm curious why Dremio apparently doesn't have a 
consistent schema for a single query/why the Dremio workers can't perform the 
required casting. I always saw Flight SQL as moving towards a relatively thin 
client (with most of the "smarts" focusing on retries etc.), and this pushes it 
back towards requiring a thick client (and at the very least, requiring an 
Arrow implementation capable of various casts and data transforms and not just 
a memory format implementation).





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-11 Thread via GitHub


vandop commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3062049410

   Nope, it is not only nullability; it is also precision and potentially other 
complex types. 
   
   My concern is that the inferred schema from GetFlightInfo and the real 
schema come from two different code paths, which can eventually surface these 
inconsistencies. 
   
   While in Dremio we may be able to "make it better", I was wondering if this 
is a Dremio-only thing. 
   
   Also, as much as I can work around it, whether from custom clients or on the 
Dremio side, there are production implications for environments we don't 
control. PowerBI, for instance, uses the Go ADBC driver, so there the impact 
will be a bit harder to work around (unless they/we fork the driver, given that 
we can only forward-fix the server).
   





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-11 Thread via GitHub


lidavidm commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3061734485

   You could work around this by using ExecutePartitions and then directly 
invoking a Flight client to read each individual stream yourself.





Re: [I] GetFlightInfo vs DoGet schema enforcement strictness [arrow-adbc]

2025-07-11 Thread via GitHub


lidavidm commented on issue #3134:
URL: https://github.com/apache/arrow-adbc/issues/3134#issuecomment-3061733405

   Is this only nullability mismatches? Or is it other things, too?
   
   For nullability, I guess I wonder why Dremio doesn't just always return all 
fields and data as "nullable" up front to avoid this issue (and Dremio designed 
Flight SQL!)
   
   It would be rather concerning to simply ignore nullability, too. If we were 
to add a flag, it would be (IMO) a statement option that casts all fields (and 
child fields?) to `nullable`, and not something that just ignores nullability; 
otherwise you may end up with nulls in a schema that is declared not null. 
This may get wonky in the general case, though: for some types, like maps, 
some child fields must not be nullable, and changing the nullability of an 
extension field may also be dangerous.
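
   A sketch of what such a cast could look like, in Python with a toy `Field` 
class rather than real Arrow types. It recurses into child fields and leaves 
the parts of a map that the Arrow format requires to be non-nullable (the 
entries struct and the key) untouched. `as_nullable` is a hypothetical helper, 
and extension types are deliberately not handled:

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Field:
    name: str
    type: str                          # e.g. "int64", "struct", "list", "map"
    nullable: bool
    children: Tuple["Field", ...] = ()


def as_nullable(field: Field) -> Field:
    """Return a copy of `field` with it and its children marked nullable."""
    if field.type == "map":
        # A map has a single "entries" struct child holding key and value.
        (entries,) = field.children
        key, value = entries.children
        # The entries struct and the key stay non-nullable; only the value
        # and the map field itself are relaxed.
        entries = replace(entries, children=(key, as_nullable(value)))
        return replace(field, nullable=True, children=(entries,))
    children = tuple(as_nullable(child) for child in field.children)
    return replace(field, nullable=True, children=children)


tags = Field(
    "tags", "map", False,
    (Field("entries", "struct", False,
           (Field("key", "utf8", False), Field("value", "int32", False))),),
)
relaxed = as_nullable(tags)
print(relaxed.nullable, relaxed.children[0].children[1].nullable)
```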

