Hi Dobes, You've run into the classic drawback of runtime schema inference: if Drill never sees a column value in its first sample, then it has no way to "predict the future" and guess what type will eventually show up. So, Drill guesses "nullable INT" which turns out to almost always be wrong.
Some record readers pick the type on the very first row (a sample size of 1.) The newer JSON reader we're working on uses the first batch (a few thousand rows) as its sample size. Still, if you request "points", the reader is obligated to provide a column even if has to make something up. So, it makes up "nullable INT." This is the "black swan" problem of inductive reasoning: no matter how many empty values Drill sees, there could always be a non-empty value of some other type. Worse, one scan may see no value and choose "nullable INT" while another sees the actual value and chooses Float8. Now, some poor exchange receiver operator will see both types and have no clue what to do. This is why most DBs require a metastore (AKA data dictionary) to provide table descriptions. Instead of infering types, DBs define the types, often via the same spec that drives the generative process that created the data. Drill also has relatively new "provided schema" feature that helps with this issue in some (but not all) format plugins. But, it has not yet been added to Mongo (or any other storage plugin other than the file system plugin.) You could try a conditional cast: something like IF(sqlTypeOf(points) = `INT`, CAST(NULL AS FLOAT4), points) (I probably have the syntax a bit wrong.) This works if two different scans see the different types. But, it will fail if a single scan sees an empty value followed by a null value (which is exactly the case you describe) because the scan is trying to cope with the data before its even gotten to the Project operator where the IF would be applied. Sorry for the long post, but this is a difficult issue that has frustrated users for years. I recently posted a proposed solution design at [1] and would welcome feedback. Thanks, - Paul [1] https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model On Tuesday, February 25, 2020, 5:27:01 PM PST, Dobes Vandermeer <[email protected]> wrote: Hi, I was experimenting with the mongo storage system and I found that when I query a field that doesn't usually have any value, I get this error "You tried to write a Float8 type when you are using a ValueWriter of type NullableIntWriterImpl." Based on a bit of googling I found that this means drill has inferred the incorrect type for that field. I was hoping I could override the inferred type using CAST or something, but CAST didn't work. Is there a way to tell drill what type a field from mongodb is supposed to be? Example query: SELECT _id, CAST(points AS DOUBLE) FROM mongo.formative.answers AS answer WHERE answer.createdAt > DATE_SUB(current_timestamp, interval '1' day) LIMIT 100 In this case "points" isn't set on every row, so I guess drill assumes it is "NullableInt" when really it is should be considered a double. We also have many boolean fields that are not set by default that we would want to query. What's the standard workaround for this case?
