Hi Dobes, Thanks for the idea: a text-only approach would certainly solve one class of problems. As it turns out, CSV does read all columns as text.
The solution does not solve two other problems which have popped up from time to time. First, if file A contains columns a and b, and file B contains columns a, b, and c, then the CSV reader for A does not know to read column c. Some other operator in Drill will fill in the column, and will choose Nullable INT as the type. That could, however, be changed to choose Nullable Varchar. Second, if a batch of A columns is given to the client before Drill sees the first file B rows, then the client will see a schema of (a, b) followed by a schema of (a, b, c.) JDBC and ODBC clients can't handle this. The REST API handles this, but at the cost of buffering all rows in memory before sending, which causes its own issues, as someone recently noted. JSON does have an all-text mode which suffers from the same missing-column issue. Since Mongo uses JDBC, it should also support all-text mode. Mongo is a bit tricky because it allows an object syntax to specify scalars: {type: "int", value: 10} (IIRC). Drill does not read such values as strings, even in all-text mode. The result can be a type conversion error if some values use the extended from, others use the simple form. So, Mongo would need special handling. There was a start at fixing this before our committer reviewers dried up. We could dust off the work and finish it up. The all-text approach avoids type conversion issues because there is no type conversion. Users, however, want to do math and other operations which require numeric types. So, an inconvenience with the all-text approach is that the user must include CASTs in every query to convert the data to the proper type. Doing so makes queries more complex and slower. You can work around the CAST problem by creating a view: the view contains all the needed CASTs. The user references the view instead of the actual file. Of course, if you are going to define a view, you might as well go all the way and use the other approach you mentioned, which Drill started to support: tell the CSV or JSON reader what schema to expect. That way, the reader does the conversion (as for JSON), and does so reliably. Since the schema is known at read time, all batches have the same schema, which solves the "schema change" problem. The key challenge is that, to solve these problems, Drill needs information not available in the raw data. The question is: what is the most reliable, least complex way to supply that information? All-text mode, with pre-defined conversions, and a list of columns to expect would provide Drill with the needed information. One way to gather the information would be to extend to do a "sniff" pass over the data to infer a schema across the set of files to scan, work out any inconsistencies, then do the real scan with a reliable, complete schema. Doing the "sniff" pass for every query is slow; it would be better to do the "sniff" pass once and reuse the information. Drill tries to do this with Parquet metadata. AWS recognized this same problem. The AWS Glue product, based on Hive, will "sniff" your S3 data to infer a schema which is then stored in HMS and available for tools to use. Drill could tie into Glue to obtain the information so that the user need not create it. So, several issues to consider and several ways to address them. It does look like the key challenge is what you identified: to provide Drill with information not available in that first file record. Thanks, - Paul On Mon, Oct 26, 2020 at 1:42 AM Dobes Vandermeer <dob...@gmail.com> wrote: > Currently drill tries to infer schemas from data that doesn't come with > one, such as JSON, CSV, and mongoDB. However this doesn't work well if the > first N rows are missing values for fields - drill just assigns an > arbitrary type to fields that are only null and no type to fields that are > missing completely, then rejects values when it finds them later. > > What if you could instead query in a mode where each row is just given as > a string, and you use JSON functions to load the data out and convert or > cast it to the appropriate type? > > For JSON in particular it's common these days to provide functions that > extract data from a JSON string column. BigQuery and postgres are two good > examples. > > I think in many cases these JSON functions could be inspected by a driver > and still be used for filter push > down. > > Anyway, just an idea I had to approach the mongo schema problem that's a > bit different from trying to specify the schema up front. I think this > approach offers more flexibility to the user at the cost of more verbose > syntax and harder to optimize queries. >