paul-rogers commented on PR #2937: URL: https://github.com/apache/drill/pull/2937#issuecomment-2318972159
@rymarm, thanks for this fix. I'm a bit rusty on Drill, but let's see what I can sort out. This stuff is complex, so I'm going to throw a wall of text at you so we get on the same page.

First, some background. If memory serves, the Parquet reader was written quickly by an intern early on: it is so complex that few have summoned the effort to understand or improve it. So, digging into it is a classic "legacy code" experience. In the early days, every single format reader created its own way to write data to vectors. This led to all manner of bugs (getting the code right once is hard, doing it a dozen times is impossible). It also resulted in inconsistent behavior. To solve these problems (and to avoid Drill's memory fragmentation issues of the time), we created EVF (the extended vector framework) to unify how we write to vectors and how we solve schema consistency issues for readers. EVF replaced the older column writer mechanisms. By now, all Drill readers _except Parquet_ are based on EVF.

The Parquet reader is the lone holdout: since it still uses the old mechanisms, it has its own way to solve schema issues, its own way to handle unprojected columns, and it still suffers from the bugs in the original, fragile code. It looks like your PR tries to fix some of these. In particular, it may be that you're trying to fix a bug in Parquet that EVF solves for other readers. It would be great if your fix is consistent with EVF. I'll try to check this when I review the code. What we really need is for someone to take on a "weekend project" to rewrite the Parquet reader to use EVF so we have one complete schema mechanism rather than two inconsistent versions. (I can dream.)

Now let's look at the bug in question. You received a schema change error. This means that some operator in the DAG saw two different schemas for the same table. In particular, the SORT operator can't handle the case of, say, `(a: INT, b: VARCHAR, c: DOUBLE)` and `(a: INT, b: VARCHAR)` or `(a: INT, b: INT)`, even if our query only has `...ORDER BY a`.

There was discussion about using `UNION` columns, or dynamically reformatting data. But all of that was far too complex (and slow) to actually build. Most DB and query tools impose a schema at read time. That is, if we have files that have undergone schema evolution, as above, we work out at plan time that the common schema is, say, `(a: INT, b: VARCHAR, c: DOUBLE)`, with `c` being `OPTIONAL` (`NULLABLE` in standard DB parlance) since it appears in only some of the files. For example, Presto and Impala avoid the issue by reading into the correct common schema in the first place, rather than cleaning up the mess later in each downstream operator.

Of course, Drill's claim to fame is that it is schema-free: it works out the schema on the fly. That's all cool, except that Drill also supports operators that only work for a single schema. (Joins have the same issue.) That is, we promote a really cool feature, but we can't actually make it work. This is great marketing, but it causes frustration in reality.
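To make the failure mode concrete, here is roughly how it looks from SQL. The directory layout, file names, and workspace (`dfs`) are made up for illustration; the point is which columns the query projects:

```sql
-- Hypothetical directory read as one table:
--   /data/t/file1.parquet  schema: (a INT, b VARCHAR)
--   /data/t/file2.parquet  schema: (a INT, b VARCHAR, c DOUBLE)

-- Fine: every projected column exists, with the same type, in both files,
-- so the SORT sees a single schema.
SELECT a, b
FROM dfs.`/data/t`
ORDER BY a;

-- Risky: `c` is missing from file1, so the reader has to invent a type for
-- it (historically a nullable INT). When file2 then reports `c` as DOUBLE,
-- the schema "changes" mid-query and the SORT gives up.
SELECT a, b, c
FROM dfs.`/data/t`
ORDER BY a;
```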
And, as befits any bit of software that has a few miles on its tires, Drill has multiple ways to work around the schema issue. There is a "provided schema" feature that tells readers the common schema. The reader's job is then to map the actual file columns into that schema. The provided schema is supported in EVF, but not, alas, by the Parquet reader. Also, the planner never got around to supporting schemas, so provided schemas only work in rather special cases. There is also a Drill Metastore feature that emulates the Hive Metastore (HMS), but it does not seem to be in common use. The original way to manage schemas, for Parquet only, is to scan all the files and build up a JSON file that contains the union of all the schemas. I can't recall if this mechanism was supposed to provide type consistency, but it certainly could: we just look at the `c` column in all schemas of all Parquet files in a directory tree and work out a common type. We do this at plan time and send that type along to Parquet in the physical plan. I didn't work in that area, so my knowledge here is a bit light. It's worth a check.

Now, one might say, hold on, isn't there an easy answer? If we read file A and get one schema, then read file B and get another schema, can't we blend them on the fly? Indeed, in some cases this is possible, and EVF implements those cases. However, as I'm fond of saying, "Drill can't predict the future", so there are cases where we get it wrong. For example, the query asks for columns `(a, b, c)`. The first file has `(a, b)`, so the reader creates a dummy column for `c` that will hold all `NULL`s. But of what type? Drill traditionally chooses `INT`. Then we read file 2, which has `(a, b, c: VARCHAR)`. Now we realize that `c` should have been `VARCHAR`, but we've already sent a batch of rows downstream with the "wrong" `INT` type. Plus, remember, Drill is distributed. Those two files might have been read by different fragments running on different machines. They can't communicate until the rows come together in the sort, after which it is too late to fix the problem.

One other tricky item to remember: Drill is fastest when columns are `REQUIRED` (i.e. non-nullable, or `NOT NULL`). This avoids the `NULL` check on each column operation. **BUT**, Drill also allows the schema to vary, and we fill in missing columns with `NULL` values. So, unless we have some guarantee that column `c` actually exists in all files read by a query, we are pretty much forced to make `c` be `OPTIONAL` (nullable), even when reading a file that contains `c` as a `NOT NULL` column. Again, Drill can't predict the future, so we can't know whether some future file (in, perhaps, some other fragment on some other machine) will turn out not to have `c`.

There is a reason that SQL DBs, for 50+ years, have required a schema. It isn't that all those folks were backward; it is that SQL doesn't work without a common schema. (Apache Druid has a similar problem, but it solves it by treating all values as, essentially, untyped: the values change type on the fly as needed.) When faced with bugs of the kind here, it is important to sort out which are just "bad code" bugs and which are "this design just can't work" bugs.

Now, with that background, we can try to sort out the problem you are trying to solve. Does your test case have Parquet files with differing column sets or types? If so, let's identify exactly how the schemas differ, then we can discuss possible solutions.
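If it helps to pin that down, Drill has some type-introspection functions (`typeof`, `sqltypeof`, `modeof`, if I remember the names right) that you can point at each file individually. This is just a sketch with made-up paths and columns, so double-check the function names against the docs:

```sql
-- Check what each file actually reports for the columns the test projects.
-- (Paths and column names are placeholders; adjust to the real test files.)
SELECT typeof(a)    AS a_type,
       sqltypeof(b) AS b_sql_type,
       modeof(b)    AS b_mode
FROM dfs.`/data/t/file1.parquet`
LIMIT 1;

SELECT typeof(a)    AS a_type,
       sqltypeof(b) AS b_sql_type,
       modeof(b)    AS b_mode
FROM dfs.`/data/t/file2.parquet`
LIMIT 1;

-- If the two results disagree on a projected column's type (or one file
-- lacks the column entirely), we are in the "Drill can't predict the
-- future" territory described above.
```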
Looking at the specific line you pointed out, I'll hazard a guess as to what it is doing: it handles the case we discussed above. Suppose our SQL is `SELECT a, b FROM 'my-dir'`, and `my-dir` is a directory that contains two files. `file1.parquet` contains column `(a: VARCHAR)` and `file2.parquet` contains `(a: VARCHAR, b: VARCHAR)`. Both are read by the same reader (the scan is not distributed in this case). File read order is random.

Suppose we read them in the order `(file2, file1)`. We can set up a schema of `(a: VARCHAR, b: VARCHAR)`. When we read `file1`, we notice that `b` is missing, but that, when it appeared previously, it was `VARCHAR`, so we keep that type and fill with nulls. (This is what EVF does; I'll speculate that Parquet's reader does something similar.) Great.

What if we read in the other order? We see we want `(a, b)`, but we have only `(a: VARCHAR)`, so we create that vector. What about `b`? Well, we helpfully create `(b: OPTIONAL INT)`. That is exactly what the code that you pointed to does. When we then read `file2`, we see that `b` has "changed schema" to `VARCHAR`, so we throw away the `(a: VARCHAR, b: INT)` schema and start reading with `(a: VARCHAR, b: VARCHAR)`. This then blows up the SORT.

Given this, I don't think the problem is with the column name (which is what that referenced code change handles). The code change in question allowed handling a column name of the form `foo.a` where `foo` is not a MAP with `a` as a member, but just a (poorly chosen) name of a column. That is, the problem is probably not that the test Parquet file columns are actually named `foo.a` and `foo.b` (with dots). You can try changing the code and rerunning the test, but I suspect that the problem won't go away unless you happen to run a test that reverses the file read order. Instead, the problem may be with the part of that code that did not change: Drill is trying to predict the future, and predicting that when the missing column appears, it will be `INT`. It may be that Drill is predicting incorrectly. We need to use the "crystal ball" module to improve the prediction. Sadly, however, we never got around to writing that module. Hence the multiple attempts to manage schemas globally.

Where does this leave us? If you can pin things down to one very specific case, we can sort out whether it is a "bad code" bug or a "that just won't work given Drill's design" bug. In particular, reading Parquet files with inconsistent schemas, projecting the inconsistent columns, and adding a SORT won't work unless the missing columns turn out to be of type `INT` when they appear. You can get lucky with file read order and work distribution so that, sometimes, it does work. Relying on luck produces flaky tests, however. On the other hand, you can have as much inconsistency as you want, as long as the columns you project appear in all files and the types of those columns stay the same. Feel free to add as many other, inconsistent, columns as you like: just don't project them in queries with a SORT.

I'd suggest that, since Drill doesn't handle Parquet schema changes well (even though that is supposed to be Drill's compelling feature), we maybe should not test stuff that can't actually work. Test with files with a consistent schema instead. Or, if files have an inconsistent schema, test with the Parquet schema feature enabled, after doing a scan of all the files. (I forget the command: `ANALYZE` maybe?) It _may_ be that the Parquet schema feature unifies column types, but then, given the haste with which it was written way back when, maybe it doesn't. Or, to be more modern, test with the Drill Metastore enabled. (Rough sketches of both commands are below.) This stuff is horribly complex, and I may have missed the mark on this particular bug. But at least we're now on the same page.
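For reference, and with the caveat that I'm going from memory so the syntax may be a bit off (the table path is again a placeholder):

```sql
-- Old-style Parquet metadata cache: scans the directory tree and writes
-- the metadata summary files the planner can use.
REFRESH TABLE METADATA dfs.`/data/t`;

-- Newer Drill Metastore route: collect the table's metadata into the
-- Metastore. Requires the Metastore to be enabled in the configuration.
ANALYZE TABLE dfs.`/data/t` REFRESH METADATA;
```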