paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2318972159

   @rymarm, thanks for this fix. I'm a bit rusty on Drill, but let's see what I 
can sort out. This stuff is complex, so I'm going to throw a wall of text at 
you so we get on the same page.
   
   First some background. If memory serves, the Parquet reader was written 
quickly by an intern early on: it is so complex that few have summoned the 
effort to understand or improve it. So, digging into it is a classic "legacy 
code" experience.
   
   In the early days, every single format reader created its own way to write 
data to vectors. This led to all manner of bugs (getting the code right once
is hard, doing it a dozen times is impossible). It also resulted in 
inconsistent behavior. To solve these (and to avoid Drill's memory 
fragmentation issues of the time), we created EVF (extended vector framework) 
to unify how we write to vectors and how we solve schema consistency issues for 
readers. EVF replaced the older column writer mechanisms. By now, all Drill 
readers _except Parquet_ are based on EVF.
   
   However, the Parquet reader is special: it is the only one that still uses the original, fragile mechanisms. As a result, it has its own way to solve schema issues, its own way to handle unprojected columns, and it still suffers from the bugs in those original mechanisms. It looks like your PR tries to fix some of these. In particular, it may be that you're trying to fix a bug in Parquet that EVF already solves for other readers. It would be great if your fix is consistent with EVF. I'll try to check this when I review the code.
   
   What we really need is for someone to take on a "weekend project" to rewrite 
the Parquet reader to use EVF so we have one complete schema mechanism rather 
than two inconsistent versions. (I can dream.)
   
   Now let's look at the bug in question. You received a schema change error. 
This means that some operator in the DAG saw two different schemas for the same 
table. In particular, the SORT operator can't handle the case of, say, `(a: INT, b: VARCHAR, c: DOUBLE)` versus `(a: INT, b: VARCHAR)` or `(a: INT, b: INT)`, even if our query only has `...ORDER BY a`. There was discussion about using `UNION` columns, or dynamically reformatting data. But all of that was far too complex (and slow) to actually build.
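
   To make that concrete, here is a minimal sketch of the failing pattern (the directory and file names are hypothetical, and the exact error text varies by Drill version):

```sql
-- my-dir/file1.parquet has schema (a INT, b VARCHAR)
-- my-dir/file2.parquet has schema (a INT, b VARCHAR, c DOUBLE)
SELECT a, b, c
FROM dfs.`/path/to/my-dir`
ORDER BY a;
-- The SORT can fail with a schema-change error because the two files
-- produce batches with different schemas for the projected columns.
```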
   
   Most DB and query tools impose a schema at read time. That is, if we have 
files that have undergone schema evolution, as above, we work out at plan time 
that the common schema is, say `(a: INT, b: VARCHAR, c: DOUBLE)`, with `c` 
being `OPTIONAL` (`NULLABLE` in standard DB parlance) since it appears in only 
some of the files. For example, Presto and Impala avoid the issue by reading 
into the correct common schema in the first place, rather than cleaning up the 
mess later in each downstream operator.
   
   Of course, Drill's claim to fame is that it is schema-free: it works out the 
schema on the fly. That's all cool, except that Drill also supports operators 
that only work for a single schema. (Joins have the same issue.) That is, we 
promote a really cool feature, but we can't actually make it work. This is 
great marketing, but causes frustration in reality.
   
   And, as befits any bit of software that has a few miles on its tires, Drill has multiple ways to work around the schema issue. There is a "provided schema" feature that tells readers the common schema; the reader's job is then to map the actual file columns into that schema. The provided schema is supported in EVF, but not, alas, by the Parquet reader. Also, the planner never got around to supporting schemas, so provided schemas only work in rather special cases. There is also a Drill Metastore feature that emulates the Hive Metastore (HMS), but it doesn't seem to be in common use.
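
   For reference, and assuming I'm remembering the syntax correctly (worth checking against the docs), the provided-schema feature is exposed through `CREATE SCHEMA`, along these lines:

```sql
-- A minimal sketch; the workspace, table name, and column list are hypothetical.
CREATE OR REPLACE SCHEMA (a INT, b VARCHAR, c DOUBLE)
FOR TABLE dfs.tmp.`my_dir`;
-- Readers that honor the provided schema (the EVF-based ones, not Parquet)
-- can then coerce each file's columns to this common schema at read time.
```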
   
   The original way to manage schemas, for Parquet only, is to scan all the 
files and build up a JSON file that contains the union of all the schemas. I 
can't recall if this mechanism was supposed to provide type consistency, but it 
certainly could: we just look at the `c` column in all schemas of all Parquet 
files in a directory tree and work out a common type. We do this at plan time 
and send that type along to Parquet in the physical plan. I didn't work in that 
area, so my knowledge here is a bit light. It's worth a check.
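
   If I recall the command correctly (again, worth verifying), that Parquet metadata cache is built with `REFRESH TABLE METADATA`:

```sql
-- Scans the directory tree and writes the Parquet metadata cache files;
-- the path is hypothetical.
REFRESH TABLE METADATA dfs.`/path/to/my-dir`;
```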
   
   Now, one might say, hold on, isn't there an easy answer? If we read file A 
and get one schema, then read file B and get another schema, can't we blend 
them on the fly? Indeed, in some cases this is possible, and EVF implements 
those cases. However, as I'm fond of saying, "Drill can't predict the future",
so there are cases where we get it wrong. For example, the query asks for 
columns `(a, b, c)`. The first file has `(a, b)` and creates a dummy column for 
`c` that will hold all `NULL`s. But, of what type? Drill traditionally chooses 
`INT`. Then, we read file 2 that has `(a, b, c: VARCHAR)`. Now we realize that 
`c` should have been `VARCHAR`, but we've sent a batch of rows downstream 
already with the "wrong" `INT` type. 
   
   Plus, remember, Drill is distributed. Those two files might have been read 
by different fragments running on different machines. They can't communicate 
until the rows come together in the sort, after which it is too late to fix the 
problem.
   
   One other tricky item to remember: Drill is fastest when columns are 
`REQUIRED` (i.e. non-nullable or `NOT NULL`). This avoids the `NULL` check on 
each column operation. **BUT**, Drill also allows schema to vary, and we fill 
in missing columns with `NULL` values. So, unless we have some guarantee that 
column `c` actually exists in all files read by a query, we are pretty much 
forced to make `c` be `OPTIONAL` (nullable), even when reading a file that 
contains `c` as a `NOT NULL` column. Again, Drill can't predict the future, so we can't know whether some future reader (in, perhaps, some other fragment on some other machine) will hit a file that doesn't have `c`.
   
   There is a reason that SQL DBs, for 50+ years, have required a schema. It 
isn't that all those folks were backward, it is that SQL doesn't work without a 
common schema. (Apache Druid has a similar problem, but it solves it by 
treating all values as, essentially, untyped: the values change type on the fly 
as needed.) When faced with bugs of the kind here, it is important to sort out 
which are just "bad code" bugs and which are the "this design just can't work" 
bugs.
   
   Now, with that background, we can try to sort out the problem you are trying 
to solve. Does your test case have Parquet files with differing column sets or 
types? If so, let's identify exactly how the schemas differ, then we can 
discuss possible solutions.
   
   Looking at the specific line you pointed out, I'll hazard a guess as to what it is doing: it handles the case we discussed above. Suppose our SQL is `SELECT a, b FROM 'my-dir'`. Suppose `my-dir` is a directory that contains two files: `file1.parquet` contains column `(a: VARCHAR)` and `file2.parquet` contains `(a: VARCHAR, b: VARCHAR)`. Both are read by the same reader (the scan is not distributed in this case). File read order is random. Suppose we read them in
the order `(file2, file1)`. We can set up a schema of `(a: VARCHAR, b: 
VARCHAR)`. When we read `file1`, we notice that `b` is missing, but that, when it appeared previously, it was `VARCHAR`, so we keep that type and fill with nulls. (This is what EVF does; I'll speculate that Parquet's reader does something similar.) Great. What if we read them in the other order? We see we want `(a, b)`, but we have only `(a: VARCHAR)`, so we create that vector. What about `b`? Well, we helpfully create `(b: OPTIONAL INT)`. That is exactly what the code that you pointed to does. When we read `file2`, we see that `b` has "changed schema" to `VARCHAR`, so we throw away the `(a: VARCHAR, b: INT)` schema and start reading with `(a: VARCHAR, b: VARCHAR)`. This then blows up the SORT.
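
   One way to check that guess, assuming your build has the `typeof()` and `modeof()` functions (and with hypothetical paths), is to query the file that lacks the column and see what Drill materializes for it:

```sql
-- file1.parquet has only column a; see what Drill creates for the missing b
SELECT typeof(b) AS b_type, modeof(b) AS b_mode
FROM dfs.`/path/to/my-dir/file1.parquet`
LIMIT 1;
-- If the guess above is right, b shows up as a nullable INT
-- (some versions may report the type as NULL instead).
```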
   
   Given this, I don't think the problem is with the column name (which is what 
that referenced code change handles). The code change in question allowed 
handling a column name of the form `foo.a` where `foo.a` is not member `a` of a MAP `foo`, but rather just a (poorly chosen) column name that contains a dot. That is, the problem
is probably not that the test Parquet file columns are actually named `foo.a` 
and `foo.b` (with dots). You can try changing the code and rerunning the test, 
but I suspect that the problem won't go away unless you happen to run a test 
that reverses the file read order.
   
   Instead, the problem may be with the part of that code that did not change: 
Drill is trying to predict the future and predicting that when the missing 
column appears, it will be `INT`. It may be that Drill is predicting 
incorrectly. We need to use the "crystal ball" module to improve the 
prediction. Sadly, however, we never got around to writing that module. Hence 
the multiple attempts to manage schemas globally.
   
   Where does this leave us? If you can pin things down to one very specific 
case, we can sort out if it is a "bad code" bug or a "that just won't work 
given Drill's design" bug. In particular, reading Parquet files with 
inconsistent schemas, projecting the inconsistent columns, and adding a SORT 
won't work unless the missing columns will be of type `INT` when they appear. 
You can get lucky with file read order and work distribution so that, 
sometimes, it does work. Relying on luck produces flaky tests, however.
   
   On the other hand, you can have as much inconsistency as you want as long as 
the columns you project appear in all files and the type of those columns stays 
the same. Feel free to add as many other, inconsistent, columns as you like: 
just don't project them in queries with a SORT.
   
   I'd suggest that, since Drill doesn't handle Parquet schema changes well (even though schema flexibility is supposed to be Drill's compelling feature), we maybe should not test stuff that can't actually work. Test with files with a consistent schema instead. Or, if files have an inconsistent schema, test with the Parquet schema feature enabled and after doing a scan of all the files. (I forget the command: `ANALYZE` maybe?) It _may_ be that the Parquet schema feature unifies column types, but then, given the haste with which it was written way back when, maybe it doesn't. Or, to be more modern, test with the Drill Metastore enabled.
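
   If it helps, my recollection (worth verifying against the docs) is that the Metastore route looks roughly like this:

```sql
-- Enable the Drill Metastore, then collect metadata for the table;
-- the option name and path here are from memory, so double-check them.
SET `metastore.enabled` = true;
ANALYZE TABLE dfs.`/path/to/my-dir` REFRESH METADATA;
```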
   
   This stuff is horribly complex, and I may have missed the mark on this 
particular bug. But, at least we're now on the same page.

