ychernysh commented on PR #2937: URL: https://github.com/apache/drill/pull/2937#issuecomment-2320908961
Hi @paul-rogers, thanks for your comments! Let me clarify some points here:

> Given this, I don't think the problem is with the column name (which is what that referenced code change handles).

The thing is that regardless of whether we did or did not succeed in "guessing" the major type for the missing column, the solution won't work until we solve the backticks problem: instead of creating _the missing column_ (unquoted), we currently create _a brand-new column_ (quoted) which has nothing to do with the "real" missing one. Please see my `ORDER BY` example in [DRILL-8507](https://issues.apache.org/jira/browse/DRILL-8507) and the error from there:

```
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type.

Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], selectionVector=NONE]
Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` (INT:OPTIONAL)]], selectionVector=NONE]
```

Note that all the major types here are guessed correctly (`INT:OPTIONAL`), but we fail because of the different field counts. This happens because the incoming schema has two fields, quoted and unquoted, while it is supposed to have only one (unquoted, matching the previous schema). This is why I reported the issue as two separate JIRAs and divided the PR into separate commits: before proceeding to any type-guessing logic, we have to get rid of the quoting problem first.

> Also, note that this fix works ONLY in one direction (column appears, then disappears), and ONLY within a single thread: it can't solve the same problem if the two files are read in different threads and sent to the SORT to reconcile.
Thanks to [AbstractParquetRowGroupScan's schema](https://github.com/apache/drill/blob/5a3b1785c119a5e3986b6cb269a48ba76db94a3b/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetRowGroupScan.java#L46), which this (DRILL-8508) solution is based on, all readers in all fragments on all machines are aware of the overall table schema (constructed from all the parquet files to be read). The Foreman scans the footers of all the files, merges them into a single schema back in the planning phase, and then sends that schema to each fragment. This is exactly what gives us the opportunity to "see the future" at the reading phase: all the metadata (even for files yet to be read) is already available and merged into a single schema. With that, we are not just _propagating the old column type_, we are propagating _the certainly correct_ column type, and we **are sure** the type is correct because it was resolved back in the planner. So, in fact, the fix works in ALL directions and across ALL threads and machines.

> Further, we are changing the mode to OPTIONAL as required so we can fill the vector with NULL values. However, change of mode (i.e. nullability) is a schema change and will cause the SORT to fail. We have to have known, on the previous file, that the column will be missing in this file, so that we can create the original column as OPTIONAL.

Exactly, and that is why I added this _enforcing OPTIONAL_ logic. That is, even if a particular parquet reader is going to read a REQUIRED parquet column, we force it to put the data into a NULLABLE value vector, to keep its schema consistent with the files where the column is missing. Note that we are able to do so because _we know at the reading phase that the column is partially missing_, thanks to the aforementioned schema propagated to all readers.
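To make the quoting problem from DRILL-8507 concrete, here is a minimal sketch (illustrative only, not Drill's actual code; the class and method names are invented): if the missing-column logic looks up the quoted name `` `age` `` instead of the plain name `age`, it creates a second field and the two schemas above stop matching. Stripping one layer of surrounding backticks before the lookup collapses the two names into one.

```java
// Illustrative sketch, not Drill's real API: normalize a column name by
// stripping one layer of surrounding backticks, so a missing-column lookup
// for "`age`" resolves to the existing field "age" instead of creating a
// brand-new field literally named "`age`".
public final class ColumnNameSketch {

  private ColumnNameSketch() {}

  public static String stripQuoting(String name) {
    if (name.length() >= 2 && name.startsWith("`") && name.endsWith("`")) {
      return name.substring(1, name.length() - 1);
    }
    return name;
  }
}
```

With such a normalization in place, both `age` and `` `age` `` map to the single unquoted field, so the incoming batch keeps one field and the SORT no longer sees a schema change.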
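The planner-side merge plus the _enforcing OPTIONAL_ rule described above can be sketched roughly as follows (a hypothetical illustration; `FooterMergeSketch`, `Mode`, and `mergeFooters` are invented names, not Drill's real classes): the Foreman merges the column sets of all parquet footers, and any column that is missing from at least one file, or whose mode differs between files, is forced to OPTIONAL so that every reader builds a nullable vector and the SORT sees one stable schema.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the planner-side schema merge (invented names, not
// Drill's actual API). Modes model parquet repetition: REQUIRED (non-null)
// vs OPTIONAL (nullable).
public final class FooterMergeSketch {

  public enum Mode { REQUIRED, OPTIONAL }

  public static Map<String, Mode> mergeFooters(List<Map<String, Mode>> fileSchemas) {
    Map<String, Mode> merged = new LinkedHashMap<>();
    for (Map<String, Mode> schema : fileSchemas) {
      for (Map.Entry<String, Mode> e : schema.entrySet()) {
        // Conflicting modes across files degrade to OPTIONAL.
        merged.merge(e.getKey(), e.getValue(), (a, b) -> a == b ? a : Mode.OPTIONAL);
      }
    }
    // A column absent from any single file must be nullable everywhere,
    // even if every file that does contain it declares it REQUIRED.
    for (Map.Entry<String, Mode> e : merged.entrySet()) {
      String col = e.getKey();
      if (!fileSchemas.stream().allMatch(s -> s.containsKey(col))) {
        e.setValue(Mode.OPTIONAL);
      }
    }
    return merged;
  }
}
```

For example, merging one file with `age` REQUIRED and a second file that lacks `age` entirely yields `age` as OPTIONAL in the merged schema, which is exactly why every reader (including the one reading the REQUIRED parquet column) can create a nullable vector up front.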