paul-rogers commented on PR #2937: URL: https://github.com/apache/drill/pull/2937#issuecomment-2322288590
@ychernysh, thanks for the clarification of the column name issue. I agree that the name stored in the vector should not contain backticks. I am unaware of any code in the Foreman that scans all the Parquet files. Is that something new? Doing that in the Foreman places extreme load on the Foreman, causes remote reads in Hadoop, and will be slow. Drill relies on statistical averages to balance load: each node does about the same amount of work. Doing a large prescan in the Foreman invalidates this assumption. The feature is fine for Drill run as a single process on a single machine for a single user. It will cause hotspots and resource issues on a busy, distributed cluster. The old, original Parquet schema cache did, in fact, read all the files in a folder, but did so in a distributed fashion, and wrote the result to a JSON file. (But without locking, so that two updates at the same time led to corruption.) Are we talking about a new feature recently added in the last few years? Or, about the original Parquet schema cache? If this feature does exist, then we now have six ways that Drill handles schema (Parquet cache, provides schema, Drill Metastore, HMS, on-the-fly, and this new prescan of Parquet files). If a schema is available, then I agree with your analysis. The planner should detect column type conflicts, and if a column is missing. If a column is missing in any file, then the schema sent to the readers should be OPTIONAL for that column. Further, if some Parquet files have REQUIRED, and some OPTIONAL, then the mode should be OPTIONAL. You know the drill (pardon the pun). Note, however, that deciding on a common schema is a **planner** task. With EVF, the planner would provide the schema (using the provided schema mechanism) to the reader and EVF would "do the right thing" to ensure that the reader's output schema matches that defined by the planner. There are many CSV file unit tests that show this in action. Calcite does a fine job working out the required types assuming that Drill provides column type information in its metadata. Does the new Parquet prescan feed into Calcite, or is it something done on the side? If on the side, then we'll run into issues where Calcite guesses one column type (based on operations) but the Parquet schema prescan works out some other type (based on the Parquet files.) For example, if I have a query `SELECT SQRT(foo)...`, Calcite will guess that the column has to be numeric. But, if the Parquet prescan sees that `foo` is `VARCHAR`, then there will be a runtime conflict rather than the planner sorting out the mess. Again, I'm ignorant of the prescan feature, so I don't know what might have been done there. Thus, the task for Parquet should be to recreate what EVF does (since EVF already fought these battles), but using the older column writers. That is: * The Parquet reader has to properly handle the missing column to avoid a schema change. In Drill, "schema change" means "rebind the vectors and regenerate any operator-specific code." * If the first file does not have the column, create a dummy column of the type specified by the planner. * If the first file does have a given column, create a vector _with the type specified by the planner_, not the type specified by Parquet. * If a subsequent file does not have the column, _reuse_ the prior vector, which should be of OPTIONAL mode. Note that Parquet must already have a way to reuse columns common to two files, else we'd get a schema change on each new file. (Sorry, my knowledge of the old column writers is rusty; it's been several years since I last looked at them.) In fact, given that Parquet is trying to solve the same problems as EVF, one might want to review the EVF unit tests and recreate those many cases using Parquet. "If one is ignorant of history one is doomed to repeat it." Thanks again for the explanations. Now that I better understand the problem (and am aware of that new Parquet prescan feature), I can do a better code review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org