paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2322288590

   @ychernysh, thanks for the clarification of the column name issue. I agree 
that the name stored in the vector should not contain backticks.
   
   I am unaware of any code in the Foreman that scans all the Parquet files. Is 
that something new? Doing that in the Foreman places extreme load on the 
Foreman, causes remote reads in Hadoop, and will be slow. Drill relies on 
statistical averages to balance load: each node does about the same amount of 
work. Doing a large prescan in the Foreman invalidates this assumption. The 
feature is fine for Drill run as a single process on a single machine for a 
single user. It will cause hotspots and resource issues on a busy, distributed 
cluster.
   
   The old, original Parquet schema cache did, in fact, read all the files in a 
folder, but did so in a distributed fashion, and wrote the result to a JSON 
file. (But without locking, so that two updates at the same time led to 
corruption.) Are we talking about a new feature recently added in the last few 
years? Or, about the original Parquet schema cache?
   
   If this feature does exist, then we now have six ways that Drill handles 
schema (Parquet cache, provided schema, Drill Metastore, HMS, on-the-fly, and 
this new prescan of Parquet files). If a schema is available, then I agree with 
your analysis. The planner should detect column type conflicts and determine 
whether a column is missing from any file. If a column is missing in any file, 
then the schema sent to the
readers should be OPTIONAL for that column. Further, if some Parquet files have 
REQUIRED, and some OPTIONAL, then the mode should be OPTIONAL. You know the 
drill (pardon the pun).
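
   The mode-merge rule above amounts to a simple fold over the files' schemas. 
Here is a minimal sketch in Java; the class and method names are illustrative, 
not Drill's actual metadata API:

```java
// Hypothetical sketch of the mode-merge rule: a column that is missing
// from any file, or OPTIONAL in any file, must be OPTIONAL in the
// common schema the planner hands to the readers.
public class ModeMerge {
  enum Mode { REQUIRED, OPTIONAL }

  /**
   * Merge the mode observed in one file into the running common mode.
   * {@code present} is false when the file lacks the column entirely,
   * in which case {@code fileMode} is ignored.
   */
  static Mode merge(Mode common, Mode fileMode, boolean present) {
    if (!present || fileMode == Mode.OPTIONAL || common == Mode.OPTIONAL) {
      return Mode.OPTIONAL;
    }
    return Mode.REQUIRED;
  }
}
```

   Folding this over all files gives REQUIRED only when every file has the 
column and every file declares it REQUIRED; anything else degrades to OPTIONAL.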
   
   Note, however, that deciding on a common schema is a **planner** task. With 
EVF, the planner would provide the schema (using the provided schema mechanism) 
to the reader and EVF would "do the right thing" to ensure that the reader's 
output schema matches that defined by the planner. There are many CSV file unit 
tests that show this in action.
   
   Calcite does a fine job working out the required types assuming that Drill 
provides column type information in its metadata. Does the new Parquet prescan 
feed into Calcite, or is it something done on the side? If on the side, then 
we'll run into issues where Calcite guesses one column type (based on 
operations) but the Parquet schema prescan works out some other type (based on 
the Parquet files). For example, if I have a query `SELECT SQRT(foo)...`, 
Calcite will guess that the column has to be numeric. But, if the Parquet 
prescan sees that `foo` is `VARCHAR`, then there will be a runtime conflict 
rather than the planner sorting out the mess. Again, I'm ignorant of the 
prescan feature, so I don't know what might have been done there.
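
   To make the concern concrete, here is a hypothetical sketch of plan-time 
reconciliation between the type Calcite infers from the expression and the 
type the prescan reports from the files. The names are illustrative; this is 
not an actual Drill or Calcite API:

```java
// Hypothetical plan-time reconciliation. If the Calcite-inferred type
// and the prescan type disagree, failing in the planner gives the user
// a clear error; letting both proceed independently defers the
// conflict to run time, which is the problem described above.
public class TypeReconciler {
  static String reconcile(String calciteType, String prescanType) {
    if (calciteType.equals(prescanType)) {
      return calciteType;
    }
    // Fail early, at plan time, with a useful message.
    throw new IllegalStateException(
        "Column type conflict: query implies " + calciteType
        + " but files contain " + prescanType);
  }
}
```

   For the `SQRT(foo)` example, Calcite would infer a numeric type while the 
prescan reports `VARCHAR`, so the planner would reject the query (or insert a 
cast) instead of letting the readers hit the mismatch.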
   
   Thus, the task for Parquet should be to recreate what EVF does (since EVF 
already fought these battles), but using the older column writers. That is:
   
   * The Parquet reader has to properly handle the missing column to avoid a 
schema change. In Drill, "schema change" means "rebind the vectors and 
regenerate any operator-specific code."
   * If the first file does not have the column, create a dummy column of the 
type specified by the planner.
   * If the first file does have a given column, create a vector _with the type 
specified by the planner_, not the type specified by Parquet.
   * If a subsequent file does not have the column, _reuse_ the prior vector, 
which should be of OPTIONAL mode.
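
   The four steps above can be sketched as per-column state carried across 
files. This is a hypothetical illustration of the intended behavior, not 
Drill's actual reader or column-writer code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-column resolution across a sequence of Parquet
// files. The planner's type always wins; the vector created for the
// first file is reused for later files so downstream operators never
// see a schema change.
public class ColumnResolver {
  static class ColumnState {
    final String name;
    final String plannerType;  // type chosen by the planner, not Parquet
    boolean isDummy;           // true when filled with NULLs for this file

    ColumnState(String name, String plannerType, boolean isDummy) {
      this.name = name;
      this.plannerType = plannerType;
      this.isDummy = isDummy;
    }
  }

  private final Map<String, ColumnState> vectors = new HashMap<>();

  /** Resolve one projected column against the next file's schema. */
  ColumnState resolve(String name, String plannerType, boolean inFile) {
    ColumnState prior = vectors.get(name);
    if (prior == null) {
      // First file: create the vector with the planner's type whether
      // or not the file has the column; a missing column becomes a
      // dummy (all-NULL) vector of that type.
      prior = new ColumnState(name, plannerType, !inFile);
      vectors.put(name, prior);
    } else {
      // Subsequent files: reuse the prior vector (OPTIONAL mode),
      // switching between real data and NULL fill as needed.
      prior.isDummy = !inFile;
    }
    return prior;
  }
}
```

   The key invariant is that `resolve` returns the same `ColumnState` object 
for every file, which is what prevents a vector rebind and code regeneration 
downstream.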
   
   Note that Parquet must already have a way to reuse columns common to two 
files, else we'd get a schema change on each new file. (Sorry, my knowledge of 
the old column writers is rusty; it's been several years since I last looked at 
them.)
   
   In fact, given that Parquet is trying to solve the same problems as EVF, one 
might want to review the EVF unit tests and recreate those many cases using 
Parquet. "If one is ignorant of history one is doomed to repeat it."
   
   Thanks again for the explanations. Now that I better understand the problem 
(and am aware of that new Parquet prescan feature), I can do a better code 
review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
