Re: [PR] DRILL-8507, DRILL-8508 Better handling of partially missing parquet columns (drill)

via GitHub Mon, 02 Sep 2024 14:09:11 -0700


paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2325302884

@ychernysh, thank you for your detailed explanation. Let's focus in on one
point.

> The assignment happens in Foreman at parallelization phase
> ([Foreman#runPhysicalPlan:416](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L416),
> [AbstractParquetGroupScan#applyAssignments](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetGroupScan.java#L170-L173))
> and requires the files metadata to be known at that phase (we need to know
> what row groups are there in order to assign them to the minor fragments).

It is surprising that none of the "second generation" Drill developers ever
knew about, or mentioned, that Drill scans Parquet files at plan time. Of
course, it could be that I just never understood what someone was saying. We
used to wrestle with inconsistent schemas all the time, so it is surprising if
the solution was available the whole time. That's why, if this code exists, I
suspect it must have been added either very early (by a "first generation"
developer who later left) or within the last few years.

Another reason it is surprising that we have such code is the big deal we
make of being "schema free." Of course, "schema free" has problems. Why would
we not have mentioned that "schema free" means "infer the schema at plan time"
if doing so would solve the schema inconsistency issues? Amazing...

If such code exists, then it should have been integrated not just into
parallelization planning, but also Calcite type propagation, and the schema
included in the physical plan sent to the Parquet readers. I suppose whoever
added it could have just been focused on parallelization, and hoped Drill's
"magic" would handle the schema. In fact, the "missing" type propagation code
is the very code that you're now adding, though, it seems, without using
Calcite for the type propagation.

The discussion we are having depends _entirely_ on whether schema
information is available at plan time. Before I comment further, you've given
me some homework: I'll look at that code to determine if it really does scan
all the file headers at plan time.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
