paul-rogers commented on PR #2937: URL: https://github.com/apache/drill/pull/2937#issuecomment-2325302884
@ychernysh, thank you for your detailed explanation. Let's focus on one point.

> The assignment happens in Foreman at parallelization phase ([Foreman#runPhysicalPlan:416](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L416), [AbstractParquetGroupScan#applyAssignments](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetGroupScan.java#L170-L173)) and requires the files metadata to be known at that phase (we need to know what row groups are there in order to assign them to the minor fragments).

It is surprising that none of the "second generation" Drill developers ever knew about, or mentioned, that Drill scans Parquet files at plan time. Of course, it could be that I just never understood what someone was saying. We used to wrestle with inconsistent schemas all the time, so it is surprising if the solution was available the whole time. That's why, if this code exists, I suspect it must have been added either very early (by a "first generation" developer who later left) or within the last few years.

Another reason it is surprising that we have such code is the big deal we make of being "schema free." Of course, "schema free" has problems. Why would we not have mentioned that "schema free" means "infer the schema at plan time" if doing so would solve the schema inconsistency issues? Amazing...

If such code exists, then it should have been integrated not just into parallelization planning, but also into Calcite type propagation, with the schema included in the physical plan sent to the Parquet readers. I suppose whoever added it could have been focused only on parallelization, and hoped Drill's "magic" would handle the schema. In fact, the "missing" type propagation code is the very code that you're now adding, though, it seems, without using Calcite for the type propagation.

The discussion we are having depends _entirely_ on whether schema information is available at plan time. Before I comment further, you've given me some homework: I'll look at that code to determine if it really does scan all the file headers at plan time.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org