paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2325302884

   @ychernysh, thank you for your detailed explanation. Let's focus in on one 
point.
   
   > The assignment happens in Foreman at parallelization phase 
([Foreman#runPhysicalPlan:416](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L416), 
[AbstractParquetGroupScan#applyAssignments](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetGroupScan.java#L170-L173)) 
and requires the files metadata to be known at that phase (we need to know 
what row groups are there in order to assign them to the minor fragments).
   
   It is surprising that none of the "second generation" Drill developers ever 
knew about, or mentioned, that Drill scans Parquet files at plan time. Of 
course, it could be that I just never understood what someone was saying. We 
used to wrestle with inconsistent schemas all the time, so it is surprising if 
the solution was available the whole time. That's why, if this code exists, I 
suspect it must have been added either very early (by a "first generation" 
developer who later left) or within the last few years.
   
   Another reason it is surprising that we have such code is the big deal we 
make of being "schema free." Of course, "schema free" has problems. Why would 
we not have mentioned that "schema free" means "infer the schema at plan time" 
if doing so would solve the schema inconsistency issues? Amazing...
   
   If such code exists, then it should have been integrated not just into 
parallelization planning, but also Calcite type propagation, and the schema 
included in the physical plan sent to the Parquet readers. I suppose whoever 
added it could have just been focused on parallelization, and hoped Drill's 
"magic" would handle the schema. In fact, the "missing" type propagation code 
is the very code that you're now adding, though, it seems, without using Calcite 
for the type propagation.
   
   The discussion we are having depends _entirely_ on whether schema 
information is available at plan time. Before I comment further, you've given 
me some homework: I'll look at that code to determine if it really does scan 
all the file headers at plan time.
   


