paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2322376645

   > @paul-rogers @ychernysh I'm wondering if it wouldn't be worth it to 
refactor the Parquet reader to use EVF2 rather than debug all this. I don't 
know what would be involved, but I do know that refactoring all the other 
plugins to use EVF2 wasn't all that difficult. Parquet, though, is another 
ball game.
   
   Doing so would be the ideal solution. The challenge has always been that the 
Parquet reader is horribly complex. I took a few cracks at refactoring it back 
in the day, but it remains pretty complex.
   
   The most challenging issue is that Parquet is parallel: it reads each column 
in a separate thread. MapR did a large number of hacks to maximize parallelism, 
so that code grew quite complex with many nuances to saturate threads while not 
using too many resources overall: that is, to maximize parallelism within a 
single query, but also across the "thousands of concurrent queries" that were 
the rage back in the day.
   
   All other readers are row-based since that is how most other data formats 
work. EVF is a row-based implementation. As a result, EVF would be difficult to 
reuse.
   
   This brings up the reason that EVF was created in the first place: limiting 
batch size to prevent memory fragmentation. Back in the day, all readers read 
64K records per batch, even if that resulted in huge vectors. EVF imposes a 
batch size limit and gracefully wraps up each batch, rolling over any "excess" 
data to the next one.
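   The wrap-up-and-roll-over behavior can be sketched roughly as follows. This 
is a toy illustration, not the real EVF API; the class name, the 16 MB limit, 
and the 1 MB per-row cost are all assumptions made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of EVF-style batch limiting: write rows until the next row
// would push the batch over the byte limit, wrap up the batch, and roll
// the "excess" row over into the next batch.
public class BatchLimitSketch {

  public static List<Long> splitIntoBatches(int rows, long rowBytes, long limit) {
    List<Long> batches = new ArrayList<>();
    long current = 0;
    for (int r = 0; r < rows; r++) {
      if (current > 0 && current + rowBytes > limit) {
        batches.add(current);   // gracefully close the full batch
        current = 0;            // this row rolls over to the next batch
      }
      current += rowBytes;      // write the row into the current batch
    }
    if (current > 0) {
      batches.add(current);     // final partial batch
    }
    return batches;
  }

  public static void main(String[] args) {
    // 40 rows of 1 MB against a 16 MB cap -> batches of 16, 16, and 8 MB.
    System.out.println(splitIntoBatches(40, 1L << 20, 16L << 20));
  }
}
```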
   
   In Parquet, that logic does not exist. That is, if we have, say, 20 column 
writers all busy building their own vectors, there is nothing to say, "hold 
on, we're over our 16 MB batch size limit." Instead, the readers just read _n_ 
rows, creating whatever size vectors are required. Read 1,000 values of 1 MB 
each into a column and you need a 1 GB value vector.
   
   The memory fragmentation issue arises because Drill's Netty-based memory 
manager handles allocations up to 32 MB (IIRC) from its binary-buddy free list. 
Beyond that, every request comes from the OS. Netty does not release memory 
back to the OS if even a single byte is in use from a 32 MB block. Eventually, 
all memory resides in the Netty free list, and we reach the OS allocation 
limit. As a result, we can have 100% of the Netty pool free but no OS capacity 
to allocate another 64 MB vector, and we get an OOM. The only recourse is to 
restart Drill to return memory to the OS.
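   The failure mode can be illustrated with a toy simulation. This models only 
the description above, not Drill's or Netty's actual allocator; the class, the 
method names, and the 256 MB "OS budget" are invented for the example:

```java
// Toy model of the fragmentation trap: small allocations are served from
// pooled 32 MB chunks that, once taken from the OS, are never given back
// (one live byte is enough to pin a chunk, so we model "never returned").
// Large allocations bypass the pool and draw directly on the OS budget,
// so a fully-free-but-pinned pool can still starve them.
public class FragmentationSketch {
  static final long CHUNK = 32L << 20;  // pooled chunk size (32 MB, per the text)
  long osBudget;                        // bytes the "OS" will still grant us
  long pooledFree = 0;                  // free bytes sitting inside pooled chunks

  FragmentationSketch(long osBudget) { this.osBudget = osBudget; }

  // Small allocation: served from pooled chunks, taking a fresh 32 MB
  // chunk from the OS budget when the pool runs dry.
  boolean allocSmall(long bytes) {
    if (pooledFree < bytes) {
      if (osBudget < CHUNK) return false;
      osBudget -= CHUNK;
      pooledFree += CHUNK;
    }
    pooledFree -= bytes;
    return true;
  }

  // Freed bytes go back to the pool's free list, never to the OS.
  void freeSmall(long bytes) { pooledFree += bytes; }

  // Large (> 32 MB) allocation: must come straight from the OS budget.
  boolean allocLarge(long bytes) {
    if (osBudget < bytes) return false;
    osBudget -= bytes;
    return true;
  }

  public static void main(String[] args) {
    FragmentationSketch pool = new FragmentationSketch(256L << 20); // 256 MB budget
    for (int i = 0; i < 8; i++) pool.allocSmall(32L << 20);  // pool eats all 256 MB
    for (int i = 0; i < 8; i++) pool.freeSmall(32L << 20);   // everything freed...
    // ...the pool is now 100% free, yet a 64 MB vector cannot be allocated.
    System.out.println(pool.pooledFree == 256L << 20);  // true
    System.out.println(pool.allocLarge(64L << 20));     // false: the "OOM"
  }
}
```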
   
   While we long ago fixed the fragmentation issues in the other readers (via 
EVF) and other operators (using the "temporary" "batch sizer" hack), it may be 
that Parquet still suffers from memory fragmentation issues because of its 
unique, parallel structure.
   
   Still, perhaps there is some way to have EVF handle the schema and vector 
management stuff, but to disable the row-oriented batch size checks and let the 
Parquet readers write as much data as they want to each vector (fragmenting 
memory if they choose to do so). Or, maybe work out some way to give each column 
reader a "lease" to read up to x MB. EVF can handle the work needed to copy 
"extra" data over to a new vector for the next batch. I'd have to swap the 
EVF knowledge back into the ole' brain to sort this out.
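   A minimal sketch of what such a per-column "lease" might look like, assuming 
only that each column reader runs in its own thread and checks a byte budget 
before each write; the `ColumnLease` name and API are hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-column lease: a byte budget a parallel column reader
// draws down as it writes, instead of coordinating on a shared row count.
public class ColumnLease {
  private final AtomicLong remaining;  // bytes this column may still write

  public ColumnLease(long leaseBytes) {
    this.remaining = new AtomicLong(leaseBytes);
  }

  // Thread-safe check-and-reserve: the column thread calls this before
  // each value it decodes. A false return means the lease is exhausted
  // and the reader should stop so the batch can be wrapped up.
  public boolean tryReserve(long bytes) {
    while (true) {
      long cur = remaining.get();
      if (cur < bytes) {
        return false;
      }
      if (remaining.compareAndSet(cur, cur - bytes)) {
        return true;
      }
    }
  }

  public static void main(String[] args) {
    ColumnLease lease = new ColumnLease(10L << 20);  // 10 MB lease
    System.out.println(lease.tryReserve(6L << 20));  // true: 4 MB left
    System.out.println(lease.tryReserve(6L << 20));  // false: over budget
    System.out.println(lease.tryReserve(4L << 20));  // true: lease now empty
  }
}
```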
   
   All this said, I can certainly see the argument for hacking the existing 
code just to get done. I guess I'd just suggest that the hacks at least reuse 
the rules we already worked out for EVF, even if they can't reuse the code.
   
   All of this is premised on the notion that someone did, recently, add a 
"Parquet prescan" to the planner, and that someone added column type 
propagation to the Calcite planner. Was this actually done? Or, are we somehow 
assuming it was done? Are we confusing this with the old Parquet schema cache? 
Again, I've been out of the loop so I'm just verifying I'm understanding the 
situation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
