paul-rogers commented on PR #2937: URL: https://github.com/apache/drill/pull/2937#issuecomment-2322376645
> @paul-rogers @ychernysh I'm wondering if it wouldn't be worth it to refactor the Parquet reader to use EVF2 rather than debug all this. I don't know what was involved, but I do know that refactoring all the other plugins to use EVF2 wasn't all that difficult, but I do know that Parquet is another ball game.

Doing so would be the ideal solution. The challenge has always been that the Parquet reader is horribly complex. I took a few cracks at refactoring it back in the day, but it is still pretty complex.

The most challenging issue is that Parquet is parallel: it reads each column in a separate thread. MapR did a large number of hacks to maximize parallelism, so that code grew quite complex, with many nuances to saturate threads while not using too many resources overall: that is, to maximize parallelism within a single query, but also across the "thousands of concurrent queries" that were the rage back in the day.

All other readers are row-based, since that is how most other data formats work. EVF is a row-based implementation. As a result, EVF would be difficult to reuse. This raises the reason that EVF was created in the first place: limiting batch size to prevent memory fragmentation. Back in the day, all readers read 64K records per batch, even if that resulted in huge vectors. EVF imposes a batch size limit, and gracefully wraps up each batch, rolling over any "excess" data to the next one.

In Parquet, that logic does not exist. That is, if we have, say, 20 column writers all busy building their own vectors, there is nothing to say, "hold on, we're over our 16 MB batch size limit". Instead, the readers just read _n_ rows, creating whatever size vectors are required. Read 1,000 columns of 1 MB each and you need 1 GB of value vectors.

The memory fragmentation issue arises because Drill's Netty-based memory manager handles allocations up to 32 MB (IIRC) from its binary-buddy free list. Beyond that, every request comes from the OS.
Netty does not release memory back to the OS if even a single byte is in use from a 32 MB block. Eventually, all memory resides in the Netty free list, and we reach the OS allocation limit. As a result, we can have 100% of the Netty pool free, but no OS capacity to allocate another 64 MB vector, and we get an OOM. The only recourse is to restart Drill to return memory to the OS.

While we long ago fixed the fragmentation issues in the other readers (via EVF) and other operators (using the "temporary" "batch sizer" hack), it may be that Parquet still suffers from memory fragmentation issues because of its unique, parallel structure.

Still, perhaps there is some way to have EVF handle the schema and vector management stuff, but to disable the row-oriented batch size checks and let the Parquet readers write as much data as they want to each vector (fragmenting memory if they choose to do so). Or, maybe work out some way to give each column reader a "lease" to read up to x MB. EVF can handle the work needed to copy "extra" data over to a new vector for the next batch. I'd have to try to swap EVF knowledge back into the ole' brain to sort this out.

All this said, I can certainly see the argument for hacking the existing code just to get done. I guess I'd just suggest that the hacks at least reuse the rules we already worked out for EVF, even if they can't reuse the code.

All of this is premised on the notion that someone did, recently, add a "Parquet prescan" to the planner, and that someone added column type propagation to the Calcite planner. Was this actually done? Or are we somehow assuming it was done? Are we confusing this with the old Parquet schema cache? Again, I've been out of the loop, so I'm just verifying that I'm understanding the situation.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
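For what it's worth, the core EVF rule mentioned above (cap each batch's byte size and roll any "excess" rows over to the next batch) can be sketched in a few lines. This is purely a hypothetical illustration, not actual Drill or EVF code; `splitIntoBatches` and the 16 MB constant are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSizeLimiter {
    // Illustrative cap, echoing the 16 MB figure above; not a real Drill constant.
    static final int BATCH_LIMIT_BYTES = 16 * 1024 * 1024;

    /**
     * Splits a stream of per-row byte sizes into batches that respect the
     * byte limit. A row that would push the current batch over the limit is
     * "rolled over": it becomes the first row of the next batch.
     */
    static List<List<Integer>> splitIntoBatches(int[] rowSizes) {
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long bytesInBatch = 0;
        for (int rowBytes : rowSizes) {
            if (bytesInBatch + rowBytes > BATCH_LIMIT_BYTES && !current.isEmpty()) {
                // Wrap up this batch; the overflow row starts the next one.
                batches.add(current);
                current = new ArrayList<>();
                bytesInBatch = 0;
            }
            current.add(rowBytes);
            bytesInBatch += rowBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Three 10 MB rows: any two together exceed the 16 MB cap, so each
        // row rolls over into its own batch.
        int[] rows = {10_000_000, 10_000_000, 10_000_000};
        System.out.println(splitIntoBatches(rows).size()); // prints 3
    }
}
```

The point of the sketch is that the check is inherently row-oriented: there is one place to ask "are we over the limit?" per row. Parquet's per-column threads have no such single point, which is exactly the mismatch described above.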