jnturton commented on issue #2421: URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751568
Paul Rogers wrote:

One last note. Let's assume we wanted to adopt the row-based format (or, the myths being strong, we want to adopt Arrow). How would we go about it?

The "brute force" approach is to rewrite all the operators. Most deal with low-level vector code, so we'd rewrite that with low-level row (or Arrow) code. Since we can't really test until all operators are converted, we'd have to do the entire conversion in one huge effort. Then, we get to debug. I hope this approach is setting off alarm bells: it is high cost and high risk. This is why Drill never seriously entertained the change.

But, there is another solution. The scan readers all used to work directly with vectors. (Parquet still does.) Because of the memory reasons explained above, we converted most of them to use EVF. As a result, we could swap vectors for row pages (or Arrow) by changing only the low-level code beneath EVF. Readers would be blissfully ignorant of such changes because the higher-level abstractions would be unchanged.

So, a more sane way to approach a change of in-memory representations is to first convert the other operators to use an EVF-like approach. (EVF for writing new batches, a "Result Set Loader" for reading existing batches.) Such a change can be done gradually, operator-by-operator, and is fully compatible with other, non-converted operators. No big bang.

Once everything is upgraded to EVF, then we can swap out the in-memory format. Maybe try Arrow. Try a row-based format. Run tests. Pick the winner.

This is *not* a trivial exercise, but it is doable over time, if we see value and can muster the resources.
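To make the layering argument concrete, here is a minimal sketch of the idea, not Drill's actual EVF API. Every name in it (`BatchWriter`, `ColumnarBatchWriter`, `RowPageBatchWriter`, `ToyReader`) is hypothetical and chosen only for illustration. The reader code is written once against the abstraction; the in-memory layout is swapped underneath it without the reader changing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an EVF-style abstraction (NOT Drill's real API).
// Operators and readers would see only this interface, never the storage layout.
interface BatchWriter {
    void writeRow(int id, String name);
    int rowCount();
    String nameAt(int row);
    String layout();
}

// Columnar layout: one list per column, loosely analogous to value vectors.
class ColumnarBatchWriter implements BatchWriter {
    private final List<Integer> ids = new ArrayList<>();
    private final List<String> names = new ArrayList<>();

    public void writeRow(int id, String name) {
        ids.add(id);
        names.add(name);
    }

    public int rowCount() { return ids.size(); }
    public String nameAt(int row) { return names.get(row); }
    public String layout() { return "columnar"; }
}

// Row layout: one record per row, loosely analogous to a row page.
class RowPageBatchWriter implements BatchWriter {
    private final List<Object[]> rows = new ArrayList<>();

    public void writeRow(int id, String name) {
        rows.add(new Object[] { id, name });
    }

    public int rowCount() { return rows.size(); }
    public String nameAt(int row) { return (String) rows.get(row)[1]; }
    public String layout() { return "row page"; }
}

// A toy "reader" written once against the abstraction.
// It is unaware of which layout sits underneath.
class ToyReader {
    static void load(BatchWriter writer) {
        writer.writeRow(1, "fred");
        writer.writeRow(2, "wilma");
    }
}

class FormatSwapSketch {
    public static void main(String[] args) {
        BatchWriter[] writers = { new ColumnarBatchWriter(), new RowPageBatchWriter() };
        for (BatchWriter w : writers) {
            ToyReader.load(w); // same reader code, different in-memory format
            System.out.println(w.layout() + ": " + w.rowCount() + " rows");
        }
    }
}
```

In this toy version, trying a different format means adding one more `BatchWriter` implementation and rerunning the same tests. That is the property the gradual, operator-by-operator conversion is meant to buy.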
