[ https://issues.apache.org/jira/browse/DRILL-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers updated DRILL-5636: ------------------------------- Description: The external sort spills data to disk under memory pressure. The sort code uses a generic mechanism to do the spilling: * Use a "priority queue copier" to copy sorted records into a new batch * Spill the new batch by writing the vectors for the newly-created batch The above works fine when memory is plentiful. But, under low-memory conditions, the intermediate copy can cause OOM errors. An improved algorithm is: * Priority queue copier works vector-by-vector * Serialize each vector to disk * Release its memory * Repeat for the next vector The advantages of the above: * Less intermediate memory use * Perhaps better CPU cache performance through greater locality (all writes happen to a single vector at a time, rather than row by row) * No change in disk format or disk write performance (because data is buffered prior to write anyway.) was: The external sort spills data to disk under memory pressure. The sort code uses a generic mechanism to do the spilling: * Use a "priority queue copier" to copy sorted records into a new batch * Spill the new batch by writing the vectors for the newly-created batch The above works fine when memory is plentiful. But, under low-memory conditions, the intermediate copy can cause OOM errors. An improved algorithm is: * Priority queue copier works vector-by-vector * Serialize each vector to disk * Release its memory * Repeat for the next vector The advantages of the above: * Less intermediate memory use * Perhaps better CPU cache performance through greater locality (all writes happen to a single vector at a time, rather than row by row) * No change in disk format or disk write performance (because data is buffered prior to write anyway.) * > Reduce external sort memory use when spilling > --------------------------------------------- > > Key: DRILL-5636 > URL: https://issues.apache.org/jira/browse/DRILL-5636 > Project: Apache Drill > Issue Type: Improvement > Affects Versions: 1.8.0 > Reporter: Paul Rogers > Assignee: Paul Rogers > Priority: Minor > > The external sort spills data to disk under memory pressure. The sort code > uses a generic mechanism to do the spilling: > * Use a "priority queue copier" to copy sorted records into a new batch > * Spill the new batch by writing the vectors for the newly-created batch > The above works fine when memory is plentiful. But, under low-memory > conditions, the intermediate copy can cause OOM errors. > An improved algorithm is: > * Priority queue copier works vector-by-vector > * Serialize each vector to disk > * Release its memory > * Repeat for the next vector > The advantages of the above: > * Less intermediate memory use > * Perhaps better CPU cache performance through greater locality (all writes > happen to a single vector at a time, rather than row by row) > * No change in disk format or disk write performance (because data is > buffered prior to write anyway.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)