[ 
https://issues.apache.org/jira/browse/DRILL-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5636:
-------------------------------
    Description: 
The external sort spills data to disk under memory pressure. The sort code uses 
a generic mechanism to do the spilling:

* Use a "priority queue copier" to copy sorted records into a new batch
* Spill the new batch by writing the vectors for the newly-created batch

The above works fine when memory is plentiful. But, under low-memory 
conditions, the intermediate copy can cause OOM errors.

An improved algorithm is:

* Priority queue copier works vector-by-vector
* Serialize each vector to disk
* Release its memory
* Repeat for the next vector

The advantages of the above:

* Less intermediate memory use
* Perhaps better CPU cache performance through greater locality (all writes 
happen to a single vector at a time, rather than row by row)
* No change in disk format or disk write performance (because data is buffered 
prior to write anyway.)

  was:
The external sort spills data to disk under memory pressure. The sort code uses 
a generic mechanism to do the spilling:

* Use a "priority queue copier" to copy sorted records into a new batch
* Spill the new batch by writing the vectors for the newly-created batch

The above works fine when memory is plentiful. But, under low-memory 
conditions, the intermediate copy can cause OOM errors.

An improved algorithm is:

* Priority queue copier works vector-by-vector
* Serialize each vector to disk
* Release its memory
* Repeat for the next vector

The advantages of the above:

* Less intermediate memory use
* Perhaps better CPU cache performance through greater locality (all writes 
happen to a single vector at a time, rather than row by row)
* No change in disk format or disk write performance (because data is buffered 
prior to write anyway.)
* 


> Reduce external sort memory use when spilling
> ---------------------------------------------
>
>                 Key: DRILL-5636
>                 URL: https://issues.apache.org/jira/browse/DRILL-5636
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> The external sort spills data to disk under memory pressure. The sort code 
> uses a generic mechanism to do the spilling:
> * Use a "priority queue copier" to copy sorted records into a new batch
> * Spill the new batch by writing the vectors for the newly-created batch
> The above works fine when memory is plentiful. But, under low-memory 
> conditions, the intermediate copy can cause OOM errors.
> An improved algorithm is:
> * Priority queue copier works vector-by-vector
> * Serialize each vector to disk
> * Release its memory
> * Repeat for the next vector
> The advantages of the above:
> * Less intermediate memory use
> * Perhaps better CPU cache performance through greater locality (all writes 
> happen to a single vector at a time, rather than row by row)
> * No change in disk format or disk write performance (because data is 
> buffered prior to write anyway.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to