[ https://issues.apache.org/jira/browse/DRILL-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873945#comment-15873945 ]
ASF GitHub Bot commented on DRILL-5275:
---------------------------------------

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/754

    DRILL-5275: Sort spill is slow due to repeated allocations

    DRILL-5275 - Sort spill serialization is slow due to repeated buffer allocations

    Rather than create a heap buffer per vector when writing and reading, the revised code creates a single, shared buffer used for all I/O within a particular container. This improves performance by reducing GC and CPU costs during I/Os.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5275

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/754.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #754

----
commit 60b95f777eefe343fd49e380d03128090fd96a7a
Author: Paul Rogers <prog...@maprtech.com>
Date:   2017-02-20T01:53:31Z

    DRILL-5275 Sort spill is slow due to repeated allocations

    DRILL-5275 - Sort spill serialization is slow due to repeated buffer allocations

    Rather than create a heap buffer per vector when writing and reading, the revised code creates a single, shared buffer used for all I/O within a particular container. This improves performance by reducing GC and CPU costs during I/Os.

----

> Sort spill serialization is slow due to repeated buffer allocations
> -------------------------------------------------------------------
>
>                 Key: DRILL-5275
>                 URL: https://issues.apache.org/jira/browse/DRILL-5275
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.10.0
>
>
> Drill provides a sort operator that spills to disk. The spill and read
> operations use the serialization code in the
> {{VectorAccessibleSerializable}}.
> This code, in turn, uses the
> {{DrillBuf.getBytes()}} method to write to an output stream. (Yes, the "get"
> method writes, and the "write" method reads...)
> The DrillBuf method turns around and calls the UDLE method that does:
> {code}
> byte[] tmp = new byte[length];
> PlatformDependent.copyMemory(addr(index), tmp, 0, length);
> out.write(tmp);
> {code}
> That is, for each write the code allocates a heap buffer. Since Drill buffers
> can be quite large (4, 8, 16 MB or larger), the above rapidly fills the heap
> and causes GC.
> The result is slow performance. On a Mac, with an SSD that can do 700 MB/s of
> I/O, we get only about 40 MB/s. Very likely because of excessive CPU cost and
> GC.
> The solution is to allocate a single read or write buffer, then use that same
> buffer over and over when reading or writing. This must be done in
> {{VectorAccessibleSerializable}} as it is a per-thread class that has
> visibility to all the buffers to be written.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
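
The approach described in the issue (one scratch buffer reused for every write, instead of a fresh byte[] per vector) can be sketched roughly as follows. This is an illustration only and not the code from the pull request: the class SharedBufferWriter, its scratch field, and its write method are hypothetical names, and a direct java.nio.ByteBuffer stands in for a DrillBuf so that the example depends on nothing but the JDK.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch, not Drill code: writes off-heap buffers to a stream
 * through one heap buffer that is allocated once and then reused.
 */
final class SharedBufferWriter {
    // Single shared scratch buffer; its size is fixed for the writer's lifetime.
    private final byte[] scratch;

    SharedBufferWriter(int scratchSize) {
        if (scratchSize <= 0) {
            throw new IllegalArgumentException("scratchSize must be positive");
        }
        this.scratch = new byte[scratchSize];
    }

    /**
     * Copies the remaining bytes of src to out in chunks of at most
     * scratch.length bytes. No per-call heap array is allocated, however
     * large src is. Returns the number of bytes written.
     */
    long write(ByteBuffer src, OutputStream out) throws IOException {
        // Work on a duplicate so the caller's position and limit are untouched.
        ByteBuffer view = src.duplicate();
        long total = 0;
        while (view.hasRemaining()) {
            int n = Math.min(view.remaining(), scratch.length);
            view.get(scratch, 0, n);   // off-heap (or heap) -> shared scratch
            out.write(scratch, 0, n);  // shared scratch -> stream
            total += n;
        }
        return total;
    }
}
```

One writer instance would be created per serializing thread and passed each buffer in turn. Because the scratch array is mutable shared state, the sketch is not safe to share across threads, which matches the issue's remark that the buffer belongs in a per-thread class.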