[ https://issues.apache.org/jira/browse/DRILL-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884101#comment-15884101 ]
ASF GitHub Bot commented on DRILL-5275:
---------------------------------------

GitHub user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/754

> Sort spill serialization is slow due to repeated buffer allocations
> -------------------------------------------------------------------
>
>                 Key: DRILL-5275
>                 URL: https://issues.apache.org/jira/browse/DRILL-5275
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>              Labels: ready-to-commit
>             Fix For: 1.10.0
>
>
> Drill provides a sort operator that spills to disk. The spill and read
> operations use the serialization code in
> {{VectorAccessibleSerializable}}. This code, in turn, uses the
> {{DrillBuf.getBytes()}} method to write to an output stream. (Yes, the "get"
> method writes, and the "write" method reads...)
> The DrillBuf method turns around and calls the UDLE method that does:
> {code}
>     byte[] tmp = new byte[length];
>     PlatformDependent.copyMemory(addr(index), tmp, 0, length);
>     out.write(tmp);
> {code}
> That is, each write allocates a new heap buffer. Since Drill buffers
> can be quite large (4, 8, 16 MB or larger), these allocations rapidly fill
> the heap and trigger GC.
> The result is slow performance. On a Mac, with an SSD that can do 700 MB/s of
> I/O, we get only about 40 MB/s, very likely because of excessive CPU cost and
> GC.
> The solution is to allocate a single read or write buffer, then reuse that
> same buffer for every read or write. This must be done in
> {{VectorAccessibleSerializable}}, as it is a per-thread class that has
> visibility to all the buffers to be written.
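
The buffer-reuse approach described in the issue could look roughly like the sketch below. It is not the code from the pull request. It assumes a plain `java.nio.ByteBuffer` as a stand-in for `DrillBuf`, and the class name `ReusableBufferWriter`, its `scratchSize` parameter, and its `write` signature are hypothetical names chosen for illustration. The point is only that the heap array is allocated once and the data is copied through it in chunks, instead of allocating `new byte[length]` on every write.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not part of Drill): copies off-heap data to a stream
 * through a single heap buffer that is reused across calls.
 */
final class ReusableBufferWriter {
    // Allocated once per writer instead of once per write call.
    private final byte[] scratch;

    ReusableBufferWriter(int scratchSize) {
        if (scratchSize <= 0) {
            throw new IllegalArgumentException("scratchSize must be positive");
        }
        this.scratch = new byte[scratchSize];
    }

    /** Writes the bytes src[index, index + length) to out in scratch-sized chunks. */
    void write(ByteBuffer src, int index, int length, OutputStream out) throws IOException {
        // Work on a duplicate so the caller's position and limit are left untouched.
        ByteBuffer view = src.duplicate();
        view.limit(index + length);
        view.position(index);
        while (view.hasRemaining()) {
            int chunk = Math.min(view.remaining(), scratch.length);
            view.get(scratch, 0, chunk);   // source buffer -> reusable heap buffer
            out.write(scratch, 0, chunk);  // heap buffer -> output stream
        }
    }
}
```

An instance holds mutable scratch space, so in this sketch it would be owned by a single thread, in line with the issue's remark that the fix belongs in a per-thread class. The scratch size trades a small, fixed heap footprint against the number of `out.write` calls per vector.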