[ 
https://issues.apache.org/jira/browse/DRILL-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873945#comment-15873945
 ] 

ASF GitHub Bot commented on DRILL-5275:
---------------------------------------

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/754

    DRILL-5275: Sort spill is slow due to repeated allocations

    DRILL-5275 - Sort spill serialization is slow due to repeated buffer
    allocations
    
    Rather than creating a heap buffer per vector when writing and reading,
    the revised code creates a single shared buffer that is used for all
    I/O within a particular container. This improves performance by
    reducing GC and CPU costs during I/O.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5275

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/754.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #754
    
----
commit 60b95f777eefe343fd49e380d03128090fd96a7a
Author: Paul Rogers <prog...@maprtech.com>
Date:   2017-02-20T01:53:31Z

    DRILL-5275 Sort spill is slow due to repeated allocations
    

----


> Sort spill serialization is slow due to repeated buffer allocations
> -------------------------------------------------------------------
>
>                 Key: DRILL-5275
>                 URL: https://issues.apache.org/jira/browse/DRILL-5275
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.10.0
>
>
> Drill provides a sort operator that spills to disk. The spill and read
> operations use the serialization code in
> {{VectorAccessibleSerializable}}. This code, in turn, uses the
> {{DrillBuf.getBytes()}} method to write to an output stream. (Yes, the "get"
> method writes, and the "write" method reads...)
> The {{DrillBuf}} method then calls the UDLE ({{UnsafeDirectLittleEndian}})
> method, which does:
> {code}
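>             // A new heap array is allocated on every call, sized to the full write.
>             byte[] tmp = new byte[length];
>             // Copy from direct memory into the heap array, then write it out.
>             PlatformDependent.copyMemory(addr(index), tmp, 0, length);
>             out.write(tmp);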
> {code}
> That is, each write allocates a new heap buffer. Since Drill buffers
> can be quite large (4, 8, 16 MB or larger), this rapidly fills the heap
> and triggers GC.
> The result is slow performance. On a Mac with an SSD that can do 700 MB/s of
> I/O, we get only about 40 MB/s, very likely because of excessive CPU cost and
> GC.
> The solution is to allocate a single read or write buffer, then reuse that
> same buffer for every read or write. This must be done in
> {{VectorAccessibleSerializable}}, as it is a per-thread class that has
> visibility into all the buffers to be written.
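
The buffer reuse described above can be sketched roughly as follows. This is
only an illustration, not the code from the pull request: the class name
SharedBufferIO, its methods, and the use of java.nio.ByteBuffer as a stand-in
for DrillBuf are all assumptions made for the example.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not a Drill class): copies between direct buffers and
 * streams through one heap array that is allocated once and then reused.
 */
final class SharedBufferIO {
    // Single scratch array shared by every read and write on this instance.
    // Like the per-thread class in the issue, it is not safe to share across threads.
    private final byte[] scratch;

    SharedBufferIO(int scratchSize) {
        if (scratchSize <= 0) {
            throw new IllegalArgumentException("scratchSize must be positive");
        }
        this.scratch = new byte[scratchSize];
    }

    /** Writes the remaining bytes of src to out in scratch-sized chunks. */
    void write(ByteBuffer src, OutputStream out) throws IOException {
        // Work on a duplicate so the caller's position and limit are untouched.
        ByteBuffer view = src.duplicate();
        while (view.hasRemaining()) {
            int n = Math.min(view.remaining(), scratch.length);
            view.get(scratch, 0, n);   // direct memory -> reused heap array
            out.write(scratch, 0, n);  // heap array -> stream
        }
    }

    /** Reads exactly length bytes from in and appends them to dst. */
    void read(InputStream in, ByteBuffer dst, int length) throws IOException {
        int remaining = length;
        while (remaining > 0) {
            int n = in.read(scratch, 0, Math.min(remaining, scratch.length));
            if (n < 0) {
                throw new EOFException("Stream ended with " + remaining + " bytes still expected");
            }
            dst.put(scratch, 0, n);    // reused heap array -> direct memory
            remaining -= n;
        }
    }
}
```

With this shape, one SharedBufferIO instance would be created per serializer
and passed each vector's buffer in turn, so the heap allocation cost is paid
once instead of once per write. The scratch size trades memory for the number
of copy iterations; the value to use is a tuning choice this sketch does not
try to settle.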



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
