[jira] [Commented] (FLINK-4867) Investigate code generation for improving sort performance

Fabian Hueske (JIRA) Mon, 23 Jan 2017 02:38:13 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834242#comment-15834242
 ]


Fabian Hueske commented on FLINK-4867:
--------------------------------------

Hi [~heytitle], wow these are impressive numbers! 

Looking forward to the design document.
Best, Fabian

> Investigate code generation for improving sort performance
> ----------------------------------------------------------
>
>                 Key: FLINK-4867
>                 URL: https://issues.apache.org/jira/browse/FLINK-4867
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Local Runtime
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
>            Priority: Minor
>              Labels: performance
>         Attachments: Evaluation of Code Generation Approach for improving  
> Flink’s NormalizedKeySorter.pdf
>
>
> This issue is for investigating whether code generation could speed up 
> sorting. We should make some performance measurements on hand-written code 
> that is similar to what we could generate, to see whether investing more time 
> into this is worth it. If we find that it is worth it, we can open a second 
> Jira for the actual implementation of the code generation.
> I think we could generate one class at places where we currently instantiate 
> {{QuickSort}}. This generated class would include the functionality of 
> {{QuickSort}}, {{NormalizedKeySorter}} or {{FixedLengthRecordSorter}}, 
> {{MemorySegment.compare}}, and {{MemorySegment.swap}}.
> Btw. I'm planning to give this as a student project at a TU Berlin course in 
> the next few months.
> Some concrete ideas about how could a generated sorter be faster than the 
> current sorting code:
> * {{MemorySegment.compare}} could be specialized for
> ** Length: for small records, the loop could be unrolled
> ** Endiannes (currently it is optimized for big endian; and in the little 
> endian case (e.g. x86) it does a Long.reverseBytes for each long read)
> * {{MemorySegment.swapBytes}}
> ** In case of small records, using three {{UNSAFE.copyMemory}} is probably 
> not as efficient as a specialized swap, because
> *** We could use total loop unrolling in generated code (because we know the 
> exact record size)
> *** {{UNSAFE.copyMemory}} checks for alignment first \[6,9\]
> *** We will only need 2/3 the memory bandwidth, because the temporary storage 
> could be a register if we swap one byte (or one {{long}}) at a time
> ** several checks might be eliminated
> * Better inlining behaviour could be achieved 
> ** Virtual function calls to the methods of {{InMemorySorter}} could be 
> eliminated. (Note, that these are problematic to devirtualize by the JVM if 
> there are different derived classes used in a single Flink job (see \[8,7\]).)
> ** {{MemorySegment.swapBytes}} is probably not inlined currently, because the 
> excessive checks make it too large
> ** {{MemorySegment.compare}} is probably also not inlined currently, because 
> those two while loops are too large
> \[6\] http://www.docjar.com/docs/api/sun/misc/Unsafe.html#copyMemory(Object, 
> long, Object, long, long)
> \[7\] https://shipilev.net/blog/2015/black-magic-method-dispatch/
> \[8\] 
> http://insightfullogic.com/2014/May/12/fast-and-megamorphic-what-influences-method-invoca/
> \[9\] 
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/cpu/x86/vm/stubGenerator_x86_64.cpp#l2409



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-4867) Investigate code generation for improving sort performance

Reply via email to