[ https://issues.apache.org/jira/browse/FLINK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834242#comment-15834242 ]
Fabian Hueske commented on FLINK-4867: -------------------------------------- Hi [~heytitle], wow these are impressive numbers! Looking forward to the design document. Best, Fabian > Investigate code generation for improving sort performance > ---------------------------------------------------------- > > Key: FLINK-4867 > URL: https://issues.apache.org/jira/browse/FLINK-4867 > Project: Flink > Issue Type: Sub-task > Components: Local Runtime > Reporter: Gabor Gevay > Assignee: Gabor Gevay > Priority: Minor > Labels: performance > Attachments: Evaluation of Code Generation Approach for improving > Flinkās NormalizedKeySorter.pdf > > > This issue is for investigating whether code generation could speed up > sorting. We should make some performance measurements on hand-written code > that is similar to what we could generate, to see whether investing more time > into this is worth it. If we find that it is worth it, we can open a second > Jira for the actual implementation of the code generation. > I think we could generate one class at places where we currently instantiate > {{QuickSort}}. This generated class would include the functionality of > {{QuickSort}}, {{NormalizedKeySorter}} or {{FixedLengthRecordSorter}}, > {{MemorySegment.compare}}, and {{MemorySegment.swap}}. > Btw. I'm planning to give this as a student project at a TU Berlin course in > the next few months. > Some concrete ideas about how could a generated sorter be faster than the > current sorting code: > * {{MemorySegment.compare}} could be specialized for > ** Length: for small records, the loop could be unrolled > ** Endiannes (currently it is optimized for big endian; and in the little > endian case (e.g. x86) it does a Long.reverseBytes for each long read) > * {{MemorySegment.swapBytes}} > ** In case of small records, using three {{UNSAFE.copyMemory}} is probably > not as efficient as a specialized swap, because > *** We could use total loop unrolling in generated code (because we know the > exact record size) > *** {{UNSAFE.copyMemory}} checks for alignment first \[6,9\] > *** We will only need 2/3 the memory bandwidth, because the temporary storage > could be a register if we swap one byte (or one {{long}}) at a time > ** several checks might be eliminated > * Better inlining behaviour could be achieved > ** Virtual function calls to the methods of {{InMemorySorter}} could be > eliminated. (Note, that these are problematic to devirtualize by the JVM if > there are different derived classes used in a single Flink job (see \[8,7\]).) > ** {{MemorySegment.swapBytes}} is probably not inlined currently, because the > excessive checks make it too large > ** {{MemorySegment.compare}} is probably also not inlined currently, because > those two while loops are too large > \[6\] http://www.docjar.com/docs/api/sun/misc/Unsafe.html#copyMemory(Object, > long, Object, long, long) > \[7\] https://shipilev.net/blog/2015/black-magic-method-dispatch/ > \[8\] > http://insightfullogic.com/2014/May/12/fast-and-megamorphic-what-influences-method-invoca/ > \[9\] > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/cpu/x86/vm/stubGenerator_x86_64.cpp#l2409 -- This message was sent by Atlassian JIRA (v6.3.4#6332)