GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/1093

    [FLINK-1320] [core] Add an off-heap variant of the managed memory

    This pull request extends Flink's managed memory to work transparently on-heap and off-heap.
    
    Flink's core has always used memory management techniques that operate on serialized data (rather than objects) in pre-reserved memory segments on the heap. This has the huge benefit that memory usage can be controlled exactly, and that the Garbage Collector is "tamed" by not accumulating billions of long-lived objects on the heap.
    
    One could say that Flink was using off-heap style techniques with on-heap memory. With this pull request, the system can use memory outside the heap as an alternative to heap memory. The benefit is that this takes the managed memory out of the GC's responsibility and vastly reduces the overall heap size (since often 75% of the memory is used as Flink's managed memory, for sorting, hashing, caching, ...).
    
    That way, Flink processes can be scaled to 100s of Gigabytes of main memory.
    
    
    ## Usage
    
    The pull request introduces one simple config flag: 
`taskmanager.memory.off-heap: true`
    
      - In heap mode, the TaskManager takes 0.7 of the free heap for the Memory Manager (same as before).
      - In off-heap mode, the TaskManager takes 3x the size of the heap for the Memory Manager. This implies that the startup scripts should scale the heap size down to 1/4th of the total memory, since all big system memory consumers (network, memory manager) now allocate outside the heap. For example, with 4 GB of memory for the process, the heap would be set to 1 GB and 3 GB would be managed off-heap.
      - One can control the off-heap memory size via `taskmanager.memory.size` and `taskmanager.memory.off-heap-ratio`.
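
    For illustration, a `flink-conf.yaml` using these settings could look like the following (the values are hypothetical examples, not defaults):

    ```yaml
    # Allocate Flink's managed memory outside the heap (new flag in this PR)
    taskmanager.memory.off-heap: true

    # Optionally fix the size of the managed memory instead of deriving it
    # from the heap size (hypothetical example value)
    taskmanager.memory.size: 4096

    # Ratio of off-heap memory to heap size, used when no explicit size is
    # set (hypothetical example value, matching the 3x described above)
    taskmanager.memory.off-heap-ratio: 3.0
    ```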
    
    
    ## Design decisions
    
    1. Since the operations on Flink's memory segments are in the innermost 
loops, they are super performance critical. All code is written to be very JIT 
friendly.
    
    2. Even though the bulk of the memory may be off-heap (if that mode is chosen), there are various points where short-lived, unpooled managed heap memory is needed. So in the off-heap case, we need to support mixed memory (on-heap and off-heap simultaneously).
    
    3. While benchmarking various possible implementations, I found that many operations can be written to work transparently on on-heap and off-heap memory. The only notable exception is individual byte accesses, which are faster on specialized heap implementations (up to 50% faster, see the benchmarks below).
    
    4. Because individual byte operations occur a lot for Strings and 
tags/headers, I did not want to compromise on their performance.
    
      We now have the following MemorySegment implementations:
    
        - `MemorySegment`: Implements all multi-byte type accesses (ints, longs, doubles) and copy/compare/swap methods to work transparently on heap/off-heap memory.
      
        - `HeapMemorySegment`: Optimized single-byte operations and bulk byte transfers. Used for the on-heap case.
    
        - `HybridMemorySegment`: Transparent on-heap/off-heap single-byte operations and bulk transfers, slightly slower than the `HeapMemorySegment`. Used for the off-heap case; it can also represent the unpooled on-heap memory at the same time.
    
      Effectively, the performance of some off-heap memory operations is a bit slower than that of on-heap operations. I think that cannot be avoided, and with this implementation, we have a way to keep peak performance for managed memory on the heap.
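
      To make the split concrete, here is a minimal sketch of the hierarchy as described above (hypothetical and simplified; the real classes have many more methods):

      ```java
      // Sketch only: hypothetical, simplified versions of the classes named above.
      abstract class MemorySegment {
          protected final byte[] heapMemory;  // backing array, or null when off-heap
          protected final int size;

          protected MemorySegment(byte[] heapMemory, int size) {
              this.heapMemory = heapMemory;
              this.size = size;
          }

          // In the real classes, all multi-byte accesses (ints, longs, doubles)
          // and copy/compare/swap are implemented here, once, for both kinds of
          // memory; see the getLong sketch under point 6 below.

          // Single-byte and bulk byte operations are left to the subclasses,
          // because they profit the most from specialization.
          public abstract byte get(int index);

          public abstract void put(int index, byte b);
      }

      // Fastest single-byte operations; strictly backed by a byte[].
      final class HeapMemorySegment extends MemorySegment {
          HeapMemorySegment(byte[] memory) {
              super(memory, memory.length);
          }

          @Override
          public byte get(int index) {
              return heapMemory[index];  // plain array access, JVM bounds checks
          }

          @Override
          public void put(int index, byte b) {
              heapMemory[index] = b;
          }
      }

      // HybridMemorySegment (not sketched here) implements get/put via
      // sun.misc.Unsafe, so one uniform code path serves byte[]-backed and
      // off-heap memory alike, at slightly lower single-byte speed.
      ```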
      
    5. For optimal JIT efficiency, many methods are marked as final. In addition, performance increases if only one of the MemorySegment specializations is used at a time, and the two are not mixed. That way, the JIT can optimally de-virtualize and inline methods. Otherwise, it has to alternate between optimizing and deoptimizing, or fall back to bimorphic inlining. Since the memory segment accesses are in the absolute innermost loops, we care about these last few CPU cycles. The benchmarks below illustrate this.
    
      Direct instantiation of the memory segments is not easily possible; instead, access must go through a `MemorySegmentFactory` that manages which implementation to load and use.
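
      For example, allocating and using a segment through the factory could look like this (a hypothetical sketch; the factory method names are illustrative, not confirmed signatures):

      ```java
      // Hypothetical sketch: allocation goes through the factory, which decides
      // which MemorySegment subclass to instantiate, so that ideally only one
      // implementation is ever loaded in the JVM.
      MemorySegment segment = MemorySegmentFactory.allocateUnpooledSegment(32 * 1024);

      segment.putLong(0, 42L);           // multi-byte access, uniform code path
      long value = segment.getLong(0);

      // Wrapping existing heap memory, e.g. for short-lived unpooled buffers
      MemorySegment wrapped = MemorySegmentFactory.wrap(new byte[1024]);
      ```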
    
    6. The implementations of the individual memory segment functions take care of the following (see the sketch after this list):
        - Use methods that are JIT intrinsics, i.e., that get compiled to optimized native instructions.
        - Use collapsed checks that combine the range check and the segment-disposal check.
        - Avoid branches between the heap and off-heap cases; use a uniform code path instead (for primitive type accesses).
        - For single-element operations, the code paths for valid (non-exception) cases are always the first option at branches.
        - As long as the memory segment object exists, all operations are guaranteed to either complete or throw an exception, and never lead to segmentation faults, even if the segment has been concurrently freed.
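
      As an illustration of these points, a single long read could look like the following sketch (simplified; `UNSAFE`, `address`, `addressLimit`, and `heapMemory` are assumed fields, and freeing a segment is assumed to set `address` past `addressLimit`):

      ```java
      public final long getLong(int index) {
          final long pos = address + index;
          // Collapsed check: the range check also catches the "already freed"
          // case (freeing moves address past addressLimit), and the valid
          // path is the first option at the branch.
          if (index >= 0 && pos <= addressLimit - 8) {
              // Unsafe.getLong is a JIT intrinsic. For off-heap segments,
              // heapMemory is null and 'pos' is an absolute address; for heap
              // segments, it is the offset into the array. One uniform path.
              return UNSAFE.getLong(heapMemory, pos);
          }
          else if (address > addressLimit) {
              throw new IllegalStateException("segment has been freed");
          }
          else {
              throw new IndexOutOfBoundsException("index: " + index);
          }
      }
      ```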
    
    
    ## Tests 
    
    More than 6k lines of tests in total, covering all types of memory segments and memory (heap, hybrid-on-heap, hybrid-off-heap) and the interoperability between them. The tests include regular operation, access attempts on released memory, bounds violations, interoperability with other heap and off-heap memory, and checks for correct exception types.
    
    The test coverage report shows that virtually all conditionals are tested, except for the ones that take effect only on big-endian processor architectures and are otherwise optimized away by class loading/JIT.
    
    
    ## Open issues
    
      - The TaskManager startup script needs to be adjusted to interpret the 
`taskmanager.memory.off-heap` config entry and adjust the heap size and 
off-heap size according to `taskmanager.memory.size` and 
`taskmanager.memory.off-heap-ratio`.
      - Has not yet been integrated with the YarnTaskManagerRunner.
      - Needs some cluster testing.
    
    ## Microbenchmarks of the Memory Segments
    
    These microbenchmarks test the performance of the memory segments on various operations.
    The code is under 
`flink-core/src/test/java/org/apache/flink/core/memory/benchmarks/MemorySegmentSpeedBenchmark.java`
    
      - Oracle Java 8 (1.8.0_25)
      - 4GB heap (the experiments need 2.2 GB)
      - Intel Core i7-4700MQ CPU, 2.40GHz (4 cores, 8 hardware contexts)
    
      - All experiments were run 5x, discarding the fastest and slowest run, and then averaged. This compensates for the delay before the JIT kicks in (see the sketch after this list).
    
      - Each run tests the different implementations in different orders, to 
balance the advantage/disadvantage of the JIT specializing towards certain code 
paths.
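
    A timing harness following this methodology could look like the sketch below (a hypothetical helper, not the benchmark code in the repository):

    ```java
    // Hypothetical sketch: run 5 times, drop the fastest and slowest run,
    // and average the remaining three to compensate for JIT warm-up.
    static long averagedRunTimeNanos(Runnable benchmark) {
        final int runs = 5;
        final long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            final long start = System.nanoTime();
            benchmark.run();
            times[i] = System.nanoTime() - start;
        }
        java.util.Arrays.sort(times);
        long sum = 0;
        for (int i = 1; i < runs - 1; i++) {  // skip index 0 and runs-1
            sum += times[i];
        }
        return sum / (runs - 2);
    }
    ```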
    
    The tested implementations are:
    
      - `HeapMemorySegment` (exclusive): The case where it is the only loaded 
MemorySegment class.
      - `HeapMemorySegment` (mixed): The case where both the 
`HeapMemorySegment` and the `HybridMemorySegment` are loaded.
      - `HybridMemorySegment` (heap-exclusive): Backed by heap memory, and the 
case where it is the only loaded MemorySegment class.
      - `HybridMemorySegment` (heap-mixed): Backed by heap memory, and the case 
where both the `HeapMemorySegment` and the `HybridMemorySegment` are loaded.
      - `HybridMemorySegment` (off-heap-exclusive): Backed by off-heap memory, 
and the case where it is the only loaded MemorySegment class.
      - `HybridMemorySegment` (off-heap-mixed): Backed by off-heap memory, and the case where both the `HeapMemorySegment` and the `HybridMemorySegment` are loaded.
      - `PureHeapSegment`, which has no class hierarchy or virtual methods at all.
      - `PureHybridSegment` (heap), which has no class hierarchy or virtual methods at all, backed by heap memory.
      - `PureHybridSegment` (off-heap), which has no class hierarchy or virtual methods at all, backed by off-heap memory.
    
    
    A summary of the experiments:
      - The hybrid memory segment performs equally well on heap and off-heap memory.
      - The heap memory segments are quite a bit faster at reading individual bytes, though not much faster at writing them.
      - The abstract class `MemorySegment` (with its subclasses `HeapMemorySegment` and `HybridMemorySegment`) performs as well as any specialized non-abstract class, as long as only one subclass is loaded. When both are loaded, performance may suffer by a factor of more than 2x on certain operations.
      - How badly the performance degrades seems to depend a lot on the order in which the subclasses are tested; sometimes performance is affected more than other times. This led to the design decision to load at most one subclass of the memory segment, to get the best and most predictable performance.
    
      
    ### Byte accesses
    
    **Writing 100000 x 32768 bytes to 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, exclusive             | 1,441 msecs
    `HeapMemorySegment`, mixed                 | 3,841 msecs
    `HybridMemorySegment`, heap, exclusive     | 1,626 msecs
    `HybridMemorySegment`, off-heap, exclusive | 1,628 msecs
    `HybridMemorySegment`, heap, mixed         | 3,848 msecs
    `HybridMemorySegment`, off-heap, mixed     | 3,847 msecs
    `PureHeapSegment`                          | 1,442 msecs
    `PureHybridSegment`, heap                  | 1,623 msecs
    `PureHybridSegment`, off-heap              | 1,620 msecs
    
    **Reading 100000 x 32768 bytes from 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, exclusive             | 1,326 msecs
    `HeapMemorySegment`, mixed                 | 1,378 msecs
    `HybridMemorySegment`, heap, exclusive     | 2,029 msecs
    `HybridMemorySegment`, off-heap, exclusive | 2,030 msecs
    `HybridMemorySegment`, heap, mixed         | 2,047 msecs
    `HybridMemorySegment`, off-heap, mixed     | 2,049 msecs
    `PureHeapSegment`                          | 1,331 msecs
    `PureHybridSegment`, heap                  | 2,030 msecs
    `PureHybridSegment`, off-heap              | 2,030 msecs
    
    **Writing 10 x 1073741824 bytes to 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, exclusive             | 5,602 msecs
    `HeapMemorySegment`, mixed                 | 12,570 msecs
    `HybridMemorySegment`, heap, exclusive     | 5,391 msecs
    `HybridMemorySegment`, off-heap, exclusive | 5,391 msecs
    `HybridMemorySegment`, heap, mixed         | 12,566 msecs
    `HybridMemorySegment`, off-heap, mixed     | 12,556 msecs
    `PureHeapSegment`                          | 5,599 msecs
    `PureHybridSegment`, heap                  | 5,387 msecs
    `PureHybridSegment`, off-heap              | 5,381 msecs
    
    **Reading 10 x 1073741824 bytes from 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, exclusive             | 4,243 msecs
    `HeapMemorySegment`, mixed                 | 4,265 msecs
    `HybridMemorySegment`, heap, exclusive     | 6,730 msecs
    `HybridMemorySegment`, off-heap, exclusive | 6,725 msecs
    `HybridMemorySegment`, heap, mixed         | 6,933 msecs
    `HybridMemorySegment`, off-heap, mixed     | 6,926 msecs
    `PureHeapSegment`                          | 4,247 msecs
    `PureHybridSegment`, heap                  | 6,919 msecs
    `PureHybridSegment`, off-heap              | 6,916 msecs
    
    ### Byte Array accesses
    
    **Writing 100000 x 32 byte[1024] to 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 164 msecs
    `HybridMemorySegment`, heap, mixed         | 163 msecs
    `HybridMemorySegment`, off-heap, mixed     | 163 msecs
    `PureHeapSegment`                          | 165 msecs
    `PureHybridSegment`, heap                  | 182 msecs
    `PureHybridSegment`, off-heap              | 176 msecs
    
    **Reading 100000 x 32 byte[1024] from 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 157 msecs
    `HybridMemorySegment`, heap, mixed         | 155 msecs
    `HybridMemorySegment`, off-heap, mixed     | 162 msecs
    `PureHeapSegment`                          | 161 msecs
    `PureHybridSegment`, heap                  | 175 msecs
    `PureHybridSegment`, off-heap              | 179 msecs
    
    **Writing 10 x 1048576 byte[1024] to 1073741824 bytes segment** 
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 1,164 msecs
    `HybridMemorySegment`, heap, mixed         | 1,173 msecs
    `HybridMemorySegment`, off-heap, mixed     | 1,157 msecs
    `PureHeapSegment`                          | 1,169 msecs
    `PureHybridSegment`, heap                  | 1,174 msecs
    `PureHybridSegment`, off-heap              | 1,166 msecs
    
    **Reading 10 x 1048576 byte[1024] from 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 854 msecs
    `HybridMemorySegment`, heap, mixed         | 853 msecs
    `HybridMemorySegment`, off-heap, mixed     | 854 msecs
    `PureHeapSegment`                          | 857 msecs
    `PureHybridSegment`, heap                  | 896 msecs
    `PureHybridSegment`, off-heap              | 887 msecs
    
    
    ### Long integer accesses
    
    *(note that the heap and off-heap segments use the same or comparable code 
for this)*
    
    **Writing 100000 x 4096 longs to 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 221 msecs
    `HybridMemorySegment`, heap, mixed         | 222 msecs
    `HybridMemorySegment`, off-heap, mixed     | 221 msecs
    `PureHeapSegment`                          | 194 msecs
    `PureHybridSegment`, heap                  | 220 msecs
    `PureHybridSegment`, off-heap              | 221 msecs
    
    **Reading 100000 x 4096 longs from 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 233 msecs
    `HybridMemorySegment`, heap, mixed         | 232 msecs
    `HybridMemorySegment`, off-heap, mixed     | 231 msecs
    `PureHeapSegment`                          | 232 msecs
    `PureHybridSegment`, heap                  | 232 msecs
    `PureHybridSegment`, off-heap              | 233 msecs
    
    **Writing 10 x 134217728 longs to 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 1,120 msecs
    `HybridMemorySegment`, heap, mixed         | 1,120 msecs
    `HybridMemorySegment`, off-heap, mixed     | 1,115 msecs
    `PureHeapSegment`                          | 1,148 msecs
    `PureHybridSegment`, heap                  | 1,116 msecs
    `PureHybridSegment`, off-heap              | 1,113 msecs
    
    **Reading 10 x 134217728 longs from 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 1,097 msecs
    `HybridMemorySegment`, heap, mixed         | 1,099 msecs
    `HybridMemorySegment`, off-heap, mixed     | 1,093 msecs
    `PureHeapSegment`                          | 917 msecs
    `PureHybridSegment`, heap                  | 1,105 msecs
    `PureHybridSegment`, off-heap              | 1,097 msecs
    
    
    ### Integer accesses
    
    *(note that the heap and off-heap segments use the same or comparable code 
for this)*
    
    **Writing 100000 x 8192 ints to 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 578 msecs
    `HybridMemorySegment`, heap, mixed         | 580 msecs
    `HybridMemorySegment`, off-heap, mixed     | 576 msecs
    `PureHeapSegment`                          | 624 msecs
    `PureHybridSegment`, heap                  | 576 msecs
    `PureHybridSegment`, off-heap              | 578 msecs
    
    **Reading 100000 x 8192 ints from 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 464 msecs
    `HybridMemorySegment`, heap, mixed         | 464 msecs
    `HybridMemorySegment`, off-heap, mixed     | 465 msecs
    `PureHeapSegment`                          | 463 msecs
    `PureHybridSegment`, heap                  | 464 msecs
    `PureHybridSegment`, off-heap              | 463 msecs
    
    **Writing 10 x 268435456 ints to 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 2,187 msecs
    `HybridMemorySegment`, heap, mixed         | 2,161 msecs
    `HybridMemorySegment`, off-heap, mixed     | 2,152 msecs
    `PureHeapSegment`                          | 2,770 msecs
    `PureHybridSegment`, heap                  | 2,161 msecs
    `PureHybridSegment`, off-heap              | 2,157 msecs
    
    **Reading 10 x 268435456 ints from 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 1,782 msecs
    `HybridMemorySegment`, heap, mixed         | 1,783 msecs
    `HybridMemorySegment`, off-heap, mixed     | 1,774 msecs
    `PureHeapSegment`                          | 1,501 msecs
    `PureHybridSegment`, heap                  | 1,774 msecs
    `PureHybridSegment`, off-heap              | 1,771 msecs
    
    
    
    There is still a bit of mystery left, specifically why code is sometimes faster when it performs more checks (i.e., has more instructions and an additional branch). Even though the branch is perfectly predictable, this seems counter-intuitive. The only explanation I could come up with is that the JIT does a better register allocation in that case (for whatever reason; maybe the instructions just happen to fit the algorithm better).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink mem_mixed

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1093.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1093
    
----
commit 2c359d7cb83daa990e4fe47c69c6a8bae018e534
Author: Stephan Ewen <se...@apache.org>
Date:   2015-08-30T20:36:46Z

    [FLINK-1320] [core] Add an off-heap variant of the managed memory
    
    This closes #290

----

