GitHub user StephanEwen opened a pull request:
https://github.com/apache/flink/pull/1093
[FLINK-1320] [core] Add an off-heap variant of the managed memory
This pull request extends the Flink managed memory to work transparently
on-heap and off-heap.
Flink's core has always contained memory management techniques that operate on
serialized data (rather than objects) in pre-reserved memory segments on the
heap. This has the huge benefit that memory usage can be controlled exactly,
and that the Garbage Collector is "tamed" by not accumulating billions of
long-lived objects on the heap.
One could say that Flink was already using off-heap style techniques with on-heap
memory. With this pull request, the system can use memory outside the heap as
an alternative to heap memory. The benefit is that this takes the managed memory
out of the GC's responsibility and vastly reduces the overall heap size (since
often 75% of the memory is used as Flink's managed memory, for sorting,
hashing, caching, ...).
That way, Flink processes can be scaled to hundreds of gigabytes of main memory.
## Usage
The pull request introduces one simple config flag:
`taskmanager.memory.off-heap: true`
- In heap mode, the TaskManager takes 0.7 of the free heap for the Memory
Manager (same as before)
- In off-heap mode, the TaskManager takes 3x the size of the heap for the
Memory Manager. This implies that the startup scripts should scale the heap
size down to 1/4th, since all big system memory consumers (network, memory
manager) now allocate outside the heap.
- One can control the off-heap memory size via `taskmanager.memory.size`
and `taskmanager.memory.off-heap-ratio`.
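As an example, a `flink-conf.yaml` enabling the off-heap mode could look like the
following sketch. The values are placeholders, and how the absolute size and the
ratio interact is part of the startup-script work listed under the open issues below.

```yaml
# Illustrative sketch only -- the values are placeholders, not recommended defaults.
taskmanager.memory.off-heap: true        # allocate Flink's managed memory off-heap
taskmanager.memory.size: 4096            # absolute size of the managed memory (in megabytes)
taskmanager.memory.off-heap-ratio: 0.75  # ratio-based alternative for sizing the off-heap memory
```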
## Design decisions
1. Since the operations on Flink's memory segments are in the innermost
loops, they are super performance critical. All code is written to be very JIT
friendly.
2. Even though the bulk of the memory may be off-heap (if that mode is
chosen), there are various points where short-lived, unpooled managed heap
memory is needed, so in the off-heap case we need to support mixed memory
(on-heap and off-heap simultaneously).
3. While benchmarking various possible implementations, I found that many
operations can be written to work transparently on on-heap and off-heap memory.
The only notable exception is individual byte accesses, which are faster on
specialized heap implementations (up to 50% faster, see the benchmarks below).
4. Because individual byte operations occur a lot for Strings and
tags/headers, I did not want to compromise on their performance.
We now have the following `MemorySegment` implementations:
  - `MemorySegment`: Implements all multi-byte type accesses (ints,
longs, doubles) and copy/compare/swap methods to work transparently on
heap and off-heap memory.
  - `HeapMemorySegment`: Optimized single-byte operations and bulk byte
transfers. Used for the on-heap case.
  - `HybridMemorySegment`: Transparent on-heap/off-heap single-byte
operations and bulk transfers, slightly slower than the `HeapMemorySegment`. Used
for the off-heap case, and can also represent the unpooled on-heap memory at the
same time.
Effectively, the performance of some off-heap memory operations is a bit
slower than that of their on-heap counterparts. I think that cannot be avoided,
and with this implementation we have a way to keep peak performance for managed
memory on the heap.
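To make the uniform heap/off-heap addressing concrete, here is a minimal sketch of
the idea (my illustration, not code from this pull request; class and field names
are my own): `sun.misc.Unsafe` accessors take an (object, offset) pair, so a
heap-backed segment can pass its `byte[]` together with the array base offset,
while an off-heap segment passes `null` together with the absolute native address.
The multi-byte accessors then need no branch between the two cases.

```java
import sun.misc.Unsafe;

// Minimal sketch, not the PR's code: one class addressing heap and off-heap
// memory uniformly via Unsafe's (object, offset) accessors.
final class HybridSegmentSketch {

    private static final Unsafe UNSAFE = acquireUnsafe();
    private static final long BYTE_ARRAY_BASE = UNSAFE.arrayBaseOffset(byte[].class);

    private byte[] heapMemory;       // non-null for heap-backed segments, null for off-heap
    private long address;            // array base offset (heap) or absolute native address (off-heap)
    private final long addressLimit; // address + segment size, used by the bound checks

    HybridSegmentSketch(byte[] heapBuffer) {
        this.heapMemory = heapBuffer;
        this.address = BYTE_ARRAY_BASE;
        this.addressLimit = this.address + heapBuffer.length;
    }

    HybridSegmentSketch(long offHeapAddress, int size) {
        this.heapMemory = null;      // Unsafe treats a null base object as an absolute address
        this.address = offHeapAddress;
        this.addressLimit = offHeapAddress + size;
    }

    long getLong(int index) {
        // identical code path for heap and off-heap memory
        return UNSAFE.getLong(heapMemory, address + index);
    }

    void putLong(int index, long value) {
        UNSAFE.putLong(heapMemory, address + index, value);
    }

    void free() {
        // invalidate the segment: later accesses fail the bound check
        // instead of touching released memory
        address = addressLimit + 1;
        heapMemory = null;
    }

    private static Unsafe acquireUnsafe() {
        try {
            java.lang.reflect.Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }
        catch (Exception e) {
            throw new RuntimeException("sun.misc.Unsafe is not accessible", e);
        }
    }
}
```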
5. For optimal JIT efficiency, many methods are marked as final. In
addition, performance increases if only one of the MemorySegment
specializations is used at a time, rather than both being mixed. That way the JIT
can optimally de-virtualize and inline methods; otherwise, it has to switch between
optimization/deoptimization or use bimorphic inlining. Since the memory segment
accesses sit in the absolute innermost loops, we care about these last few CPU
cycles. The benchmarks below illustrate this.
Direct instantiation of the memory segments is not easily possible; instead,
access must go through a `MemorySegmentFactory` that manages which
implementation to load and use.
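To illustrate the call-site argument (not code from this PR; it only assumes the
`getLong(int)` accessor of the abstract `MemorySegment` described above):

```java
import org.apache.flink.core.memory.MemorySegment;

// Illustration only: a hot loop reading through the abstract MemorySegment type.
class CallSiteIllustration {

    static long sumLongs(MemorySegment segment, int numLongs) {
        long sum = 0;
        for (int i = 0; i < numLongs; i++) {
            // If exactly one MemorySegment subclass has been loaded, this virtual
            // call site is monomorphic and the JIT can de-virtualize and inline it.
            // Once both HeapMemorySegment and HybridMemorySegment are loaded, the
            // call becomes bimorphic at best, costing extra cycles in the innermost loop.
            sum += segment.getLong(i << 3);  // 8 bytes per long
        }
        return sum;
    }
}
```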
6. The implementations of the individual memory segment functions take care
of the following (see the sketch after this list):
  - They use methods that are JIT intrinsics, i.e., that get compiled to
optimized native instructions.
  - They use collapsed checks for range checks and disposal checks.
  - They avoid branches between the heap and off-heap cases and use a uniform
code path (for primitive type accesses).
  - For single-element operations, the code paths for valid (non-exception)
cases are always the first option at branches.
  - As long as the memory segment object exists, all operations are
guaranteed to work and not lead to segmentation faults, even if the segment has
been concurrently freed.
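A minimal sketch of what such an access method could look like, reusing the
hypothetical `heapMemory`/`address`/`addressLimit` fields and the `free()`
convention from the sketch above (again my illustration, not the PR's literal code):

```java
// Illustration only -- continues the HybridSegmentSketch from above.
void putLongChecked(int index, long value) {
    final long pos = address + index;
    if (index >= 0 && pos <= addressLimit - 8) {
        // valid (non-exception) case first, and one uniform path for heap and
        // off-heap memory; the upper-bound comparison doubles as the disposal
        // check, because free() moves 'address' past 'addressLimit'
        UNSAFE.putLong(heapMemory, pos, value);
    }
    else if (address > addressLimit) {
        // the segment was freed: fail cleanly instead of touching released memory
        throw new IllegalStateException("segment has been freed");
    }
    else {
        throw new IndexOutOfBoundsException("index: " + index);
    }
}
```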
## Tests
More than 6k lines of tests in total, for all types of memory segments and
memory (heap, hybrid-on-heap, hybrid-off-heap) and the interoperability between
them. Tests include regular operation, attempts to access released memory, bounds
violations, interoperability with other heap and off-heap memory, and checks for
correct exception types.
Test coverage reports show that virtually all conditionals are tested,
except for the ones that are effective only on big-endian processor
architectures or are otherwise optimized away by class loading/JIT.
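To illustrate the "attempts on released memory" and "correct exception types"
categories, a test of that kind could look roughly as follows. This is a sketch
on my side, not one of the actual tests, and the exact factory call for
allocating a segment may differ in this PR.

```java
import static org.junit.Assert.fail;

import org.apache.flink.core.memory.MemorySegment;
import org.apache.flink.core.memory.MemorySegmentFactory;
import org.junit.Test;

public class FreedSegmentSketchTest {

    @Test
    public void readAfterFreeFailsWithIllegalState() {
        // the allocation call is an assumption on my side; the point is the
        // expected behavior: any access after free() must fail with an
        // IllegalStateException rather than touching released memory
        MemorySegment segment = MemorySegmentFactory.allocateUnpooledSegment(4096);
        segment.free();
        try {
            segment.getLong(0);
            fail("expected an IllegalStateException on a freed segment");
        }
        catch (IllegalStateException e) {
            // expected: correct exception type for an attempt on released memory
        }
    }
}
```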
## Open issues
- The TaskManager startup script needs to be adjusted to interpret the
`taskmanager.memory.off-heap` config entry and adjust the heap size and
off-heap size according to `taskmanager.memory.size` and
`taskmanager.memory.off-heap-ratio`.
- Has not yet been integrated with the YarnTaskManagerRunner.
- Needs some cluster testing
## Microbenchmarks of the Memory Segments
These microbenchmarks test the performance of the memory segments on
various operations.
The code is under
`flink-core/src/test/java/org/apache/flink/core/memory/benchmarks/MemorySegmentSpeedBenchmark.java`
- Oracle Java 8 (1.8.0_25)
- 4GB heap (the experiments need 2.2 GB)
- Intel Core i7-4700MQ CPU, 2.40GHz (4 cores, 8 hardware contexts)
- All experiments were run 5x, discarding the fastest and slowest run, and
then averaged. This compensates for the delay before the JIT kicks in.
- Each run tests the different implementations in different orders, to
balance the advantage/disadvantage of the JIT specializing towards certain code
paths.
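For orientation, each measurement below is of roughly the following shape (a
simplified sketch of the timing loop, not the actual benchmark code, which lives
in the file referenced above):

```java
import org.apache.flink.core.memory.MemorySegment;

// Simplified sketch of one measurement, e.g. "writing 100000 x 4096 longs
// to a 32768 bytes segment" -- not the actual benchmark code.
class SpeedSketch {

    static long timeLongWrites(MemorySegment segment, int rounds, int segmentSize) {
        final long start = System.nanoTime();
        for (int round = 0; round < rounds; round++) {
            for (int offset = 0; offset <= segmentSize - 8; offset += 8) {
                segment.putLong(offset, 0xC0FFEEL);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;  // elapsed milliseconds
    }
}
```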
The tested implementations are:
- `HeapMemorySegment` (exclusive): The case where it is the only loaded
MemorySegment class.
- `HeapMemorySegment` (mixed): The case where both the
`HeapMemorySegment` and the `HybridMemorySegment` are loaded.
- `HybridMemorySegment` (heap-exclusive): Backed by heap memory, and the
case where it is the only loaded MemorySegment class.
- `HybridMemorySegment` (heap-mixed): Backed by heap memory, and the case
where both the `HeapMemorySegment` and the `HybridMemorySegment` are loaded.
- `HybridMemorySegment` (off-heap-exclusive): Backed by off-heap memory,
and the case where it is the only loaded MemorySegment class.
- `HybridMemorySegment` (off-heap-mixed): Backed by off-heap memory, and
the case where both the `HeapMemorySegment` and the `HybridMemorySegment` are
loaded.
- `PureHeapSegment`, which has no class hierarchy and no virtual methods at
all.
- `PureHybridSegment` (heap), which has no class hierarchy and no virtual
methods at all, backed by heap memory.
- `PureHybridSegment` (off-heap), which has no class hierarchy and no
virtual methods at all, backed by off-heap memory.
A summary of the experiments:
- The hybrid memory segment performs equally well on heap and off-heap
memory.
- The heap memory segments are quite a bit faster at reading individual
bytes, but not much faster at writing them.
- The abstract class MemorySegment (with its subclasses
`HeapMemorySegment` and `HybridMemorySegment`) performs as well as any
specialized non-abstract class, as long as only one subclass is loaded. When
both are loaded, performance may suffer by a factor of more than 2x on certain
operations.
- How badly the performance degrades seems to depend a lot on the order in
which the subclasses are exercised; sometimes performance is affected more than
at other times. This motivated the design decision to try to load at most one
subclass of the memory segment, to get the best and most predictable
performance.
### Byte accesses
**Writing 100000 x 32768 bytes to 32768 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, exclusive | 1,441 msecs
`HeapMemorySegment`, mixed | 3,841 msecs
`HybridMemorySegment`, heap, exclusive | 1,626 msecs
`HybridMemorySegment`, off-heap, exclusive | 1,628 msecs
`HybridMemorySegment`, heap, mixed | 3,848 msecs
`HybridMemorySegment`, off-heap, mixed | 3,847 msecs
`PureHeapSegment` | 1,442 msecs
`PureHybridSegment`, heap | 1,623 msecs
`PureHybridSegment`, off-heap | 1,620 msecs
**Reading 100000 x 32768 bytes from 32768 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, exclusive | 1,326 msecs
`HeapMemorySegment`, mixed | 1,378 msecs
`HybridMemorySegment`, heap, exclusive | 2,029 msecs
`HybridMemorySegment`, off-heap, exclusive | 2,030 msecs
`HybridMemorySegment`, heap, mixed | 2,047 msecs
`HybridMemorySegment`, off-heap, mixed | 2,049 msecs
`PureHeapSegment` | 1,331 msecs
`PureHybridSegment`, heap | 2,030 msecs
`PureHybridSegment`, off-heap | 2,030 msecs
**Writing 10 x 1073741824 bytes to 1073741824 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, exclusive | 5,602 msecs
`HeapMemorySegment`, mixed | 12,570 msecs
`HybridMemorySegment`, heap, exclusive | 5,391 msecs
`HybridMemorySegment`, off-heap, exclusive | 5,391 msecs
`HybridMemorySegment`, heap, mixed | 12,566 msecs
`HybridMemorySegment`, off-heap, mixed | 12,556 msecs
`PureHeapSegment` | 5,599 msecs
`PureHybridSegment`, heap | 5,387 msecs
`PureHybridSegment`, off-heap | 5,381 msecs
**Reading 10 x 1073741824 bytes from 1073741824 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, exclusive | 4,243 msecs
`HeapMemorySegment`, mixed | 4,265 msecs
`HybridMemorySegment`, heap, exclusive | 6,730 msecs
`HybridMemorySegment`, off-heap, exclusive | 6,725 msecs
`HybridMemorySegment`, heap, mixed | 6,933 msecs
`HybridMemorySegment`, off-heap, mixed | 6,926 msecs
`PureHeapSegment` | 4,247 msecs
`PureHybridSegment`, heap | 6,919 msecs
`PureHybridSegment`, off-heap | 6,916 msecs
### Byte Array accesses
**Writing 100000 x 32 byte[1024] to 32768 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 164 msecs
`HybridMemorySegment`, heap, mixed | 163 msecs
`HybridMemorySegment`, off-heap, mixed | 163 msecs
`PureHeapSegment` | 165 msecs
`PureHybridSegment`, heap | 182 msecs
`PureHybridSegment`, off-heap | 176 msecs
**Reading 100000 x 32 byte[1024] from 32768 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 157 msecs
`HybridMemorySegment`, heap, mixed | 155 msecs
`HybridMemorySegment`, off-heap, mixed | 162 msecs
`PureHeapSegment` | 161 msecs
`PureHybridSegment`, heap | 175 msecs
`PureHybridSegment`, off-heap | 179 msecs
**Writing 10 x 1048576 byte[1024] to 1073741824 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 1,164 msecs
`HybridMemorySegment`, heap, mixed | 1,173 msecs
`HybridMemorySegment`, off-heap, mixed | 1,157 msecs
`PureHeapSegment` | 1,169 msecs
`PureHybridSegment`, heap | 1,174 msecs
`PureHybridSegment`, off-heap | 1,166 msecs
**Reading 10 x 1048576 byte[1024] from 1073741824 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 854 msecs
`HybridMemorySegment`, heap, mixed | 853 msecs
`HybridMemorySegment`, off-heap, mixed | 854 msecs
`PureHeapSegment` | 857 msecs
`PureHybridSegment`, heap | 896 msecs
`PureHybridSegment`, off-heap | 887 msecs
### Long integer accesses
*(note that the heap and off-heap segments use the same or comparable code
for this)*
**Writing 100000 x 4096 longs to 32768 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 221 msecs
`HybridMemorySegment`, heap, mixed | 222 msecs
`HybridMemorySegment`, off-heap, mixed | 221 msecs
`PureHeapSegment` | 194 msecs
`PureHybridSegment`, heap | 220 msecs
`PureHybridSegment`, off-heap | 221 msecs
**Reading 100000 x 4096 longs from 32768 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 233 msecs
`HybridMemorySegment`, heap, mixed | 232 msecs
`HybridMemorySegment`, off-heap, mixed | 231 msecs
`PureHeapSegment` | 232 msecs
`PureHybridSegment`, heap | 232 msecs
`PureHybridSegment`, off-heap | 233 msecs
**Writing 10 x 134217728 longs to 1073741824 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 1,120 msecs
`HybridMemorySegment`, heap, mixed | 1,120 msecs
`HybridMemorySegment`, off-heap, mixed | 1,115 msecs
`PureHeapSegment` | 1,148 msecs
`PureHybridSegment`, heap | 1,116 msecs
`PureHybridSegment`, off-heap | 1,113 msecs
**Reading 10 x 134217728 longs from 1073741824 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 1,097 msecs
`HybridMemorySegment`, heap, mixed | 1,099 msecs
`HybridMemorySegment`, off-heap, mixed | 1,093 msecs
`PureHeapSegment` | 917 msecs
`PureHybridSegment`, heap | 1,105 msecs
`PureHybridSegment`, off-heap | 1,097 msecs
### Integer accesses
*(note that the heap and off-heap segments use the same or comparable code
for this)*
**Writing 100000 x 8192 ints to 32768 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 578 msecs
`HybridMemorySegment`, heap, mixed | 580 msecs
`HybridMemorySegment`, off-heap, mixed | 576 msecs
`PureHeapSegment` | 624 msecs
`PureHybridSegment`, heap | 576 msecs
`PureHybridSegment`, off-heap | 578 msecs
**Reading 100000 x 8192 ints from 32768 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 464 msecs
`HybridMemorySegment`, heap, mixed | 464 msecs
`HybridMemorySegment`, off-heap, mixed | 465 msecs
`PureHeapSegment` | 463 msecs
`PureHybridSegment`, heap | 464 msecs
`PureHybridSegment`, off-heap | 463 msecs
**Writing 10 x 268435456 ints to 1073741824 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 2,187 msecs
`HybridMemorySegment`, heap, mixed | 2,161 msecs
`HybridMemorySegment`, off-heap, mixed | 2,152 msecs
`PureHeapSegment` | 2,770 msecs
`PureHybridSegment`, heap | 2,161 msecs
`PureHybridSegment`, off-heap | 2,157 msecs
**Reading 10 x 268435456 ints from 1073741824 bytes segment**
Segment | Time
------------------------------------------ | -------------
`HeapMemorySegment`, mixed | 1,782 msecs
`HybridMemorySegment`, heap, mixed | 1,783 msecs
`HybridMemorySegment`, off-heap, mixed | 1,774 msecs
`PureHeapSegment` | 1,501 msecs
`PureHybridSegment`, heap | 1,774 msecs
`PureHybridSegment`, off-heap | 1,771 msecs
There is still a bit of mystery left, specifically why code is sometimes
faster when it performs more checks (has more instructions and an additional
branch). Even though the branch is perfectly predictable, this seems
counter-intuitive. The only explanation I could come up with is that the
JIT does a better register allocation in that case (for whatever reason, maybe
the instructions just fit the algorithm better).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/StephanEwen/incubator-flink mem_mixed
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1093.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1093
----
commit 2c359d7cb83daa990e4fe47c69c6a8bae018e534
Author: Stephan Ewen <[email protected]>
Date: 2015-08-30T20:36:46Z
[FLINK-1320] [core] Add an off-heap variant of the managed memory
This closes #290
----