GitHub user sachingoel0101 opened a pull request:
https://github.com/apache/flink/pull/1083
[FLINK-1730]Persist operator on Data Sets
This PR introduces a `persist` operation on `DataSet` which allows
persisting the data set in memory, allowing for direct access if this data set
is operated on again and again.
The idea behind the implementation is this:
1. A `PersistOperator` extending a `SingleInputUdfOperator` for common api
and Java API.
2. A `Persist` driver strategy which allows the Job Graph to generate a
`PersistNode`, which just uses a `NoOpDriver` to forward results from input to
output.
3. `RegularPactTask` determines whether it is a Persist task and
accordingly uses a `SpillingResettableMutableObjectIterator` to read the input
and persist them.
4. To make the results truly persistent, the `MemorySegment`s must not be
freed when the `Task` ends. To this end, I have created a
`DummyPersistInvokable` which does nothing. It just prevents freeing of memory.
5. All persisted memory segments are cleared out when the `MemoryManager`
is shutting down. There is a possibility of writing some kind of Cache clearing
strategy here.
For testing the functionality, I have written a test `PersistITCase` which
generates 100 random Long values inside a Map function and persisted the
output. Then, triggering the execution twice must provide the same results.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachingoel0101/flink flink-1730
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1083.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1083
----
commit a22cc670697cc601facb164f6fd84ef6438c2499
Author: Sachin Goel <[email protected]>
Date: 2015-08-24T16:07:04Z
Implemented a persist operator which caches elements into a Spilling
buffer.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---