[
https://issues.apache.org/jira/browse/FLINK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725427#comment-14725427
]
ASF GitHub Bot commented on FLINK-1730:
---------------------------------------
GitHub user sachingoel0101 opened a pull request:
https://github.com/apache/flink/pull/1083
[FLINK-1730] Persist operator on Data Sets
This PR introduces a `persist` operation on `DataSet` that keeps the data
set in memory, so that it can be accessed directly when it is operated on
repeatedly.
The idea behind the implementation is this:
1. A `PersistOperator` extending `SingleInputUdfOperator`, for the common API
and the Java API.
2. A `Persist` driver strategy which allows the Job Graph to generate a
`PersistNode`, which just uses a `NoOpDriver` to forward results from input to
output.
3. `RegularPactTask` determines whether it is running a Persist task and, if
so, uses a `SpillingResettableMutableObjectIterator` to read the input and
persist it.
4. To make the results truly persistent, the `MemorySegment`s must not be
freed when the `Task` ends. To this end, I have created a
`DummyPersistInvokable`, which does nothing except prevent the memory from
being freed.
5. All persisted memory segments are cleared out when the `MemoryManager`
is shutting down. A cache-clearing strategy could be added here.
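To illustrate steps 3 to 5, here is a minimal, self-contained sketch of the caching idea in plain Java. It is not the code from this PR: `ReplayBuffer`, `cachedCount`, and `release` are hypothetical names, the cache is a simple in-heap list rather than `MemorySegment`s, and nothing is spilled to disk. It only shows the behaviour: the first pass reads the input and caches it, later passes replay the cache, and the cache is dropped on an explicit release (standing in for the `MemoryManager` shutdown).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical stand-in for a resettable input iterator: elements are
 * cached as they are first read and replayed on every later pass.
 */
final class ReplayBuffer<T> {
    private final Iterator<T> source;          // the task's input
    private final List<T> cache = new ArrayList<>();
    private boolean released = false;

    ReplayBuffer(Iterator<T> source) {
        this.source = source;
    }

    /** Returns an iterator that always starts from the first element. */
    Iterator<T> iterator() {
        if (released) {
            throw new IllegalStateException("buffer already released");
        }
        return new Iterator<T>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < cache.size() || source.hasNext();
            }

            @Override
            public T next() {
                if (pos < cache.size()) {
                    return cache.get(pos++);   // replay a cached element
                }
                if (!source.hasNext()) {
                    throw new NoSuchElementException();
                }
                T value = source.next();       // first read: pull from input
                cache.add(value);              // ...and keep it for later passes
                pos++;
                return value;
            }
        };
    }

    /** Number of elements currently held in the cache. */
    int cachedCount() {
        return cache.size();
    }

    /** Drops the cached elements, as would happen at shutdown. */
    void release() {
        cache.clear();
        released = true;
    }
}
```

The point of the sketch is that the cache outlives any single pass over the input, which is the role the retained memory segments play in step 4.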
For testing the functionality, I have written a test `PersistITCase` which
generates 100 random Long values inside a Map function and persists the
output. Triggering the execution twice must then produce the same results.
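The reasoning behind this test can be sketched without Flink. The snippet below is an illustration only, not `PersistITCase` itself: `PersistCheck`, `randomLongs`, and `persist` are hypothetical names, and a `Supplier` stands in for a lazily executed plan. A non-deterministic source yields different values on every evaluation, so two executions can only agree if the output was actually cached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

final class PersistCheck {

    /** Non-deterministic step: produces n fresh random longs on every evaluation. */
    static Supplier<List<Long>> randomLongs(int n) {
        Random rnd = new Random();
        return () -> {
            List<Long> out = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                out.add(rnd.nextLong());
            }
            return out;
        };
    }

    /** Hypothetical persist: evaluates upstream once, then serves the cached result. */
    static <T> Supplier<T> persist(Supplier<T> upstream) {
        return new Supplier<T>() {
            private T cached;
            private boolean done;

            @Override
            public synchronized T get() {
                if (!done) {
                    cached = upstream.get();   // first execution computes the values
                    done = true;
                }
                return cached;                 // later executions reuse them
            }
        };
    }
}
```

Calling `get()` twice on `PersistCheck.persist(PersistCheck.randomLongs(100))` returns equal lists, whereas calling it twice on the unwrapped `randomLongs(100)` would almost certainly return different ones.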
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachingoel0101/flink flink-1730
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1083.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1083
----
commit a22cc670697cc601facb164f6fd84ef6438c2499
Author: Sachin Goel <[email protected]>
Date: 2015-08-24T16:07:04Z
Implemented a persist operator which caches elements into a Spilling
buffer.
----
> Add a FlinkTools.persist style method to the Data Set.
> ------------------------------------------------------
>
> Key: FLINK-1730
> URL: https://issues.apache.org/jira/browse/FLINK-1730
> Project: Flink
> Issue Type: New Feature
> Reporter: Stephan Ewen
> Priority: Minor
>
> I think this is an operation that will be needed more prominently: defining a
> point where one long logical program is broken into different executions.