GitHub user sachingoel0101 opened a pull request:

    https://github.com/apache/flink/pull/1083

    [FLINK-1730]Persist operator on Data Sets

    This PR introduces a `persist` operation on `DataSet` which allows 
persisting the data set in memory, allowing for direct access if this data set 
is operated on again and again.
    The idea behind the implementation is this:
    1. A `PersistOperator` extending a `SingleInputUdfOperator` for common api 
and Java API.
    2. A `Persist` driver strategy which allows the Job Graph to generate a 
`PersistNode`, which just uses a `NoOpDriver` to forward results from input to 
output.
    3. `RegularPactTask` determines whether it is a Persist task and 
accordingly uses a `SpillingResettableMutableObjectIterator` to read the input 
and persist them.
    4. To make the results truly persistent, the `MemorySegment`s must not be 
freed when the `Task` ends. To this end, I have created a 
`DummyPersistInvokable` which does nothing. It just prevents freeing of memory.
    5. All persisted memory segments are cleared out when the `MemoryManager` 
is shutting down. There is a possibility of writing some kind of Cache clearing 
strategy here.
    
    For testing the functionality, I have written a test `PersistITCase` which 
generates 100 random Long values inside a Map function and persisted the 
output. Then, triggering the execution twice must provide the same results.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachingoel0101/flink flink-1730

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1083.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1083
    
----
commit a22cc670697cc601facb164f6fd84ef6438c2499
Author: Sachin Goel <[email protected]>
Date:   2015-08-24T16:07:04Z

    Implemented a persist operator which caches elements into a Spilling
    buffer.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to