GitHub user sachingoel0101 commented on the pull request:

    https://github.com/apache/flink/pull/1083#issuecomment-138844139
  
    Hey @fhueske, thanks for your comments.
    I was not aware this was intended to allow recovery of failed jobs.
    As for reuse across different jobs in the same session, I don't see why 
this doesn't solve the issue. As long as the memory manager is alive, the 
results will be there for any job to use. 
    For true cross-job sharing, one possible feature would be to add a method 
on the environment, such as `getPersistedSource(String)`, which would create a 
source from a data set persisted by some entirely independent job.
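
    As a rough sketch of the lookup I have in mind (plain Java, not Flink 
code): only the name `getPersistedSource(String)` comes from the proposal 
above. `PersistedResults`, `persist`, and the use of a `List` as a stand-in 
for a data set are all hypothetical and only illustrate a name-keyed lookup.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a store of persisted results that outlives a job.
// This is NOT a Flink class; a List plays the role of a persisted data set.
final class PersistedResults {
    private static final Map<String, List<?>> STORE = new ConcurrentHashMap<>();

    private PersistedResults() {}

    // Called by the job that produces the data: register it under a name.
    static <T> void persist(String name, List<T> data) {
        STORE.put(name, List.copyOf(data)); // immutable snapshot
    }

    // Called by an independent job: look the data up by the same name.
    @SuppressWarnings("unchecked")
    static <T> List<T> getPersistedSource(String name) {
        List<?> data = STORE.get(name);
        if (data == null) {
            throw new IllegalStateException(
                "No persisted data set named '" + name + "'");
        }
        return (List<T>) data;
    }
}
```

    The point is only that the second job needs nothing but the name. Where 
the data actually lives, and how long it stays there, is left open.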
    
    Further, this kind of makes sense at the operator level. Users should have 
the ability to explicitly persist a data set in memory, which calls for an API 
function. I was only drawing the analogy from Spark's API. I have no idea how 
they implement this internally, but if an API function is to be provided, it 
can only be done in two ways: either return a new Operator, as a 
transformation on the original data set, or return the same data set 
[like `withBroadcastSet` does]. The former seemed easier to work with because 
it doesn't interfere with the existing mechanisms.
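
    To make the two options concrete, here is a toy model in plain Java. 
`ToyDataSet`, `PersistOperator`, `persist()`, and `withPersistence()` are 
hypothetical names for illustration only. They are not Flink's classes and 
not what the pull request contains.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a data set; NOT Flink's DataSet.
class ToyDataSet<T> {
    protected final List<T> elements;
    protected boolean persisted = false;

    ToyDataSet(List<T> elements) {
        this.elements = new ArrayList<>(elements);
    }

    // Option 1: persist is a transformation that returns a new operator.
    // The original data set is left untouched.
    PersistOperator<T> persist() {
        return new PersistOperator<>(this);
    }

    // Option 2: flag this data set and return the same instance,
    // in the style of a fluent configuration call.
    ToyDataSet<T> withPersistence() {
        this.persisted = true;
        return this;
    }

    boolean isPersisted() {
        return persisted;
    }
}

// The new node produced by option 1; it wraps its input.
class PersistOperator<T> extends ToyDataSet<T> {
    final ToyDataSet<T> input;

    PersistOperator(ToyDataSet<T> input) {
        super(input.elements);
        this.input = input;
        this.persisted = true;
    }
}
```

    With option 1, `ds.persist()` yields a separate node and `ds` itself is 
unchanged, which is why it does not touch the existing mechanisms. With 
option 2, `ds.withPersistence()` returns `ds` and changes its state.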
    
    I have implemented no new internal functionality, only used the existing 
system. I would've loved more discussion on this, but frankly, once I started 
going through the internal mechanisms, it seemed like a pretty trivial thing 
to implement. Of course, that was before I was aware it was intended to be 
used for recovery.
    If there is ongoing work on persisting intermediate results for recovery, 
the same mechanism can be used for a persist operation, in which case this 
work is moot anyway. But there has to be an API call that lets users 
explicitly cache results in memory. Not having one is a major problem I'm 
facing in implementing a randomized splitting algorithm.

