Hi there, I just started investigating Flink and I'm curious if I'm
approaching my issue in the right way.

My current use case is modeling a series of transformations. I start
with some transformations, and each one, when done, can yield another
transformation, a result to output to some sink, or a Join operation
that extracts data from some other data set and combines it with the
existing data (emitting a transformation that should be processed like
any other transform). The transformations and results are easy to deal
with, but the joins are more troubling.
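
For concreteness, a work item looks roughly like this (a simplified
sketch; the names and fields are illustrative, not my real types):

//Sketch of the work-item ADT and the predicates used below; illustrative only.
sealed trait Work[T] { def id: String; def payload: T }
case class Transform[T](id: String, payload: T) extends Work[T] //may yield more work
case class Result[T](id: String, payload: T) extends Work[T]    //final, goes to a sink
case class Join[T](id: String, payload: T) extends Work[T]      //needs external data

def isTransform(w: Work[String]): Boolean = w match { case _: Transform[_] => true; case _ => false }
def isResult(w: Work[String]): Boolean    = w match { case _: Result[_]    => true; case _ => false }
def isJoin(w: Work[String]): Boolean      = w match { case _: Join[_]      => true; case _ => false }
def isWork(w: Work[String]): Boolean      = !isResult(w) //anything that still needs processing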

Here is my current solution that I got to work:

//initialSolution contains all external data to join on.
initialSolution.iterateDelta(initialWorkset, 10000, Array("id")) {
  (solution: DataSet[Work[String]], workset: DataSet[Work[String]]) => {
    //handle joins separately
    val joined = handleJoins(solution, workset.filter(isJoin(_)))

    //transformations are handled separately as well
    val transformed = handleTransformations(workset.filter(isTransform(_)))

    val nextWorkSet = transformed.filter(isWork(_)).union(joined)
    val solutionUpdate = transformed.filter(isResult(_))
    (solutionUpdate, nextWorkSet)
  }
}
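
The handleJoins helper has roughly this shape (same sketch types as
above; the "id" key and the combine logic are placeholders for what I
actually do):

import org.apache.flink.api.scala._

//Sketch: match pending Join items against the external records held in the
//solution set and emit new Transforms to be processed in the next superstep.
def handleJoins(solution: DataSet[Work[String]],
                joins: DataSet[Work[String]]): DataSet[Work[String]] =
  joins.join(solution).where("id").equalTo("id") {
    (join, external) => Transform(join.id, external.payload): Work[String]
  }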

My questions are:

1. Is this the right way to use Flink? Based on the documentation
(correct me if I'm wrong), it seems that in the iterative case the
external data to be joined against should live in the solution DataSet,
so a use case with multiple external data sources to join on would
collect them all into the initial solution DataSet. Would keeping all
of this different data in the solution set have bad repercussions for
partitioning/performance?
2. Doing the joins as part of the iteration seems a bit wrong to me (I
might just be thinking about the issue in the wrong way). As an
alternative I tried to model this as a series of DataStreams, where the
code is much the same as above, but the iteration occurs on a stream T
that splits into two streams J and R: R just feeds the result sink,
while J holds the logic that joins incoming data and, once joined,
sends the result back to stream T (see the sketch after this list). But
I didn't see a good way to say "send the result of J back to T, and run
all the standard iterative logic on it" with the DataStream API. I
could manually create endpoints for these streams to hit and achieve
the behavior that way, but is there an easier way I'm missing in the
Flink API?
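
If it helps, this is the stream topology I have in mind, sketched with
the DataStream API's iterate() and the same illustrative types as above
(the source and the per-item step function are hypothetical
placeholders):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

//Hypothetical source of work items; a real job would read an actual source.
val workStream: DataStream[Work[String]] = env.fromElements(
  Transform("1", "payload"): Work[String])

//Hypothetical per-item step: apply one transform/join round (placeholder body).
def step(w: Work[String]): Work[String] = w

//iterate() returns a pair of streams: the first is fed back to the head of
//the loop (my stream T), the second leaves the loop (my result stream R).
val results: DataStream[Work[String]] = workStream.iterate {
  (t: DataStream[Work[String]]) =>
    val processed = t.map(step(_))                //one round of work
    val feedback  = processed.filter(isWork(_))   //stream J, sent back into T
    val output    = processed.filter(isResult(_)) //stream R
    (feedback, output)
}
results.print()
env.execute("iterative stream sketch")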

Thanks,
Li
