Hi, we're stacking multiple RDD operations on each other, for example as a source we have a RDD[List[String]] like
["a", "b, c", "d"] ["a", "d, a", "d"] In the first step we split the second column in two columns, in the next step we filter the data on column 3 = "c" and in the final step we're doing something else. The point is that it needs to be flexible (the user adds custom transformations and they are stacked on top of each other like the example above, which transformations are added by the user is therefor not known upfront). The transformations itself are no problems but we want to keep track of the added columns, the dropped columns and the updated columns. In the example above, the second column is dropped and two new columns are added. The intermediate result here will be ["a", "b", "c", "d"] ["a", "d", "a", "d"] And the final result will be ["a", "b", "c", "d"] What I would like to know is after each transformation which columns are added, which colunms are dropped and which ones are updated.This is information that's needed to execute the next transformation. I was thinking of two possible scenario's: 1. capture the metadata and store that in the RDD, effectively creating a RDD[List[String], List[Column], List[Column], List[Column]) object. Where the last three List[Column] contain the new, dropped or updated columns. This will result in an RDD with a lot of extra information on each row. That information is not needed on each row but rather one time for the whole split transformation 2. use accumulators to store the new, updated and dropped columns. But I don't think this is feasible Are there any better scenario's or how could I accomplish such a scenario? thanks in advance, Richard