Working with data locality in streaming using groupBy?

William Saar Fri, 05 Jun 2015 06:03:08 -0700

Hi,
How can I maintain a local state, for instance a ConcurrentHashMap, across 
different steps in a streaming chain on a single machine/process?
Static variable? (This doesn't seem to work well when running locally as it 
gets shared across multiple instances, a common "pipeline" store would be 
helpful)


Is it OK to checkpoint such a local state in a single map operation at the 
beginning of the pipeline, or does it need to be done for every function?

Will multiple groupBy steps using the same key selector output pass data to the 
same machines? (To preserve data locality)

How can I do a fold/reduce operation that only returns its result after a full 
window has been processed, even when the processing in the window includes 
streams that have been distributed and merged from different machines using 
groupBy?

My scenario is as follows
I want to build up and partition a large state across different machines by 
using groupBy on a stream. The processing occurs in a window and some 
processing needs to be done on multiple machines so I want to do additional 
groupBy operators to pass partial results to other machines. Pseudo code:

flattenedWindowStream = streamSource.groupBy(myKeySelector). // Initial 
paritioning
map(localStateSaverCheckpointMapper).  //Checkpoint that saves local state, 
just passes through the data
window(Count(100)).flatten();

localAndRemoteStream =  flattenedWindowStream.split(event -> 
canBeProcessedLocally(event) ? "local" : "remote" );
remoteStream = localAndRemoteStream.select("remote").
map(partialProcessing).  // Partially process what I can with my local state
groupBy(myKeySelector). // Send the partial processing to the machines that own 
the rest of the data
map(process);
globalResult = 
localAndRemoteStream.select("local").map(process).union(remoteStream).broadcast();
 // Broadcast all fully processed results to all machines

globalResult.fold().addSink(globalWindowOutputSink)  // fold/reduce, I want a 
result based on the full contents of the window

Any help would be greatly appreciated!

Thanks,
William

Working with data locality in streaming using groupBy?

Reply via email to