Hi,

Well yes, I could probably make it work with a constant number of operators 
(and consequently buffers) by developing specific input and output classes, and 
that way I'd have a workaround for the buffer issue.

The size of my job is input-dependent mostly because my code creates one full 
processing chain per input folder (there is one folder per hour of data) and 
that processing chain has many functions. The number of input folders is an 
argument to the software and can vary from 1 to hundreds (when we re-compute 
something like a month of data).

for (String folder : folders) {
        // one independent job per input folder
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.readTextFile(folder).[......].writeAsText(folder + "_processed");
        env.execute();
}

There is one full processing chain per folder (a loop is creating them) because 
the name of the output is the same as the name of the input, and I did not want 
to create a specific "rolling-output" with bucketers.

So, yes, I could develop a source that supports a list of folders and puts the 
source folder name into the produced records. My processing could also handle 
tuples where one field is the source folder, use it as a key where appropriate, 
and finally, yes, I could create a sink that dispatches the records to the 
correct output folder according to that field of the tuple.
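Independent of Flink, that dispatch step boils down to a group-by on the folder field. A minimal sketch in plain Java (the Tagged record shape and folder names are hypothetical, standing in for a Tuple2 whose first field carries the source folder):

```java
import java.util.*;
import java.util.stream.*;

public class DispatchSketch {
    // Hypothetical record: folder = source folder the line came from,
    // payload = the processed data itself.
    record Tagged(String folder, String payload) {}

    // Group records by their folder field so each group can later be
    // written to "<folder>_processed", as the per-folder jobs do today.
    static Map<String, List<String>> dispatch(List<Tagged> records) {
        return records.stream().collect(Collectors.groupingBy(
                Tagged::folder,
                Collectors.mapping(Tagged::payload, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Tagged> records = List.of(
                new Tagged("hour_00", "a"),
                new Tagged("hour_01", "b"),
                new Tagged("hour_00", "c"));
        System.out.println(dispatch(records).get("hour_00")); // [a, c]
    }
}
```

In a single Flink job the same grouping would happen via keyBy/groupBy on that field, so one chain could serve all folders instead of one chain per folder.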

But at first that seemed too complicated (premature optimization), so I coded my 
job the "naïve" way and then split it, because I thought there would be some 
sort of recycling of those buffers between jobs.

If the recycling works in normal conditions, maybe the question is whether it 
also works when multiple jobs are run from within the same jar vs. running the 
jar multiple times?




PS: I don't have the log files at hand; the app runs on a separate, secured 
platform. Sure, I could try to reproduce the issue with a mock app, but I'm not 
sure it would help the discussion.
