I forgot to mention I'm using an AvroInputFormat to read the file (that might be relevant to how the flag needs to be applied). See the code snippet below:
DataStream<EndSongCleanedPq> inStream =
    env.readFile(new AvroInputFormat<EndSongCleanedPq>(new Path(filePath), EndSongCleanedPq.class), filePath);

On Wed, Feb 17, 2016 at 7:33 PM, Martin Neumann <mneum...@sics.se> wrote:

> The program is a DataStream program; it usually gets its data from
> kafka. It's an anomaly detection program that learns from the stream
> itself. The reason I want to read from files is to test different settings
> of the algorithm and compare them.
>
> I think I don't need to replay things in the exact order (which is not
> possible with parallel reads anyway), and I have written the program so it
> can deal with out-of-order events.
> I only need the subfolders to be processed roughly in order. It's fine to
> process some stuff from 01 before everything from 00 is finished; if I get
> records from all 24 subfolders at the same time, things will break though.
> If I set the flag, will it try to get data from all subdirs in parallel, or
> will it go subdir by subdir?
>
> Also, can you point me to some documentation or something where I can see
> how to set the flag?
>
> cheers Martin
>
> On Wed, Feb 17, 2016 at 11:49 AM, Stephan Ewen <se...@apache.org> wrote:
>
>> Hi!
>>
>> Going through nested folders is pretty simple; there is a flag on the
>> FileInputFormat that makes sure those are read.
>>
>> Tricky is the part that all "00" files should be read before the "01"
>> files. If you still want parallel reads, that means you need to sync at
>> some point: wait for all parallel parts to finish with the "00" work before
>> anyone may start with the "01" work.
>>
>> Is your training program a DataStream or a DataSet program?
>>
>> Stephan
>>
>> On Wed, Feb 17, 2016 at 1:16 AM, Martin Neumann <mneum...@sics.se> wrote:
>>
>>> Hi,
>>>
>>> I have a streaming machine learning job that usually runs with input
>>> from kafka. To tweak the models I need to run on some old data from HDFS.
>>>
>>> Unfortunately the data on HDFS is spread out over several subfolders.
>>> Basically I have a date folder with one subfolder for each hour; within
>>> those are the actual input files I'm interested in.
>>>
>>> Basically what I need is a source that goes through the subfolders in
>>> order and streams the files into the program. I'm using event timestamps,
>>> so all files in 00 need to be processed before 01.
>>>
>>> Does anyone have an idea on how to do this?
>>>
>>> cheers Martin
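For reference: the flag Stephan mentions is, in the Flink API of that era, FileInputFormat#setNestedFileEnumeration(true), which AvroInputFormat inherits, so it can be set on the format object before passing it to env.readFile. That flag only enables recursive enumeration; it does not promise subfolder-by-subfolder ordering. One workaround is to enumerate the hour folders outside of Flink, in sorted order, and feed each folder to the job in turn. Below is a minimal, Flink-free sketch of just that ordering step, using java.nio against a local directory (a real setup would use the Hadoop FileSystem API for HDFS; the class and file names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class HourOrderedEnumeration {

    /**
     * Lists all files under root, grouped by hour subfolder ("00".."23").
     * Lexicographic order on zero-padded hour names is chronological order,
     * so files from "00" always come before files from "01", and so on.
     */
    static List<Path> filesInHourOrder(Path root) throws IOException {
        List<Path> hourDirs;
        try (Stream<Path> dirs = Files.list(root)) {
            hourDirs = dirs.filter(Files::isDirectory)
                           .sorted(Comparator.comparing(p -> p.getFileName().toString()))
                           .collect(Collectors.toList());
        }
        List<Path> result = new ArrayList<>();
        for (Path hourDir : hourDirs) {
            try (Stream<Path> files = Files.list(hourDir)) {
                files.filter(Files::isRegularFile)
                     .sorted()
                     .forEach(result::add);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Build a toy layout: <tmp>/00/part-0.avro, <tmp>/01/part-0.avro, ...
        // Folders are created out of order on purpose to show the sort works.
        Path root = Files.createTempDirectory("hours");
        for (String hour : new String[] {"01", "00", "02"}) {
            Path dir = Files.createDirectory(root.resolve(hour));
            Files.createFile(dir.resolve("part-0.avro"));
        }
        for (Path p : filesInHourOrder(root)) {
            System.out.println(root.relativize(p));
        }
    }
}
```

Each sorted folder's file list could then be handed to a separate readFile call (or to a custom source) so that all of "00" is submitted before "01", matching the "roughly in order" requirement without needing strictly ordered parallel reads.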