Hi!

Going through nested folders is pretty simple: there is a flag on the
FileInputFormat that makes sure files in nested subfolders are read as well.
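For example, with the DataSet API something along these lines should do it
(a rough, untested sketch; the path is just a placeholder):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// enable recursive enumeration so files in nested subfolders are read as well
Configuration parameters = new Configuration();
parameters.setBoolean("recursive.file.enumeration", true);

// "hdfs:///data" is a placeholder path
DataSet<String> lines = env.readTextFile("hdfs:///data")
        .withParameters(parameters);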

The tricky part is that all "00" files should be read before the "01"
files. If you still want parallel reads, that means you need to synchronize at
some point: all parallel readers have to finish the "00" work before
anyone may start with the "01" work.
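If giving up parallel reads is acceptable, a simple (untested) sketch would be a
non-parallel source that lists the hour folders, sorts them by name, and emits
one folder after the other. The class name and paths below are just placeholders:

import org.apache.flink.core.fs.FileStatus;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;

public class OrderedFolderSource implements SourceFunction<String> {

    private final String baseDir; // e.g. "hdfs:///data/<date>"
    private volatile boolean running = true;

    public OrderedFolderSource(String baseDir) {
        this.baseDir = baseDir;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        Path base = new Path(baseDir);
        FileSystem fs = base.getFileSystem();

        // list the hour folders ("00", "01", ...) and sort them by name
        FileStatus[] hours = fs.listStatus(base);
        Arrays.sort(hours, (a, b) -> a.getPath().getName().compareTo(b.getPath().getName()));

        for (FileStatus hour : hours) {
            // read every file of this hour before moving on to the next one
            for (FileStatus file : fs.listStatus(hour.getPath())) {
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(file.getPath())))) {
                    String line;
                    while (running && (line = reader.readLine()) != null) {
                        ctx.collect(line);
                    }
                }
            }
            if (!running) {
                return;
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

You would add it with env.addSource(new OrderedFolderSource(...)); since it is a
plain (non-parallel) SourceFunction it runs with parallelism 1, which is what
enforces the ordering.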

Is your training program a DataStream or a DataSet program?

Stephan

On Wed, Feb 17, 2016 at 1:16 AM, Martin Neumann <mneum...@sics.se> wrote:

> Hi,
>
> I have a streaming machine learning job that usually runs with input from
> Kafka. To tweak the models, I need to run it on some old data from HDFS.
>
> Unfortunately the data on HDFS is spread out over several subfolders.
> Basically I have a folder per date, with one subfolder for each hour;
> within those are the actual input files I'm interested in.
>
> What I need is a source that goes through the subfolders in order and
> streams the files into the program. I'm using event timestamps, so all
> files in 00 need to be processed before those in 01.
>
> Does anyone have an idea how to do this?
>
> cheers Martin
>
>
