I forgot to mention I'm using an AvroInputFormat to read the file (that might be relevant to how the flag needs to be applied). See the code snippet below:
DataStream<EndSongCleanedPq> inStream =
    env.readFile(new AvroInputFormat<EndSongCleanedPq>(new Path(filePath), EndSongCleanedPq.class), filePath);

On Wed, Feb 17, 2016 at 7:33 PM, Martin Neumann <mneum...@sics.se> wrote:

> The program is a DataStream program; it usually gets its data from
> kafka. It's an anomaly detection program that learns from the stream
> itself. The reason I want to read from files is to test different settings
> of the algorithm and compare them.
>
> I think I don't need to replay things in the exact order (which is not
> possible with parallel reads anyway), and I have written the program so it
> can deal with out-of-order events.
> I only need the subfolders to be processed roughly in order. It's fine to
> process some stuff from 01 before everything from 00 is finished; if I get
> records from all 24 subfolders at the same time, things will break though.
> If I set the flag, will it try to get data from all subdirs in parallel, or
> will it go subdir by subdir?
>
> Also, can you point me to some documentation or something where I can see
> how to set the flag?
>
> cheers Martin
>
> On Wed, Feb 17, 2016 at 11:49 AM, Stephan Ewen <se...@apache.org> wrote:
>
>> Hi!
>>
>> Going through nested folders is pretty simple; there is a flag on the
>> FileInputFormat that makes sure those are read.
>>
>> Tricky is the part that all "00" files should be read before the "01"
>> files. If you still want parallel reads, that means you need to sync at
>> some point: wait for all parallel parts to finish with the "00" work before
>> anyone may start with the "01" work.
>>
>> Is your training program a DataStream or a DataSet program?
>>
>> Stephan
>>
>> On Wed, Feb 17, 2016 at 1:16 AM, Martin Neumann <mneum...@sics.se> wrote:
>>
>>> Hi,
>>>
>>> I have a streaming machine learning job that usually runs with input
>>> from kafka. To tweak the models I need to run on some old data from HDFS.
>>>
>>> Unfortunately the data on HDFS is spread out over several subfolders.
>>> Basically I have a date folder with one subfolder for each hour; within
>>> those are the actual input files I'm interested in.
>>>
>>> Basically what I need is a source that goes through the subfolders in
>>> order and streams the files into the program. I'm using event timestamps,
>>> so all files in 00 need to be processed before 01.
>>>
>>> Does anyone have an idea on how to do this?
>>>
>>> cheers Martin
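For reference: the flag Stephan mentions is, in the Flink API of that era, FileInputFormat#setNestedFileEnumeration(true), which AvroInputFormat inherits, so it can be set on the format object before passing it to env.readFile. That flag only enables recursive enumeration; it does not promise subfolder-by-subfolder ordering. One workaround is to enumerate the hour folders outside of Flink, in sorted order, and feed each folder to the job in turn. Below is a minimal, Flink-free sketch of just that ordering step, using java.nio against a local directory (a real setup would use the Hadoop FileSystem API for HDFS; the class and file names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class HourOrderedEnumeration {

    /**
     * Lists all files under root, grouped by hour subfolder ("00".."23").
     * Lexicographic order on zero-padded hour names is chronological order,
     * so files from "00" always come before files from "01", and so on.
     */
    static List<Path> filesInHourOrder(Path root) throws IOException {
        List<Path> hourDirs;
        try (Stream<Path> dirs = Files.list(root)) {
            hourDirs = dirs.filter(Files::isDirectory)
                           .sorted(Comparator.comparing(p -> p.getFileName().toString()))
                           .collect(Collectors.toList());
        }
        List<Path> result = new ArrayList<>();
        for (Path hourDir : hourDirs) {
            try (Stream<Path> files = Files.list(hourDir)) {
                files.filter(Files::isRegularFile)
                     .sorted()
                     .forEach(result::add);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Build a toy layout: <tmp>/00/part-0.avro, <tmp>/01/part-0.avro, ...
        // Folders are created out of order on purpose to show the sort works.
        Path root = Files.createTempDirectory("hours");
        for (String hour : new String[] {"01", "00", "02"}) {
            Path dir = Files.createDirectory(root.resolve(hour));
            Files.createFile(dir.resolve("part-0.avro"));
        }
        for (Path p : filesInHourOrder(root)) {
            System.out.println(root.relativize(p));
        }
    }
}
```

Each sorted folder's file list could then be handed to a separate readFile call (or to a custom source) so that all of "00" is submitted before "01", matching the "roughly in order" requirement without needing strictly ordered parallel reads.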