I guess I need to set the parallelism for the FlatMap to 1 to make sure I
read one file at a time. The downside I see with this is that I will not be
able to read in parallel from HDFS (and the files are huge).
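
Something like this is what I have in mind (just a sketch; AvroFileReader is
the FlatMap from your sketch below, and paths is the list collected on the
client):

  DataStream<EndSongCleanedPq> inStream = env
          .fromCollection(paths)           // fromCollection, since paths is a list
          .flatMap(new AvroFileReader())   // runs the AvroInputFormat per path
          .setParallelism(1);              // read one file at a time, in order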

I'll give it a try and see how much performance I lose.

cheers Martin

On Thu, Feb 18, 2016 at 2:32 PM, Stephan Ewen <se...@apache.org> wrote:

> Martin,
>
> I think you can approximate this in a fairly easy way:
>
>   - On the client, traverse your directories and collect the paths of all
> the files you need in a list.
>   - Then you have a source "env.fromElements(paths)".
>   - Then you flatMap, and in the FlatMap run the Avro input format (open it
> per path, then call it to get all elements).
>
> That gives you pretty much full control over the order in which the files
> are processed.
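>
> A rough sketch of such a FlatMap (untested, so treat it as a starting
> point rather than a finished implementation; it uses Flink's
> FlatMapFunction, Collector, AvroInputFormat, FileInputSplit, Path and
> Configuration classes):
>
>   public static class AvroFileReader
>           implements FlatMapFunction<String, EndSongCleanedPq> {
>
>       @Override
>       public void flatMap(String path, Collector<EndSongCleanedPq> out)
>               throws Exception {
>           AvroInputFormat<EndSongCleanedPq> format =
>                   new AvroInputFormat<>(new Path(path), EndSongCleanedPq.class);
>           format.configure(new Configuration());
>           // read every split of the file and emit all its records
>           for (FileInputSplit split : format.createInputSplits(1)) {
>               format.open(split);
>               while (!format.reachedEnd()) {
>                   EndSongCleanedPq record = format.nextRecord(null);
>                   if (record != null) {
>                       out.collect(record);
>                   }
>               }
>               format.close();
>           }
>       }
>   }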
>
> What do you think?
>
> Stephan
>
>
> On Wed, Feb 17, 2016 at 9:42 PM, Martin Neumann <mneum...@sics.se> wrote:
>
>> I forgot to mention that I'm using an AvroInputFormat to read the files
>> (that might be relevant to how the flag needs to be applied). See the code
>> snippet below:
>>
>> DataStream<EndSongCleanedPq> inStream =
>>         env.readFile(new AvroInputFormat<EndSongCleanedPq>(
>>                 new Path(filePath), EndSongCleanedPq.class), filePath);
>>
>>
>> On Wed, Feb 17, 2016 at 7:33 PM, Martin Neumann <mneum...@sics.se> wrote:
>>
>>> The program is a DataStream program; it usually gets its data from
>>> Kafka. It's an anomaly detection program that learns from the stream
>>> itself. The reason I want to read from files is to test different settings
>>> of the algorithm and compare them.
>>>
>>> I don't think I need to replay things in the exact order (which is not
>>> possible with parallel reads anyway), and I have written the program so it
>>> can deal with out-of-order events.
>>> I only need the subfolders to be processed roughly in order. It's fine to
>>> process some records from 01 before everything from 00 is finished, but if
>>> I get records from all 24 subfolders at the same time, things will break.
>>> If I set the flag, will it try to read data from all subdirectories in
>>> parallel, or will it go subdirectory by subdirectory?
>>>
>>> Also, can you point me to some documentation that shows how to set the
>>> flag?
>>>
>>> cheers Martin
>>>
>>>
>>>
>>>
>>> On Wed, Feb 17, 2016 at 11:49 AM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> Hi!
>>>>
>>>> Going through nested folders is pretty simple; there is a flag on the
>>>> FileInputFormat that makes sure nested files are read.
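>>>>
>>>> For example (from memory, so please double-check the exact key against
>>>> the docs):
>>>>
>>>>   Configuration parameters = new Configuration();
>>>>   parameters.setBoolean("recursive.file.enumeration", true);  // the flag
>>>>   format.configure(parameters);  // "format" being your FileInputFormat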
>>>>
>>>> The tricky part is that all "00" files should be read before the "01"
>>>> files. If you still want parallel reads, that means you need to sync at
>>>> some point: all parallel parts must finish the "00" work before any of
>>>> them may start on the "01" work.
>>>>
>>>> Is your training program a DataStream or a DataSet program?
>>>>
>>>> Stephan
>>>>
>>>> On Wed, Feb 17, 2016 at 1:16 AM, Martin Neumann <mneum...@sics.se>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a streaming machine learning job that usually runs with input
>>>>> from Kafka. To tweak the models I need to run it on some old data from
>>>>> HDFS.
>>>>>
>>>>> Unfortunately the data on HDFS is spread out over several subfolders.
>>>>> Basically I have a date directory with one subfolder for each hour;
>>>>> within those are the actual input files I'm interested in.
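>>>>>
>>>>> Roughly like this (the paths are just an example):
>>>>>
>>>>>   /data/2016-02-16/00/<input files>
>>>>>   /data/2016-02-16/01/<input files>
>>>>>   ...
>>>>>   /data/2016-02-16/23/<input files>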
>>>>>
>>>>> Basically what I need is a source that goes through the subfolders in
>>>>> order and streams the files into the program. I'm using event timestamps,
>>>>> so all files in 00 need to be processed before 01.
>>>>>
>>>>> Does anyone have an idea how to do this?
>>>>>
>>>>> cheers Martin
>>>>>
>>>>>
>>>>
>>>
>>
>
