Re: streaming hdfs sub folders

2016-02-26 Thread Stephan Ewen
Hi! Have a look at the class-level comments in "InputFormat". They should describe how input formats first generate splits (for parallelization) on the master, and how the workers then open each split. So you need something like this: AvroInputFormat avroInputFormat = new AvroInputFormat(new
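For illustration, the lifecycle described in that comment could be driven by hand roughly as follows; MyAvroType, the HDFS path, and the exact package locations are placeholders and may differ between Flink versions:

    import org.apache.flink.api.java.io.AvroInputFormat;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FileInputSplit;
    import org.apache.flink.core.fs.Path;

    AvroInputFormat<MyAvroType> format =
            new AvroInputFormat<>(new Path("hdfs:///some/folder"), MyAvroType.class);
    format.configure(new Configuration());

    // Normally done once on the master:
    FileInputSplit[] splits = format.createInputSplits(1);

    // Normally done by the workers, one split at a time:
    for (FileInputSplit split : splits) {
        format.open(split);
        while (!format.reachedEnd()) {
            MyAvroType record = format.nextRecord(null);
            if (record != null) {
                // process the record
            }
        }
        format.close();
    }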

Re: streaming hdfs sub folders

2016-02-23 Thread Martin Neumann
I'm not very familiar with the inner workings of the InputFormats. Calling .open() got rid of the NullPointerException, but the stream still produces no output. As a temporary solution I wrote a batch job that just unions all the different datasets and puts them (sorted) into a single folder. cheers
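The flatten-and-sort workaround could look roughly like this in the DataSet API; MyAvroType, the "timestamp" field, and all paths are assumptions made for the sketch:

    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.io.AvroInputFormat;
    import org.apache.flink.api.java.io.AvroOutputFormat;
    import org.apache.flink.core.fs.Path;

    public class FlattenSubFolders {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Placeholder list of hourly sub folders.
            String[] hourFolders = {
                    "hdfs:///data/2016-02-16/00",
                    "hdfs:///data/2016-02-16/01"
            };

            DataSet<MyAvroType> all = null;
            for (String folder : hourFolders) {
                DataSet<MyAvroType> part =
                        env.createInput(new AvroInputFormat<>(new Path(folder), MyAvroType.class));
                all = (all == null) ? part : all.union(part);
            }

            // Global sort by an assumed "timestamp" field, written out as one folder.
            all.sortPartition("timestamp", Order.ASCENDING).setParallelism(1)
               .write(new AvroOutputFormat<>(MyAvroType.class), "hdfs:///data/flattened")
               .setParallelism(1);

            env.execute("flatten and sort sub folders");
        }
    }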

Re: streaming hdfs sub folders

2016-02-19 Thread Robert Metzger
Hi Martin, where is the null pointer exception thrown? I think you didn't call the open() method of the AvroInputFormat. Maybe that's the issue. On Thu, Feb 18, 2016 at 5:01 PM, Martin Neumann wrote: > I tried to implement your idea but I'm getting NullPointer exceptions from

Re: streaming hdfs sub folders

2016-02-18 Thread Martin Neumann
I tried to implement your idea but I'm getting NullPointerExceptions from the AvroInputFormat. Any idea what I'm doing wrong? See the code below: public static void main(String[] args) throws Exception { // set up the execution environment final StreamExecutionEnvironment env =
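The code in the snippet is cut off, but a rough reconstruction of such a job, including the configure()/open() calls the format needs before nextRecord(), might look like this; MyAvroType and the paths are placeholders:

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.java.io.AvroInputFormat;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FileInputSplit;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class StreamFromSubFolders {
        public static void main(String[] args) throws Exception {
            // set up the execution environment
            final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Placeholder: the file/folder paths collected on the client.
            DataStream<String> paths = env.fromElements(
                    "hdfs:///data/2016-02-16/00", "hdfs:///data/2016-02-16/01");

            DataStream<MyAvroType> records = paths.flatMap(new RichFlatMapFunction<String, MyAvroType>() {
                @Override
                public void flatMap(String path, Collector<MyAvroType> out) throws Exception {
                    AvroInputFormat<MyAvroType> format =
                            new AvroInputFormat<>(new Path(path), MyAvroType.class);
                    // configure() and open() must run before nextRecord();
                    // a missing open() is the likely cause of the NPE discussed above.
                    format.configure(new Configuration());
                    for (FileInputSplit split : format.createInputSplits(1)) {
                        format.open(split);
                        while (!format.reachedEnd()) {
                            MyAvroType record = format.nextRecord(null);
                            if (record != null) {
                                out.collect(record);
                            }
                        }
                        format.close();
                    }
                }
            });

            records.print();
            env.execute("read avro sub folders");
        }
    }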

Re: streaming hdfs sub folders

2016-02-18 Thread Martin Neumann
I guess I need to set the parallelism for the FlatMap to 1 to make sure I read one file at a time. The downside I see with this is that I will not be able to read in parallel from HDFS (and the files are huge). I'll give it a try and see how much performance I lose. cheers Martin On Thu, Feb 18,
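Pinning the reading FlatMap to a single slot is a one-line change; AvroReadingFlatMap is just a stand-in name for whatever function does the reading:

    // Files are read one at a time, at the cost of losing parallel HDFS reads.
    DataStream<MyAvroType> records = paths
            .flatMap(new AvroReadingFlatMap())
            .setParallelism(1);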

Re: streaming hdfs sub folders

2016-02-18 Thread Stephan Ewen
Martin, I think you can approximate this in an easy way like this: - On the client, you traverse your directories to collect all files that you need, collect all file paths in a list. - Then you have a source "env.fromElements(paths)". - Then you flatMap and in the FlatMap, run the Avro
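The client-side traversal could be done with Flink's FileSystem API, for example roughly like this; the root path is a placeholder, and the sketch simply assumes that sorting the paths lexicographically gives the desired hour order:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.flink.core.fs.FileStatus;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;

    // Collect every file under hdfs:///data/<date>/<hour>/ on the client.
    List<String> paths = new ArrayList<>();
    Path root = new Path("hdfs:///data/2016-02-16");
    FileSystem fs = root.getFileSystem();
    for (FileStatus hourDir : fs.listStatus(root)) {
        if (hourDir.isDir()) {
            for (FileStatus file : fs.listStatus(hourDir.getPath())) {
                paths.add(file.getPath().toString());
            }
        }
    }
    Collections.sort(paths); // "00" files before "01" files, assuming zero-padded names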

Re: streaming hdfs sub folders

2016-02-17 Thread Martin Neumann
The program is a DataStream program; it usually gets the data from Kafka. It's an anomaly detection program that learns from the stream itself. The reason I want to read from files is to test different settings of the algorithm and compare them. I think I don't need to replay things in the

Re: streaming hdfs sub folders

2016-02-17 Thread Stephan Ewen
Hi! Going through nested folders is pretty simple; there is a flag on the FileInputFormat that makes sure those are read. The tricky part is that all "00" files should be read before the "01" files. If you still want parallel reads, that means you need to sync at some point, wait for all
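If the flag in question is the recursive-enumeration setting, enabling it might look like this (same placeholder type and paths as in the sketches above; the configuration key may differ between Flink versions):

    // Let the FileInputFormat traverse nested sub folders as well.
    AvroInputFormat<MyAvroType> format =
            new AvroInputFormat<>(new Path("hdfs:///data"), MyAvroType.class);
    Configuration conf = new Configuration();
    conf.setBoolean("recursive.file.enumeration", true); // FileInputFormat.ENUMERATE_NESTED_FILES_FLAG
    format.configure(conf);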

streaming hdfs sub folders

2016-02-16 Thread Martin Neumann
Hi, I have a streaming machine learning job that usually runs with input from Kafka. To tweak the models I need to run on some old data from HDFS. Unfortunately the data on HDFS is spread out over several subfolders. Basically I have a folder per date with one subfolder for each hour; within those are the