Hi!
Have a look at the class-level comments in "InputFormat". They should
describe how input formats first generate splits (for parallelization) on
the master, and how the workers then open each split.
So you need something like this:
AvroInputFormat avroInputFormat = new AvroInputFormat(new
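A hedged sketch of that lifecycle, assuming an Avro-generated class `MyRecord` and a made-up HDFS path (both placeholders, not from this thread; import paths for AvroInputFormat vary across Flink versions):

```java
import org.apache.flink.api.java.io.AvroInputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

// Sketch of the InputFormat lifecycle: configure -> createInputSplits
// (on the master) -> open/nextRecord/close (on each worker).
AvroInputFormat<MyRecord> avroInputFormat =
        new AvroInputFormat<>(new Path("hdfs:///some/input.avro"), MyRecord.class);
avroInputFormat.configure(new Configuration());

FileInputSplit[] splits = avroInputFormat.createInputSplits(1);
for (FileInputSplit split : splits) {
    avroInputFormat.open(split);   // skipping open() is a common cause of NPEs
    MyRecord reuse = null;
    while (!avroInputFormat.reachedEnd()) {
        reuse = avroInputFormat.nextRecord(reuse);
        // process the record here
    }
    avroInputFormat.close();
}
```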
I'm not very familiar with the inner workings of the InputFormats. Calling
.open() got rid of the NullPointerException, but the stream still produces no output.
As a temporary solution I wrote a batch job that just unions all the
different datasets and puts them (sorted) into a single folder.
cheers
Hi Martin,
where is the null pointer exception thrown?
I think you didn't call the open() method of the AvroInputFormat. Maybe
that's the issue.
On Thu, Feb 18, 2016 at 5:01 PM, Martin Neumann wrote:
I tried to implement your idea, but I'm getting NullPointerExceptions from
the AvroInputFormat. Any idea what I'm doing wrong?
See the code below:
public static void main(String[] args) throws Exception {
    // set up the execution environment
    final StreamExecutionEnvironment env =
I guess I need to set the parallelism for the FlatMap to 1 to make sure I
read one file at a time. The downside I see with this is that I will not be
able to read in parallel from HDFS (and the files are huge).
I'll give it a try and see how much performance I lose.
cheers Martin
On Thu, Feb 18,
Martin,
I think you can approximate this in an easy way like this:
- On the client, you traverse your directories to collect all files that
you need, collect all file paths in a list.
- Then you have a source "env.fromElements(paths)".
- Then you flatMap and in the FlatMap, run the Avro
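The client-side traversal in the first step can be done with plain Java; a minimal sketch (class name and directory layout are just for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CollectPaths {
    // Recursively collect all regular files under root and sort them
    // lexicographically, so files under "00" come before those under "01".
    public static List<String> collect(Path root) throws IOException {
        try (Stream<Path> stream = Files.walk(root)) {
            return stream.filter(Files::isRegularFile)
                         .map(Path::toString)
                         .sorted()
                         .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // build a tiny example layout: one folder per hour
        Path root = Files.createTempDirectory("hours");
        Files.createDirectories(root.resolve("01"));
        Files.createDirectories(root.resolve("00"));
        Files.createFile(root.resolve("01").resolve("part-0"));
        Files.createFile(root.resolve("00").resolve("part-0"));
        for (String p : collect(root)) {
            System.out.println(p);
        }
    }
}
```

The resulting list would then feed the "env.fromElements(paths)" source from the step above.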
The program is a DataStream program; usually it gets the data from
Kafka. It's an anomaly detection program that learns from the stream
itself. The reason I want to read from files is to test different settings
of the algorithm and compare them.
I think I don't need to replay things in the
Hi!
Going through nested folders is pretty simple: there is a flag on the
FileInputFormat that makes sure nested files are read as well.
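For reference, a sketch of setting that flag in the batch API, assuming the "recursive.file.enumeration" parameter key (worth double-checking against your Flink version; the HDFS path is a placeholder):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

// Enable recursive enumeration of nested directories for a file source.
Configuration parameters = new Configuration();
parameters.setBoolean("recursive.file.enumeration", true);

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> lines = env.readTextFile("hdfs:///data/day")
                           .withParameters(parameters);
```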
The tricky part is that all "00" files should be read before the "01"
files. If you still want parallel reads, that means you need to sync at
some point, wait for all
Hi,
I have a streaming machine learning job that usually runs with input from
Kafka. To tweak the models I need to run it on some old data from HDFS.
Unfortunately the data on HDFS is spread out over several subfolders.
Basically I have a date folder with one subfolder for each hour; within
those are the