Re: output writer

Fabian Hueske Tue, 08 Sep 2015 07:50:33 -0700

Hi Michele,

you need to directly use a FileSystem client (e.g., Hadoop's) to create and
write to files. Have a look at the FileOutputFormat [1] which does this for
a single file per operator instance / partition. Instead of creating a
single file, you need to create one file for each key. However, you want to
avoid to have too many files open at a time but also avoid to create too
many files containing only a few records. If you use HDFS, this is
especially important, because HDFS is bad at handling many small files.
Only recent versions of HDFS support appending to files. If you have an
older version you have to create a new file for a key if you do not have an
open file handle for it.


There are multiple ways to control the number of open files and reduce the
number of files:

1) You can partition the data (as you already suggested) to move all
records with the same key to the same operator.
2) If you use the batch DataSet API you can sort the data using
sortPartition() such that each operator instance has only one file open at
a time.
3) Instead of doing a full sort, you could also use combineGroup() to
partially sort the data
4) Have a pool of open file handles and an LRU kind of eviction policy to
decide which file to close whenever you need open a new one.

Implementing this is not trivial. You can also organize the files per key
in folders. Have a look at the InitializeOnMaster and FinalizeOnMaster
hooks which are called once before a job is started and after all instance
of a task finished.

Let me know, if you need more information or if something is not clear.

Cheers, Fabian

[1]
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/io/FileOutputFormat.java

2015-09-08 12:33 GMT+02:00 Michele Bertoni <michele1.bert...@mail.polimi.it>
:

> Hi guys,
> sorry for late answer but I am still working to get this done but I don’t
> understand something
>
> I do have my own writeRecord function, but that function is not able to
> open new output stream or anything else so I don’t understand how to do that
>
> at first I think I should at least partition my data according to the
> output key (each key to one file)
> then I need to name the file exactly with that key
> but I don’t know how to go on
>
> thanks
> michele
>
>
>
> Il giorno 30/lug/2015, alle ore 12:53, Radu Tudoran <
> radu.tudo...@huawei.com> ha scritto:
>
> Re-hi,
>
> I have double –checked and actually there is an OutputFormat interface in
> flink which can be extended.
> I believe that for this kind of specific formats as mentioned by Michele,
> each can develop the appropriate format.
> On the other hand, having more outputformats I believe is something that
> could be contributed. We should identify a couple of common formats. The
> first one that comes in my mind is to have something for writing to memory
> (e.g. memory buffer)
>
>
>
> Dr. Radu Tudoran
> Research Engineer
> IT R&D Division
>
> <image001.png>
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> European Research Center
> Riesstrasse 25, 80992 München
>
> E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>*
> Mobile: +49 15209084330
> Telephone: +49 891588344173
>
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
> Managing Director: Jingwen TAO, Wanzhou MENG, Lifang CHEN
> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
> Geschäftsführer: Jingwen TAO, Wanzhou MENG, Lifang CHEN
>
> *From:* Fabian Hueske [mailto:fhue...@gmail.com <fhue...@gmail.com>]
> *Sent:* Thursday, July 30, 2015 11:34 AM
> *To:* user@flink.apache.org
> *Subject:* Re: output writer
>
>
> Hi Michele, hi Radu
> Flink does not have such an OutputFormat, but I agree, it would be a
> valuable addition.
>
> Radu's approach looks like the way to go to implement this feature.
>
> @Radu, is there a way to contribute your OutputFormat to Flink?
> Cheers, Fabian
>
> 2015-07-30 10:24 GMT+02:00 Radu Tudoran <radu.tudo...@huawei.com>:
> Hi,
>
> My 2 cents ... based on something similar that I have tried.
> I have created an own implementation for OutputFormat where you define
> your own logic for what happens in the "writerecord function". This logic
> would consist in making a distinction between the ids and write each to the
> appropriate file
>
> Might be that other solutions exist
>
>
> Dr. Radu Tudoran
> Research Engineer
> IT R&D Division
>
>
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> European Research Center
> Riesstrasse 25, 80992 München
>
> E-mail: radu.tudo...@huawei.com
> Mobile: +49 15209084330
> Telephone: +49 891588344173
>
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
> Managing Director: Jingwen TAO, Wanzhou MENG, Lifang CHEN
> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
> Geschäftsführer: Jingwen TAO, Wanzhou MENG, Lifang CHEN
>
> -----Original Message-----
> From: Michele Bertoni [mailto:michele1.bert...@mail.polimi.it]
> Sent: Thursday, July 30, 2015 10:15 AM
> To: user@flink.apache.org
> Subject: output writer
>
> Hi everybody,
> I have a question about the writer
> I have to save my dataset in different files according to a field of the
> tuples
>
> let’s assume I have a groupId in the tuple, I need to store each group in
> a different file, with a custom name: any idea on how i can do that?
>
>
> thanks!
> Michele
>
>
>

Re: output writer

Reply via email to