Hey Flavio, this pull request got merged: https://github.com/apache/incubator-flink/pull/260
With this, you can now simulate an append behavior with Flink:

- You have a directory in HDFS where you put the files you want to append: hdfs:///data/appendjob/
- Each time you want to append something, you run your job and let it create a new directory in hdfs:///data/appendjob/, let's say hdfs:///data/appendjob/run-X/
- Now you can instruct the job to read the full output by letting it recursively read hdfs:///data/appendjob/.
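A minimal sketch of the pattern (untested; the recursive.file.enumeration key comes from the merged pull request, while the class name, paths, and run naming are just illustrative):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FileSystem.WriteMode;

    public class AppendJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Step 1, the "append": write this run's records into a fresh
            // subdirectory instead of overwriting earlier output.
            String runDir = "hdfs:///data/appendjob/run-" + System.currentTimeMillis();
            DataSet<String> newRecords = env.fromElements("record-1", "record-2");
            newRecords.writeAsText(runDir, WriteMode.NO_OVERWRITE);
            env.execute("append run");

            // Step 2, the read: enumerate hdfs:///data/appendjob/ recursively,
            // picking up every run-* subdirectory written so far.
            Configuration parameters = new Configuration();
            parameters.setBoolean("recursive.file.enumeration", true);
            DataSet<String> allRecords = env.readTextFile("hdfs:///data/appendjob/")
                    .withParameters(parameters);
            allRecords.writeAsText("hdfs:///data/appendjob-merged", WriteMode.OVERWRITE);
            env.execute("read all appends");
        }
    }

WriteMode.NO_OVERWRITE makes a run fail rather than silently clobber an existing run-X directory, which is what you want when every run is supposed to add new data.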
I hope that helps.

Best,
Robert

On Tue, Dec 9, 2014 at 3:37 PM, Flavio Pompermaier <[email protected]> wrote:

> I didn't know about that difference! So Flink is very smart :)
> Thanks for the explanation, Robert.
>
> On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <[email protected]> wrote:
>
>> Vasia is working on support for reading directories recursively, but I
>> thought this would also allow you to simulate something like an append.
>>
>> Did you notice an issue when reading many small files with Flink? Flink
>> handles the reading of files differently than Spark.
>>
>> Spark basically starts a task for each file / file split. So if you have
>> millions of small files in your HDFS, Spark will start millions of tasks
>> (queued, however). You need to coalesce in Spark to reduce the number of
>> partitions; by default, operators re-use the partitioning of the
>> preceding operator.
>> Flink, on the other hand, starts a fixed number of tasks which read
>> multiple input splits; the splits are lazily assigned to these tasks
>> once they are ready to process new splits.
>> Flink will not create a partition for each (small) input file. I expect
>> Flink to handle that case a bit better than Spark (I haven't tested it,
>> though).
>>
>> On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <[email protected]> wrote:
>>
>>> Great! Appending data to HDFS will be a very useful feature!
>>> I think you should then also consider how to read directories
>>> containing a lot of small files efficiently. I know that this can be
>>> quite inefficient; that's why Spark gives you a coalesce operation to
>>> deal with such cases.
>>>
>>> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <[email protected]> wrote:
>>>
>>>> Hi!
>>>>
>>>> Yes, I took a look into this. I hope I'll be able to find some time
>>>> to work on it this week.
>>>> I'll keep you updated :)
>>>>
>>>> Cheers,
>>>> V.
>>>>
>>>> On 9 December 2014 at 14:03, Robert Metzger <[email protected]> wrote:
>>>>
>>>>> It seems that Vasia has started working on adding support for
>>>>> recursive reading: https://issues.apache.org/jira/browse/FLINK-1307.
>>>>> I'm still occupied with refactoring the YARN client; the HDFS
>>>>> refactoring is next on my list.
>>>>>
>>>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>>
>>>>>> Any news about this, Robert?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Flavio
>>>>>>
>>>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think there is no support for appending to HDFS files in Flink
>>>>>>> yet.
>>>>>>> HDFS supports it, but some adjustments in the system are required
>>>>>>> (not deleting / creating directories before writing; exposing the
>>>>>>> append() methods in the FS abstractions).
>>>>>>>
>>>>>>> I'm planning to work on the FS abstractions in the next week; if I
>>>>>>> have enough time, I can also look into adding support for append().
>>>>>>>
>>>>>>> Another approach could be adding support for recursively reading
>>>>>>> directories with the input formats. Vasia asked for this feature a
>>>>>>> few days ago on the mailing list. If we had that feature, you could
>>>>>>> just write to a directory and read the parent directory (with all
>>>>>>> the dirs for the appends).
>>>>>>>
>>>>>>> Best,
>>>>>>> Robert
>>>>>>>
>>>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>> how can I efficiently append data (as plain strings or also Avro
>>>>>>>> records) to HDFS using Flink?
>>>>>>>> Do I need to use Flume, or can I avoid it?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Flavio
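A minimal sketch of the Spark-side coalesce mentioned above, assuming Spark's Java API (the input path and partition count are illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CoalesceExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("coalesce-example");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Reading a directory of many small files yields one partition
            // (and therefore one task) per file split ...
            JavaRDD<String> lines = sc.textFile("hdfs:///data/manysmallfiles/");

            // ... so merge the partitions down to a manageable number before
            // running further transformations.
            JavaRDD<String> merged = lines.coalesce(64);

            System.out.println("partitions after coalesce: " + merged.partitions().size());
            sc.stop();
        }
    }

coalesce merges existing partitions without a full shuffle; repartition would redistribute all data with a shuffle, which is usually unnecessary when the goal is just fewer tasks.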
