I didn't know about that difference! Flink is very smart, then :) Thanks for the explanation, Robert.
On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <[email protected]> wrote:

> Vasia is working on support for reading directories recursively. But I
> thought that this would also allow you to simulate something like an
> append.
>
> Did you notice an issue when reading many small files with Flink? Flink
> handles the reading of files differently than Spark.
>
> Spark basically starts a task for each file / file split. So if you have
> millions of small files in your HDFS, Spark will start millions of tasks
> (queued, however). You need to coalesce in Spark to reduce the number of
> partitions; by default, operators re-use the partitioning of the
> preceding operator.
> Flink, on the other hand, starts a fixed number of tasks which read
> multiple input splits; the splits are lazily assigned to these tasks
> once they are ready to process new splits.
> Flink will not create a partition for each (small) input file. I expect
> Flink to handle that case a bit better than Spark (I haven't tested it
> though).
>
> On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <[email protected]> wrote:
>
>> Great! Appending data to HDFS will be a very useful feature!
>> I think you should then also consider how to read directories
>> containing a lot of small files efficiently. I know this can be quite
>> inefficient, which is why Spark gives you a coalesce operation to deal
>> with such cases.
>>
>> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <[email protected]> wrote:
>>
>>> Hi!
>>>
>>> Yes, I took a look into this. I hope I'll be able to find some time to
>>> work on it this week.
>>> I'll keep you updated :)
>>>
>>> Cheers,
>>> V.
>>>
>>> On 9 December 2014 at 14:03, Robert Metzger <[email protected]> wrote:
>>>
>>>> It seems that Vasia has started working on adding support for
>>>> recursive reading: https://issues.apache.org/jira/browse/FLINK-1307.
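[Editor's note: the split-assignment difference Robert describes can be sketched outside either framework. The following is a minimal, hypothetical Python simulation, not Flink or Spark code: a fixed pool of tasks lazily pulls input splits from a shared queue, so the number of tasks stays constant no matter how many small files there are.]

```python
from queue import Queue, Empty
from threading import Thread

def process_splits_lazily(splits, num_tasks):
    """Fixed number of tasks pull splits from a shared queue (Flink-style).

    Contrast with a per-file scheme, which would create len(splits) tasks.
    """
    q = Queue()
    for s in splits:
        q.put(s)

    processed = []  # list.append is thread-safe in CPython

    def worker():
        while True:
            try:
                split = q.get_nowait()  # lazily grab the next unassigned split
            except Empty:
                return  # no splits left; this task finishes
            processed.append(split.upper())  # stand-in for real processing

    tasks = [Thread(target=worker) for _ in range(num_tasks)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return processed

# A million "files" still only ever occupy num_tasks concurrent tasks.
splits = ["file-%04d" % i for i in range(1000)]
result = process_splits_lazily(splits, num_tasks=4)
```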
>>>> I'm still occupied with refactoring the YARN client; the HDFS
>>>> refactoring is next on my list.
>>>>
>>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>
>>>>> Any news about this, Robert?
>>>>>
>>>>> Thanks in advance,
>>>>> Flavio
>>>>>
>>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think there is no support for appending to HDFS files in Flink
>>>>>> yet. HDFS supports it, but some adjustments in the system are
>>>>>> required (not deleting / creating directories before writing;
>>>>>> exposing the append() methods in the FS abstractions).
>>>>>>
>>>>>> I'm planning to work on the FS abstractions next week; if I have
>>>>>> enough time, I can also look into adding support for append().
>>>>>>
>>>>>> Another approach could be adding support for recursively reading
>>>>>> directories with the input formats. Vasia asked for this feature a
>>>>>> few days ago on the mailing list. If we had that feature, you could
>>>>>> just write to a new directory for each append and read the parent
>>>>>> directory (with all the directories for the appends).
>>>>>>
>>>>>> Best,
>>>>>> Robert
>>>>>>
>>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>> how can I efficiently append data (as plain strings or also Avro
>>>>>>> records) to HDFS using Flink?
>>>>>>> Do I need to use Flume, or can I avoid it?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Flavio
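[Editor's note: the workaround Robert suggests, writing each "append" into its own subdirectory and then reading the parent directory recursively, can be illustrated with plain local file I/O. This is a hypothetical sketch: the function names are made up, and a real Flink job would use an input format over HDFS rather than os.walk.]

```python
import os

def append_as_new_file(base_dir, run_id, records):
    """Simulate an 'append' by writing each batch into its own subdirectory."""
    out_dir = os.path.join(base_dir, "run-%d" % run_id)
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "part-0"), "w") as f:
        f.write("\n".join(records))

def read_recursively(base_dir):
    """Read every file under base_dir, as a recursive input format would."""
    records = []
    for root, _dirs, files in os.walk(base_dir):
        for name in sorted(files):
            with open(os.path.join(root, name)) as f:
                records.extend(f.read().splitlines())
    return records
```

Each batch lands in a fresh directory, so nothing is ever overwritten, and a recursive read over the parent reassembles the full dataset.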
