I didn't know about that difference! Flink is very smart, then :) Thanks for the explanation, Robert.
On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <[email protected]> wrote:

> Vasia is working on support for reading directories recursively. But I
> thought that this would also allow you to simulate something like an
> append.
>
> Did you notice an issue when reading many small files with Flink? Flink
> handles the reading of files differently than Spark.
>
> Spark basically starts a task for each file / file split. So if you have
> millions of small files in your HDFS, Spark will start millions of tasks
> (queued, however). You need to coalesce in Spark to reduce the number of
> partitions; by default, operators re-use the partitioning of the
> preceding operator.
> Flink, on the other hand, starts a fixed number of tasks which read
> multiple input splits; the splits are lazily assigned to these tasks
> once they are ready to process new splits.
> Flink will not create a partition for each (small) input file. I expect
> Flink to handle that case a bit better than Spark (I haven't tested it
> though).
>
> On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <[email protected]> wrote:
>
>> Great! Appending data to HDFS will be a very useful feature!
>> I think you should then also consider how to read directories
>> containing a lot of small files efficiently. I know this can be quite
>> inefficient, which is why Spark gives you a coalesce operation to deal
>> with such cases.
>>
>> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <[email protected]> wrote:
>>
>>> Hi!
>>>
>>> Yes, I took a look into this. I hope I'll be able to find some time to
>>> work on it this week.
>>> I'll keep you updated :)
>>>
>>> Cheers,
>>> V.
>>>
>>> On 9 December 2014 at 14:03, Robert Metzger <[email protected]> wrote:
>>>
>>>> It seems that Vasia has started working on adding support for
>>>> recursive reading: https://issues.apache.org/jira/browse/FLINK-1307.
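[Editor's note: the split-assignment difference Robert describes can be sketched outside either framework. The following is a minimal, hypothetical Python simulation, not Flink or Spark code: a fixed pool of tasks lazily pulls input splits from a shared queue, so the number of tasks stays constant no matter how many small files there are.]

```python
from queue import Queue, Empty
from threading import Thread

def process_splits_lazily(splits, num_tasks):
    """Fixed number of tasks pull splits from a shared queue (Flink-style).

    Contrast with a per-file scheme, which would create len(splits) tasks.
    """
    q = Queue()
    for s in splits:
        q.put(s)

    processed = []  # list.append is thread-safe in CPython

    def worker():
        while True:
            try:
                split = q.get_nowait()  # lazily grab the next unassigned split
            except Empty:
                return  # no splits left; this task finishes
            processed.append(split.upper())  # stand-in for real processing

    tasks = [Thread(target=worker) for _ in range(num_tasks)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return processed

# A million "files" still only ever occupy num_tasks concurrent tasks.
splits = ["file-%04d" % i for i in range(1000)]
result = process_splits_lazily(splits, num_tasks=4)
```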
>>>> I'm still occupied with refactoring the YARN client; the HDFS
>>>> refactoring is next on my list.
>>>>
>>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>
>>>>> Any news about this, Robert?
>>>>>
>>>>> Thanks in advance,
>>>>> Flavio
>>>>>
>>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think there is no support for appending to HDFS files in Flink
>>>>>> yet. HDFS supports it, but some adjustments in the system are
>>>>>> required (not deleting / creating directories before writing;
>>>>>> exposing the append() methods in the FS abstractions).
>>>>>>
>>>>>> I'm planning to work on the FS abstractions next week; if I have
>>>>>> enough time, I can also look into adding support for append().
>>>>>>
>>>>>> Another approach could be adding support for recursively reading
>>>>>> directories with the input formats. Vasia asked for this feature a
>>>>>> few days ago on the mailing list. If we had that feature, you could
>>>>>> just write to a new directory for each append and read the parent
>>>>>> directory (with all the directories for the appends).
>>>>>>
>>>>>> Best,
>>>>>> Robert
>>>>>>
>>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>> how can I efficiently append data (as plain strings or also Avro
>>>>>>> records) to HDFS using Flink?
>>>>>>> Do I need to use Flume, or can I avoid it?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Flavio
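[Editor's note: the workaround Robert suggests, writing each "append" into its own subdirectory and then reading the parent directory recursively, can be illustrated with plain local file I/O. This is a hypothetical sketch: the function names are made up, and a real Flink job would use an input format over HDFS rather than os.walk.]

```python
import os

def append_as_new_file(base_dir, run_id, records):
    """Simulate an 'append' by writing each batch into its own subdirectory."""
    out_dir = os.path.join(base_dir, "run-%d" % run_id)
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "part-0"), "w") as f:
        f.write("\n".join(records))

def read_recursively(base_dir):
    """Read every file under base_dir, as a recursive input format would."""
    records = []
    for root, _dirs, files in os.walk(base_dir):
        for name in sorted(files):
            with open(os.path.join(root, name)) as f:
                records.extend(f.read().splitlines())
    return records
```

Each batch lands in a fresh directory, so nothing is ever overwritten, and a recursive read over the parent reassembles the full dataset.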
