Hey Flavio, this pull request got merged: https://github.com/apache/incubator-flink/pull/260
With this, you can now simulate an append behavior with Flink:

- You have a directory in HDFS where you put the files you want to append: hdfs:///data/appendjob/
- Each time you want to append something, you run your job and let it create a new directory in hdfs:///data/appendjob/, let's say hdfs:///data/appendjob/run-X/
- Now you can instruct the job to read the full output by letting it recursively read hdfs:///data/appendjob/.
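A minimal sketch of the pattern (untested; the recursive.file.enumeration key comes from the merged pull request, while the class name, paths, and run naming are just illustrative):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FileSystem.WriteMode;

    public class AppendJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Step 1, the "append": write this run's records into a fresh
            // subdirectory instead of overwriting earlier output.
            String runDir = "hdfs:///data/appendjob/run-" + System.currentTimeMillis();
            DataSet<String> newRecords = env.fromElements("record-1", "record-2");
            newRecords.writeAsText(runDir, WriteMode.NO_OVERWRITE);
            env.execute("append run");

            // Step 2, the read: enumerate hdfs:///data/appendjob/ recursively,
            // picking up every run-* subdirectory written so far.
            Configuration parameters = new Configuration();
            parameters.setBoolean("recursive.file.enumeration", true);
            DataSet<String> allRecords = env.readTextFile("hdfs:///data/appendjob/")
                    .withParameters(parameters);
            allRecords.writeAsText("hdfs:///data/appendjob-merged", WriteMode.OVERWRITE);
            env.execute("read all appends");
        }
    }

WriteMode.NO_OVERWRITE makes a run fail rather than silently clobber an existing run-X directory, which is what you want when every run is supposed to add new data.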
I hope that helps.

Best,
Robert

On Tue, Dec 9, 2014 at 3:37 PM, Flavio Pompermaier <[email protected]> wrote:

> I didn't know about that difference! So Flink is very smart :)
> Thanks for the explanation, Robert.
>
> On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <[email protected]> wrote:
>
>> Vasia is working on support for reading directories recursively, but I
>> thought this would also allow you to simulate something like an append.
>>
>> Did you notice an issue when reading many small files with Flink? Flink
>> handles the reading of files differently than Spark.
>>
>> Spark basically starts a task for each file / file split. So if you have
>> millions of small files in your HDFS, Spark will start millions of tasks
>> (queued, however). You need to coalesce in Spark to reduce the number of
>> partitions; by default, operators re-use the partitioning of the
>> preceding operator.
>> Flink, on the other hand, starts a fixed number of tasks which read
>> multiple input splits; the splits are lazily assigned to these tasks
>> once they are ready to process new splits.
>> Flink will not create a partition for each (small) input file. I expect
>> Flink to handle that case a bit better than Spark (I haven't tested it,
>> though).
>>
>> On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <[email protected]> wrote:
>>
>>> Great! Appending data to HDFS will be a very useful feature!
>>> I think you should then also consider how to read directories
>>> containing a lot of small files efficiently. I know that this can be
>>> quite inefficient; that's why Spark gives you a coalesce operation to
>>> deal with such cases.
>>>
>>> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <[email protected]> wrote:
>>>
>>>> Hi!
>>>>
>>>> Yes, I took a look into this. I hope I'll be able to find some time
>>>> to work on it this week.
>>>> I'll keep you updated :)
>>>>
>>>> Cheers,
>>>> V.
>>>>
>>>> On 9 December 2014 at 14:03, Robert Metzger <[email protected]> wrote:
>>>>
>>>>> It seems that Vasia has started working on adding support for
>>>>> recursive reading: https://issues.apache.org/jira/browse/FLINK-1307.
>>>>> I'm still occupied with refactoring the YARN client; the HDFS
>>>>> refactoring is next on my list.
>>>>>
>>>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>>
>>>>>> Any news about this, Robert?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Flavio
>>>>>>
>>>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think there is no support for appending to HDFS files in Flink
>>>>>>> yet.
>>>>>>> HDFS supports it, but some adjustments in the system are required
>>>>>>> (not deleting / creating directories before writing; exposing the
>>>>>>> append() methods in the FS abstractions).
>>>>>>>
>>>>>>> I'm planning to work on the FS abstractions in the next week; if I
>>>>>>> have enough time, I can also look into adding support for append().
>>>>>>>
>>>>>>> Another approach could be adding support for recursively reading
>>>>>>> directories with the input formats. Vasia asked for this feature a
>>>>>>> few days ago on the mailing list. If we had that feature, you could
>>>>>>> just write to a directory and read the parent directory (with all
>>>>>>> the dirs for the appends).
>>>>>>>
>>>>>>> Best,
>>>>>>> Robert
>>>>>>>
>>>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>> how can I efficiently append data (as plain strings or also Avro
>>>>>>>> records) to HDFS using Flink?
>>>>>>>> Do I need to use Flume, or can I avoid it?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Flavio
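A minimal sketch of the Spark-side coalesce mentioned above, assuming Spark's Java API (the input path and partition count are illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CoalesceExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("coalesce-example");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Reading a directory of many small files yields one partition
            // (and therefore one task) per file split ...
            JavaRDD<String> lines = sc.textFile("hdfs:///data/manysmallfiles/");

            // ... so merge the partitions down to a manageable number before
            // running further transformations.
            JavaRDD<String> merged = lines.coalesce(64);

            System.out.println("partitions after coalesce: " + merged.partitions().size());
            sc.stop();
        }
    }

coalesce merges existing partitions without a full shuffle; repartition would redistribute all data with a shuffle, which is usually unnecessary when the goal is just fewer tasks.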
