Hi!

Sorry, I won't be able to implement this anytime soon. So far I have only shared my ideas on it. 

Greets. Rico. 



> On 25.08.2015 at 17:52, Stephan Ewen <se...@apache.org> wrote:
> 
> Hi Rico!
> 
> Can you give us an update on your status here? We actually need something 
> like this as well (and pretty urgent), so we would jump in
> and implement this, unless you have something already.
> 
> Stephan
> 
> 
>> On Thu, Aug 20, 2015 at 12:13 PM, Stephan Ewen <se...@apache.org> wrote:
>> BTW: This is becoming a dev discussion, maybe we should move it to that list...
>> 
>>> On Thu, Aug 20, 2015 at 12:12 PM, Stephan Ewen <se...@apache.org> wrote:
>>> Yes, one needs exactly that: a mechanism to seek the output stream back to 
>>> the last checkpointed position, in order to overwrite duplicates.
>>> 
>>> I think HDFS is not going to make this easy; there is basically no seek for 
>>> writes. Not sure how to solve this, other than writing to tmp files and 
>>> copying them to the final location upon success.
>>> 
>>> Apache Flume must have solved this issue in some way; it may be worth 
>>> looking into how they did it.
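>>> 
>>> The tmp-file variant could look roughly like this, sketched against the 
>>> plain Hadoop FileSystem API (untested; the class and the file naming are 
>>> made up for illustration):
>>> 
>>> import java.io.IOException;
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.fs.Path;
>>> 
>>> public class TmpFileCommitter {
>>> 
>>>     private final FileSystem fs;
>>>     private final Path finalDir;
>>>     private Path currentTmp;
>>>     private FSDataOutputStream out;
>>> 
>>>     public TmpFileCommitter(Path finalDir) throws IOException {
>>>         this.fs = FileSystem.get(new Configuration());
>>>         this.finalDir = finalDir;
>>>     }
>>> 
>>>     // open a fresh tmp file; everything until the next commit goes in here
>>>     public void openTmp(long checkpointId) throws IOException {
>>>         currentTmp = new Path(finalDir, "_tmp-" + checkpointId);
>>>         out = fs.create(currentTmp, true);
>>>     }
>>> 
>>>     public void write(byte[] record) throws IOException {
>>>         out.write(record);
>>>         out.write('\n');
>>>     }
>>> 
>>>     // on a successful checkpoint: close the tmp file and rename it into place
>>>     public void commit(long checkpointId) throws IOException {
>>>         out.close();
>>>         fs.rename(currentTmp, new Path(finalDir, "part-" + checkpointId));
>>>     }
>>> 
>>>     // on recovery: a tmp file from a failed attempt is simply dropped
>>>     public void discard() throws IOException {
>>>         if (out != null) out.close();
>>>         if (currentTmp != null) fs.delete(currentTmp, false);
>>>     }
>>> }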
>>> 
>>>> On Thu, Aug 20, 2015 at 11:58 AM, Rico Bergmann <i...@ricobergmann.de> 
>>>> wrote:
>>>> My ideas for checkpointing:
>>>> 
>>>> I think writing to the destination should not depend on the checkpointing 
>>>> mechanism (otherwise the output would never be written to the destination 
>>>> if checkpointing is disabled). Instead I would keep track of the offsets of 
>>>> written and checkpointed records. When recovering, you would then somehow 
>>>> delete or overwrite the records after that offset. (But I don't really know 
>>>> whether this is as simple as I wrote it ;-) ).
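>>>> 
>>>> Just to make the bookkeeping part concrete, a very rough sketch (untested, 
>>>> and assuming the Checkpointed interface of the streaming API is the right 
>>>> hook; opening the output stream is left out):
>>>> 
>>>> import org.apache.flink.streaming.api.checkpoint.Checkpointed;
>>>> import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>> 
>>>> public class OffsetTrackingSink extends RichSinkFunction<String>
>>>>         implements Checkpointed<Long> {
>>>> 
>>>>     // opened in open(), left out of this sketch
>>>>     private transient FSDataOutputStream out;
>>>>     private long bytesWritten;
>>>> 
>>>>     @Override
>>>>     public void invoke(String record) throws Exception {
>>>>         byte[] bytes = (record + "\n").getBytes("UTF-8");
>>>>         out.write(bytes);
>>>>         bytesWritten += bytes.length;
>>>>     }
>>>> 
>>>>     @Override
>>>>     public Long snapshotState(long checkpointId, long checkpointTimestamp) {
>>>>         // everything up to this offset is covered by the checkpoint
>>>>         return bytesWritten;
>>>>     }
>>>> 
>>>>     @Override
>>>>     public void restoreState(Long state) {
>>>>         // on recovery, records after this offset are duplicates and would
>>>>         // have to be deleted or overwritten somehow (the open question)
>>>>         bytesWritten = state;
>>>>     }
>>>> }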
>>>> 
>>>> Regarding the rolling files, I would suggest making the values of the 
>>>> user-defined partitioning function part of the path or file name. Writing a 
>>>> record is then basically: extract the partition to write to, then add the 
>>>> record to a queue for this partition. Each queue has an output format 
>>>> assigned to it. On flushing, the output file is opened, the contents of the 
>>>> queue are written to it, and the file is closed again.
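>>>> 
>>>> In Java it would look roughly like this (just a sketch; the partition 
>>>> function and the file naming are placeholders):
>>>> 
>>>> import java.io.IOException;
>>>> import java.util.ArrayList;
>>>> import java.util.HashMap;
>>>> import java.util.List;
>>>> import java.util.Map;
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>> 
>>>> public class PartitionedWriter {
>>>> 
>>>>     private final Map<String, List<String>> queues = new HashMap<String, List<String>>();
>>>>     private final Path baseDir;
>>>>     private long fileCounter;
>>>> 
>>>>     public PartitionedWriter(Path baseDir) {
>>>>         this.baseDir = baseDir;
>>>>     }
>>>> 
>>>>     // the user-defined partitioning function, e.g. a date taken from the record
>>>>     private String partitionOf(String record) {
>>>>         return record.substring(0, 10);  // placeholder
>>>>     }
>>>> 
>>>>     public void write(String record) {
>>>>         String partition = partitionOf(record);
>>>>         List<String> queue = queues.get(partition);
>>>>         if (queue == null) {
>>>>             queue = new ArrayList<String>();
>>>>             queues.put(partition, queue);
>>>>         }
>>>>         queue.add(record);
>>>>     }
>>>> 
>>>>     // on flush: open one file per partition, write the queued records, close
>>>>     public void flush() throws IOException {
>>>>         FileSystem fs = FileSystem.get(new Configuration());
>>>>         for (Map.Entry<String, List<String>> e : queues.entrySet()) {
>>>>             Path file = new Path(new Path(baseDir, e.getKey()), "part-" + (fileCounter++));
>>>>             FSDataOutputStream out = fs.create(file, true);
>>>>             for (String record : e.getValue()) {
>>>>                 out.write((record + "\n").getBytes("UTF-8"));
>>>>             }
>>>>             out.close();
>>>>         }
>>>>         queues.clear();
>>>>     }
>>>> }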
>>>> 
>>>> Does this sound reasonable?
>>>> 
>>>> 
>>>> 
>>>>> On 20.08.2015 at 10:40, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>> 
>>>>> Yes, this seems like a good approach. We should probably not reuse the 
>>>>> KeySelector for this, but rather use a more use-case-specific type of 
>>>>> function that can create the desired filename from an input object.
>>>>> 
>>>>> This is only the first part, though. The hard bit would be implementing 
>>>>> rolling files and also integrating them with Flink's checkpointing 
>>>>> mechanism. For the integration with checkpointing you could maybe use 
>>>>> "staging files": all elements are put into a staging file, and then, when 
>>>>> the notification about a completed checkpoint is received, the contents of 
>>>>> this file would be moved (or appended) to the actual destination.
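>>>>> 
>>>>> For the notification part, something like this could work (completely 
>>>>> untested; I am assuming the CheckpointNotifier interface is the right hook 
>>>>> and leaving out how the staging file is opened):
>>>>> 
>>>>> import org.apache.flink.streaming.api.checkpoint.CheckpointNotifier;
>>>>> import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
>>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>> import org.apache.hadoop.fs.Path;
>>>>> 
>>>>> public class StagingSink extends RichSinkFunction<String> implements CheckpointNotifier {
>>>>> 
>>>>>     // initialized in open(), left out of this sketch
>>>>>     private transient FileSystem fs;
>>>>>     private transient FSDataOutputStream staging;
>>>>>     private transient Path stagingPath;
>>>>>     private final String finalDir = "/data/out";  // placeholder
>>>>> 
>>>>>     @Override
>>>>>     public void invoke(String value) throws Exception {
>>>>>         staging.write((value + "\n").getBytes("UTF-8"));
>>>>>     }
>>>>> 
>>>>>     @Override
>>>>>     public void notifyCheckpointComplete(long checkpointId) throws Exception {
>>>>>         // the checkpoint is safe: move the staged data to the real
>>>>>         // destination and start a new staging file
>>>>>         staging.close();
>>>>>         fs.rename(stagingPath, new Path(finalDir, "part-" + checkpointId));
>>>>>         staging = fs.create(stagingPath, true);
>>>>>     }
>>>>> }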
>>>>> 
>>>>> Do you have any ideas about the rolling files/checkpointing?
>>>>> 
>>>>>> On Thu, 20 Aug 2015 at 09:44 Rico Bergmann <i...@ricobergmann.de> wrote:
>>>>>> I'm thinking about implementing this. 
>>>>>> 
>>>>>> After looking into the Flink code, I would basically subclass 
>>>>>> FileOutputFormat into, let's say, a KeyedFileOutputFormat that gets an 
>>>>>> additional KeySelector object. The string the KeySelector returns is then 
>>>>>> appended to the path in the file system. 
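>>>>>> 
>>>>>> The new part would essentially just be the path derivation, something 
>>>>>> like this (KeySelector is the existing interface; the class around it is 
>>>>>> hypothetical):
>>>>>> 
>>>>>> import org.apache.flink.api.java.functions.KeySelector;
>>>>>> import org.apache.flink.core.fs.Path;
>>>>>> 
>>>>>> public class KeyedPathBuilder<T> {
>>>>>> 
>>>>>>     private final Path basePath;
>>>>>>     private final KeySelector<T, String> keySelector;
>>>>>> 
>>>>>>     public KeyedPathBuilder(Path basePath, KeySelector<T, String> keySelector) {
>>>>>>         this.basePath = basePath;
>>>>>>         this.keySelector = keySelector;
>>>>>>     }
>>>>>> 
>>>>>>     // basePath + "/" + whatever the KeySelector returns for this record
>>>>>>     public Path pathFor(T record) throws Exception {
>>>>>>         return new Path(basePath, keySelector.getKey(record));
>>>>>>     }
>>>>>> }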
>>>>>> 
>>>>>> Do you think this is a good approach?
>>>>>> 
>>>>>> Greets. Rico. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 16.08.2015 at 19:56, Stephan Ewen <se...@apache.org> wrote:
>>>>>>> 
>>>>>>> If you are up for it, this would be a very nice addition to Flink, a 
>>>>>>> great contribution :-)
>>>>>>> 
>>>>>>>> On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>>>> Hi!
>>>>>>>> 
>>>>>>>> This should definitely be possible in Flink. Pretty much exactly like 
>>>>>>>> you describe it.
>>>>>>>> 
>>>>>>>> You need a custom version of the HDFS sink with some logic for when to 
>>>>>>>> roll over to a new file.
>>>>>>>> 
>>>>>>>> You can also make the sink "exactly once" by integrating it with the 
>>>>>>>> checkpointing. For that, you would probably need to keep the current 
>>>>>>>> path and output stream offsets as of the last checkpoint, so you can 
>>>>>>>> resume from that offset and overwrite records to avoid duplicates. If 
>>>>>>>> that is not possible, you would probably buffer records between 
>>>>>>>> checkpoints and only write on checkpoints.
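>>>>>>>> 
>>>>>>>> The roll-over logic itself can be fairly simple, for example size-based 
>>>>>>>> (just a sketch, the threshold and the file naming are arbitrary):
>>>>>>>> 
>>>>>>>> import java.io.IOException;
>>>>>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>> 
>>>>>>>> public class RollingWriter {
>>>>>>>> 
>>>>>>>>     private static final long MAX_BYTES_PER_FILE = 128L * 1024 * 1024;
>>>>>>>> 
>>>>>>>>     private final FileSystem fs;
>>>>>>>>     private final Path dir;
>>>>>>>>     private int part;
>>>>>>>>     private FSDataOutputStream out;
>>>>>>>> 
>>>>>>>>     public RollingWriter(FileSystem fs, Path dir) throws IOException {
>>>>>>>>         this.fs = fs;
>>>>>>>>         this.dir = dir;
>>>>>>>>         this.out = fs.create(new Path(dir, "part-" + part), true);
>>>>>>>>     }
>>>>>>>> 
>>>>>>>>     public void write(byte[] record) throws IOException {
>>>>>>>>         // roll over to a new file once the current one is "full"
>>>>>>>>         if (out.getPos() >= MAX_BYTES_PER_FILE) {
>>>>>>>>             out.close();
>>>>>>>>             part++;
>>>>>>>>             out = fs.create(new Path(dir, "part-" + part), true);
>>>>>>>>         }
>>>>>>>>         out.write(record);
>>>>>>>>     }
>>>>>>>> }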
>>>>>>>> 
>>>>>>>> Greetings,
>>>>>>>> Stephan
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn <hpz...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Did anybody think of (mis)using Flink streaming as an alternative 
>>>>>>>>> to Apache Flume just for ingesting data from Kafka (or other 
>>>>>>>>> streaming sources) into HDFS? Knowing that Flink can read from Kafka 
>>>>>>>>> and write to HDFS, I assume it should be possible, but is this a good 
>>>>>>>>> idea? 
>>>>>>>>> 
>>>>>>>>> Flume basically is about consuming data from somewhere, peeking into 
>>>>>>>>> each record and then directing it to a specific directory/file in 
>>>>>>>>> HDFS reliably. I've seen there is a FlumeSink, but would it be 
>>>>>>>>> possible to get the same functionality with
>>>>>>>>> Flink alone?
>>>>>>>>> 
>>>>>>>>> I've skimmed through the documentation and found the option to split 
>>>>>>>>> the output by key and the possibility to add multiple sinks. As I 
>>>>>>>>> understand it, Flink programs are generally static, so it would not be 
>>>>>>>>> possible to add/remove sinks at runtime?
>>>>>>>>> So you would need to implement a custom sink directing the records to 
>>>>>>>>> different files based on a key (e.g. the date)? Would it be difficult 
>>>>>>>>> to implement things like rolling outputs etc.? Or is it better to just 
>>>>>>>>> use Flume?
>>>>>>>>> 
>>>>>>>>> Best, 
>>>>>>>>> Hans-Peter
> 
