Hi Matt,

Here’s an example of writing a DeserializationSchema for your POJOs: [1].
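
If it helps, [1] roughly boils down to something like this for a hypothetical POJO called Event (Jackson is just one way to do the byte conversion, and the packages of the schema interfaces have moved between Flink versions, so adjust the imports to your release):

    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.typeutils.TypeExtractor;
    import org.apache.flink.streaming.util.serialization.DeserializationSchema;
    import org.apache.flink.streaming.util.serialization.SerializationSchema;

    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.IOException;

    // "Event" stands for your POJO (public no-arg constructor plus public
    // fields or getters/setters).
    public class EventSchema implements DeserializationSchema<Event>, SerializationSchema<Event> {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        @Override
        public byte[] serialize(Event element) {
            try {
                return MAPPER.writeValueAsBytes(element);
            } catch (IOException e) {
                throw new RuntimeException("Could not serialize event", e);
            }
        }

        @Override
        public Event deserialize(byte[] message) throws IOException {
            return MAPPER.readValue(message, Event.class);
        }

        @Override
        public boolean isEndOfStream(Event nextElement) {
            return false; // the WebSocket feed is unbounded
        }

        @Override
        public TypeInformation<Event> getProducedType() {
            return TypeExtractor.getForClass(Event.class);
        }
    }

Because the class implements both SerializationSchema and DeserializationSchema, you can hand the same schema to the FlinkKafkaProducer and the FlinkKafkaConsumer.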

As for simply writing messages from the WebSocket to Kafka using a Flink job: 
while that is absolutely viable, I would not recommend it,
mainly because you never know when you might need to temporarily shut the job 
down (perhaps for a version upgrade).

Shutting down the WebSocket-consuming job would then, of course, mean missing 
messages for the duration of the downtime.
It would perhaps be simpler to have a separate Kafka producer application that 
directly ingests messages from the WebSocket into Kafka.
You wouldn’t want this application to be down at all, so that all messages 
safely land in Kafka first; I would recommend keeping this part
as simple as possible.
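
A rough sketch of what I mean, using the plain kafka-clients producer (the 
broker address, topic, and class name are placeholders; wire whatever WebSocket 
library you use to the onMessage callback):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class WebSocketToKafkaBridge {

        private final KafkaProducer<String, String> producer;
        private final String topic;

        public WebSocketToKafkaBridge(String bootstrapServers, String topic) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("acks", "all"); // wait for the brokers to acknowledge writes
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            this.producer = new KafkaProducer<>(props);
            this.topic = topic;
        }

        // Wire this to the message callback of your WebSocket client.
        public void onMessage(String rawMessage) {
            producer.send(new ProducerRecord<>(topic, rawMessage));
        }

        public void close() {
            producer.close();
        }
    }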

From there, as Till explained, your Flink processing pipelines can rely on 
Kafka’s replayability to provide exactly-once processing guarantees on your 
data.
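
For illustration, the reading side could look roughly like this; the topic 
name, group id, broker address, and the EventSchema sketch above are 
placeholders, and the exact FlinkKafkaConsumer class depends on your Kafka and 
connector versions. Enabling checkpointing is the piece that ties the 
exactly-once guarantees together:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;

    import java.util.Properties;

    public class EventProcessingJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpointing is what turns Kafka's replayability into
            // exactly-once guarantees for the job's state.
            env.enableCheckpointing(5000);

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "event-processing");

            DataStream<Event> events = env.addSource(
                    new FlinkKafkaConsumer09<>("events", new EventSchema(), props));

            // ... your actual transformations go here ...
            events.print();

            env.execute("Event processing");
        }
    }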

Best,
Gordon


[1] 
https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/exercises/datastream_java/utils/TaxiRideSchema.java




On November 16, 2016 at 1:07:12 PM, Dromit (dromitl...@gmail.com) wrote:

"So in your case I would directly ingest my messages into Kafka"

I will do that through a custom SourceFunction that reads the messages from the 
WebSocket, creates simple Java objects (POJOs) and sinks them into a Kafka topic 
using a FlinkKafkaProducer, if that makes sense.
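
Roughly this skeleton (the actual WebSocket client and the message parsing are 
placeholders):

    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class WebSocketSource implements SourceFunction<Event> {

        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Event> ctx) throws Exception {
            // open the WebSocket connection here
            while (running) {
                String raw = nextMessage();    // blocking read from the WebSocket
                ctx.collect(parseEvent(raw));  // map the raw message onto the POJO
            }
        }

        @Override
        public void cancel() {
            running = false;
            // close the WebSocket connection here
        }

        // Placeholders for whatever WebSocket client and parsing I end up using.
        private String nextMessage() throws Exception { return null; }
        private Event parseEvent(String raw) { return new Event(); }
    }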

The problem now is that I need a DeserializationSchema for my class. I read that 
Flink is able to de/serialize POJOs on its own, but I'm still required to provide 
a serializer to create the FlinkKafkaProducer (and FlinkKafkaConsumer).

Any idea or example? Should I create a DeserializationSchema for each POJO 
class I want to put into a Kafka stream?



On Tue, Nov 15, 2016 at 7:43 AM, Till Rohrmann <trohrm...@apache.org> wrote:
Hi Matt,

as you've stated, Flink is a stream processor and as such it needs to get its 
inputs from somewhere. Flink can provide you with up to exactly-once processing 
guarantees. But in order to do this, it requires a replayable source, because 
in case of a failure you might have to reprocess parts of the input you had 
already processed prior to the failure. Kafka is such a source, and people use 
it because it happens to be one of the most popular and widespread open source 
message queues/distributed logs.

If you don't require strong processing guarantees, then you can simply use the 
WebSocket source. But, for any serious use case, you probably want to have 
these guarantees because otherwise you just might calculate bogus results. So 
in your case I would directly ingest my messages into Kafka and then let Flink 
read from the created topic to do the processing.

Cheers,
Till

On Tue, Nov 15, 2016 at 8:14 AM, Dromit <dromitl...@gmail.com> wrote:
Hello,

As far as I've seen, there are a lot of projects using Flink and Kafka 
together, but I'm not seeing the point of that. Let me know what you think 
about this.

1. If I'm not wrong, Kafka provides basically two things: storage (record 
retention) and fault tolerance in case of failure, while Flink mostly cares 
about the transformation of such records. That means I can write a pipeline 
with Flink alone, and even distribute it on a cluster, but in case of failure 
some records may be lost, or I won't be able to reprocess the data if I change 
the code, since the records are not kept in Flink by default (only when written 
out to a sink). Is that right?

2. In my use case the records come from a WebSocket and I create a custom class 
based on messages on that socket. Should I put those records inside a Kafka 
topic right away using a Flink custom source (SourceFunction) with a Kafka sink 
(FlinkKafkaProducer), and independently create a Kafka source (KafkaConsumer) 
for that topic and pipe the Flink transformations there? Is that data flow fine?

Basically, what I'm trying to understand with both questions is how and why 
people are using Flink and Kafka together.

Regards,
Matt

