Hi Matt,

Here’s an example of writing a DeserializationSchema for your POJOs: [1].
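
If it helps, [1] roughly boils down to something like this for a hypothetical POJO called Event (Jackson is just one way to do the byte conversion, and the packages of the schema interfaces have moved between Flink versions, so adjust the imports to your release):

    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.typeutils.TypeExtractor;
    import org.apache.flink.streaming.util.serialization.DeserializationSchema;
    import org.apache.flink.streaming.util.serialization.SerializationSchema;

    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.IOException;

    // "Event" stands for your POJO (public no-arg constructor plus public
    // fields or getters/setters).
    public class EventSchema implements DeserializationSchema<Event>, SerializationSchema<Event> {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        @Override
        public byte[] serialize(Event element) {
            try {
                return MAPPER.writeValueAsBytes(element);
            } catch (IOException e) {
                throw new RuntimeException("Could not serialize event", e);
            }
        }

        @Override
        public Event deserialize(byte[] message) throws IOException {
            return MAPPER.readValue(message, Event.class);
        }

        @Override
        public boolean isEndOfStream(Event nextElement) {
            return false; // the WebSocket feed is unbounded
        }

        @Override
        public TypeInformation<Event> getProducedType() {
            return TypeExtractor.getForClass(Event.class);
        }
    }

Because the class implements both SerializationSchema and DeserializationSchema, you can hand the same schema to the FlinkKafkaProducer and the FlinkKafkaConsumer.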

As for simply writing messages from the WebSocket to Kafka using a Flink job: 
while that is absolutely viable, I would not recommend it,
mainly because you never know when you might need to temporarily shut the job 
down (perhaps for a version upgrade).

Shutting down the WebSocket-consuming job would then, of course, mean missing 
messages for the duration of the downtime.
It would perhaps be simpler to have a separate Kafka producer application that 
directly ingests messages from the WebSocket into Kafka.
You wouldn’t want this application to be down at all, so that all messages 
safely land in Kafka first; I would recommend keeping this part
as simple as possible.
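
A rough sketch of what I mean, using the plain kafka-clients producer (the 
broker address, topic, and class name are placeholders; wire whatever WebSocket 
library you use to the onMessage callback):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class WebSocketToKafkaBridge {

        private final KafkaProducer<String, String> producer;
        private final String topic;

        public WebSocketToKafkaBridge(String bootstrapServers, String topic) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("acks", "all"); // wait for the brokers to acknowledge writes
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            this.producer = new KafkaProducer<>(props);
            this.topic = topic;
        }

        // Wire this to the message callback of your WebSocket client.
        public void onMessage(String rawMessage) {
            producer.send(new ProducerRecord<>(topic, rawMessage));
        }

        public void close() {
            producer.close();
        }
    }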

From there, as Till explained, your Flink processing pipelines can rely on 
Kafka’s replayability to provide exactly-once processing guarantees on your 
data.
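
For illustration, the reading side could look roughly like this; the topic 
name, group id, broker address, and the EventSchema sketch above are 
placeholders, and the exact FlinkKafkaConsumer class depends on your Kafka and 
connector versions. Enabling checkpointing is the piece that ties the 
exactly-once guarantees together:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;

    import java.util.Properties;

    public class EventProcessingJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpointing is what turns Kafka's replayability into
            // exactly-once guarantees for the job's state.
            env.enableCheckpointing(5000);

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "event-processing");

            DataStream<Event> events = env.addSource(
                    new FlinkKafkaConsumer09<>("events", new EventSchema(), props));

            // ... your actual transformations go here ...
            events.print();

            env.execute("Event processing");
        }
    }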

Best,
Gordon


[1] 
https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/exercises/datastream_java/utils/TaxiRideSchema.java




On November 16, 2016 at 1:07:12 PM, Dromit (dromitl...@gmail.com) wrote:

"So in your case I would directly ingest my messages into Kafka"

I will do that through a custom SourceFunction that reads the messages from the 
WebSocket, creates simple Java objects (POJOs) and sinks them into a Kafka topic 
using a FlinkKafkaProducer, if that makes sense.
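
Roughly this skeleton (the actual WebSocket client and the message parsing are 
placeholders):

    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class WebSocketSource implements SourceFunction<Event> {

        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Event> ctx) throws Exception {
            // open the WebSocket connection here
            while (running) {
                String raw = nextMessage();    // blocking read from the WebSocket
                ctx.collect(parseEvent(raw));  // map the raw message onto the POJO
            }
        }

        @Override
        public void cancel() {
            running = false;
            // close the WebSocket connection here
        }

        // Placeholders for whatever WebSocket client and parsing I end up using.
        private String nextMessage() throws Exception { return null; }
        private Event parseEvent(String raw) { return new Event(); }
    }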

The problem now is that I need a DeserializationSchema for my class. I read that 
Flink is able to de/serialize POJOs on its own, but I'm still required to provide 
a serializer to create the FlinkKafkaProducer (and FlinkKafkaConsumer).

Any idea or example? Should I create a DeserializationSchema for each POJO 
class I want to put into a Kafka stream?



On Tue, Nov 15, 2016 at 7:43 AM, Till Rohrmann <trohrm...@apache.org> wrote:
Hi Matt,

as you've stated, Flink is a stream processor and as such it needs to get its 
inputs from somewhere. Flink can provide you with up to exactly-once processing 
guarantees. But in order to do this, it requires a replayable source, because 
in case of a failure you might have to reprocess parts of the input you had 
already processed prior to the failure. Kafka is such a source, and people use 
it because it happens to be one of the most popular and widespread open source 
message queues/distributed logs.

If you don't require strong processing guarantees, then you can simply use the 
WebSocket source. But, for any serious use case, you probably want to have 
these guarantees because otherwise you just might calculate bogus results. So 
in your case I would directly ingest my messages into Kafka and then let Flink 
read from the created topic to do the processing.

Cheers,
Till

On Tue, Nov 15, 2016 at 8:14 AM, Dromit <dromitl...@gmail.com> wrote:
Hello,

As far as I've seen, there are a lot of projects using Flink and Kafka 
together, but I'm not seeing the point of that. Let me know what you think 
about this.

1. If I'm not wrong, Kafka provides basically two things: storage (record 
retention) and fault tolerance in case of failure, while Flink mostly cares 
about the transformation of such records. That means I can write a pipeline 
with Flink alone, and even distribute it on a cluster, but in case of failure 
some records may be lost, or I won't be able to reprocess the data if I change 
the code, since the records are not kept in Flink by default (only when written 
out to a sink). Is that right?

2. In my use case the records come from a WebSocket and I create a custom class 
based on messages on that socket. Should I put those records inside a Kafka 
topic right away using a Flink custom source (SourceFunction) with a Kafka sink 
(FlinkKafkaProducer), and independently create a Kafka source (KafkaConsumer) 
for that topic and pipe the Flink transformations there? Is that data flow fine?

Basically, what I'm trying to understand with both questions is how and why 
people are using Flink and Kafka together.

Regards,
Matt

