Hello,

I have seen several use cases similar to yours running Kafka / Kafka
Streams in production. That said, your concerns are valid:

* Big messages: 5MB is indeed large, but not extremely big for Kafka. A
single message of hundreds of MBs or over a GB would be a different story
(you may need to consider chunking it). You would still need to make sure
Kafka is configured to accept messages of that size, i.e. the broker-side
message.max.bytes (or the per-topic max.message.bytes) plus the producer's
max.request.size, and you may also need to tune your clients' network
socket buffers for optimal networking performance; a config sketch follows
below.
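
For concreteness, here is a minimal sketch of the client-side knobs
involved; the 6MB values are placeholders sized to your 5MB payloads, not
recommendations:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class LargeMessageConfigs {
        // The producer must be allowed to send requests that fit the
        // largest message (the default cap is ~1MB).
        public static Properties producerProps() {
            Properties p = new Properties();
            p.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 6 * 1024 * 1024);
            // Larger TCP send buffer for big payloads (placeholder value).
            p.put(ProducerConfig.SEND_BUFFER_CONFIG, 1024 * 1024);
            return p;
        }

        // Consumer fetch sizes must also cover the largest message.
        public static Properties consumerProps() {
            Properties p = new Properties();
            p.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 6 * 1024 * 1024);
            p.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, 1024 * 1024);
            return p;
        }
    }

On the broker / topic side the corresponding settings are
message.max.bytes (broker-wide), max.message.bytes (per-topic override),
and replica.fetch.max.bytes so that followers can replicate the large
messages too.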

* External calls in Streams: if the external service may be unavailable,
then your implementation should have a timeout so that it can either drop
the record or retry it later (some implementations would put it into a
retry queue, again stored as a Kafka topic, and then read from it later to
retry). Also note that Kafka Streams relies on the consumer to poll the
records, and if that `poll` call is not triggered in time because the
external API calls take too long, you would need to configure
max.poll.interval.ms to be long enough for this. Another caveat I can
think of is that Kafka Streams at the moment does not have
async-processing capabilities, i.e. if a single record takes too long on
an external call (or simply a local IO call), it would block all records
after it. So if this processing bottleneck is a common case for you, you
would probably need to write a custom processor for the external calls
yourself (a sketch follows below). In the future we do have plans to
support async processing in Kafka Streams, though.
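
To make the retry-queue idea concrete, here is a minimal sketch of such a
processor: it bounds the external call with a timeout and forwards
failures to a retry sink instead of blocking indefinitely. callExternalApi
and the child names "output-sink" / "retry-sink" are hypothetical
placeholders that you would wire up via Topology.addProcessor /
Topology.addSink; this is one way to do it, not the only one:

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import org.apache.kafka.streams.processor.api.Processor;
    import org.apache.kafka.streams.processor.api.ProcessorContext;
    import org.apache.kafka.streams.processor.api.Record;

    public class ExternalCallProcessor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;
        private final ExecutorService pool = Executors.newSingleThreadExecutor();

        @Override
        public void init(final ProcessorContext<String, String> context) {
            this.context = context;
        }

        @Override
        public void process(final Record<String, String> record) {
            final Future<String> result = pool.submit(() -> callExternalApi(record.value()));
            try {
                // Bound the call so a slow or down service cannot stall poll() forever.
                final String response = result.get(2, TimeUnit.SECONDS);
                context.forward(record.withValue(response), "output-sink");
            } catch (TimeoutException | ExecutionException | InterruptedException e) {
                result.cancel(true);
                // Park the record on a retry topic rather than blocking the records
                // behind it; a separate stream can re-consume that topic later.
                context.forward(record, "retry-sink");
            }
        }

        @Override
        public void close() {
            pool.shutdownNow();
        }

        // Hypothetical placeholder for your legacy API call.
        private String callExternalApi(final String payload) {
            throw new UnsupportedOperationException("call the external service here");
        }
    }

Note this still handles records one at a time (each can block up to the
timeout), so you would size max.poll.interval.ms to be comfortably above
max.poll.records times that timeout.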



Guozhang




On Tue, Jan 19, 2021 at 8:44 AM The Real Preacher <prea...@gmail.com> wrote:

> I'm new to Kafka and will be grateful for any advice. We are updating a
> legacy application and moving it from IBM MQ to something
> different.
>
>
> The application currently does the following:
>
>   * Reads batch XML messages (up to 5 MB)
>   * Parses them into something meaningful
>   * Processes the data, manually parallelizing the procedure for parts
>     of the batch; this involves some external legacy API calls that
>     result in DB changes
>   * Sends several kinds of email notifications
>   * Sends a reply to some other queue
>   * Profiles input messages to disk
>
>
> We are considering using Kafka with Kafka Streams as it is nice to
>
>   * Scale processing easily
>   * Have messages persistently stored out of the box
>   * Get built-in partitioning, replication, and fault tolerance
>   * Use Confluent Schema Registry to let us move to schema-on-write
>   * Reuse it for service-to-service communication with other
>     applications as well
>
>
> But I have some concerns.
>
>
> We are thinking about splitting those huge messages logically and
> putting them into Kafka that way since, as I understand it, Kafka is
> not a huge fan of big messages. It would also let us parallelize
> processing on a per-partition basis.
>
>
> After that we would use Kafka Streams for the actual processing, and
> further on for aggregating some batch responses back together using a
> state store, as well as for pushing some messages to other topics
> (e.g. for sending emails).
>
>
> But I wonder if it is a good idea to do the actual processing in
> Kafka Streams at all, as it involves some external API calls.
>
>
> Also, I'm not sure what the best way is to handle cases where this
> external API is down for any reason. That means a temporary failure
> for the current message and for all subsequent ones. Is there any way
> to stop Kafka Streams processing for some time? I can see that there
> are pause and resume methods on the Consumer API; can they be
> utilized somehow in Streams?
>
>
> Is it better to use a regular Kafka consumer here, and possibly add
> Streams as a next step to merge those batch messages back together?
> That sounds like an overcomplication.
>
>
> Is Kafka a good tool for these purposes at all?
>
> Cheers,
> TRP
>
>

-- 
-- Guozhang
