Re: Are Kafka and Kafka Streams right tools for my case?

2021-01-19 Thread Guozhang Wang
Hello,

I have observed several use cases similar to yours that are using Kafka /
Kafka Streams in production. That being said, your concerns are valid:

* Big messages: 5 MB is indeed large, but not extremely big for Kafka. If it
were a single message of hundreds of MBs or over a GB, it would be a
different handling story (and you may need to consider chunking it). You
would still need to make sure Kafka is configured correctly, with
message.max.bytes on the broker (max.message.bytes at the topic level),
etc., to handle such messages, and you may also need to tune your clients'
network sockets (buffer sizes, etc.) for optimal networking performance.
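For example, a minimal sketch of the producer-side settings for ~5 MB
messages (the values are placeholders; on the broker you would raise
message.max.bytes, or max.message.bytes per topic, and on the consumer
side max.partition.fetch.bytes may need a similar bump):

    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    // Allow requests larger than the 1 MB default, with some headroom.
    props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 6 * 1024 * 1024);
    // A larger socket send buffer can help with big messages.
    props.put(ProducerConfig.SEND_BUFFER_CONFIG, 1024 * 1024);
    // Big XML payloads usually compress well.
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");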

* External calls in Streams: if the external service may be unavailable,
then your implementation should have a timeout, after which it either drops
the record or retries it later (some implementations put it into a retry
queue, again stored as a Kafka topic, and then read from it later to
retry). Also note that Kafka Streams relies on the consumer to poll the
records, and if that `poll` call is not triggered in time because the
external API calls take too long, you'd need to configure
max.poll.interval.ms to be long enough. Another caveat I can think of is
that Kafka Streams at the moment does not have async-processing
capabilities, i.e. if a single record takes too long in an external call
(or simply a local IO call), it blocks all records after it. So if this
kind of processing bottleneck is a common case for you, you'd probably need
to consider writing a custom processor for async external calls yourself.
In the future we do have plans to support async processing in Kafka
Streams, though.
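
To make the retry-queue idea concrete, here is a rough sketch (the topic
names and the callExternalApi function are made-up placeholders, not your
actual setup):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class ExternalCallWithRetry {

        // Placeholder for the external service call; assume it throws on
        // failure or timeout.
        static String callExternalApi(String value) throws Exception {
            return value;
        }

        // Small holder so a failed call keeps the original payload for retry.
        static final class Attempt {
            final boolean ok;
            final String payload;
            Attempt(boolean ok, String payload) { this.ok = ok; this.payload = payload; }
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "external-call-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input = builder.stream("requests");

            // Attempt the external call; on failure keep the original value.
            KStream<String, Attempt> attempted = input.mapValues(value -> {
                try {
                    return new Attempt(true, callExternalApi(value));
                } catch (Exception e) {
                    return new Attempt(false, value);
                }
            });

            // Successes go downstream; failures go to a retry topic that can
            // be re-consumed later (e.g. by feeding it back into "requests").
            attempted.filter((k, a) -> a.ok).mapValues(a -> a.payload).to("responses");
            attempted.filter((k, a) -> !a.ok).mapValues(a -> a.payload).to("requests-retry");

            new KafkaStreams(builder.build(), props).start();
        }
    }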



Guozhang




On Tue, Jan 19, 2021 at 8:44 AM The Real Preacher  wrote:

> I'm new to Kafka and will be grateful for any advice. We are updating a
> legacy application and moving it from IBM MQ to something
> different.
>
>
> Application currently does the following:
>
>   * Reads batch XML messages (up to 5 MB)
>   * Parses them into something meaningful
>   * Processes the data, manually parallelizing parts of the batch
>     somehow; this involves some external legacy API calls that result
>     in DB changes
>   * Sends several kinds of email notifications
>   * Sends a reply to some other queue
>   * Profiles input messages to disk
>
>
> We are considering using Kafka with Kafka Streams, as it would be nice to
>
>   * Scale processing easily
>   * Have messages persistently stored out of the box
>   * Get built-in partitioning, replication, and fault tolerance
>   * Use the Confluent Schema Registry to move to schema-on-write
>   * Use it for service-to-service communication for other
>     applications as well
>
>
> But I have some concerns.
>
>
> We are thinking about splitting those huge messages logically and
> putting them into Kafka that way since, as I understand it, Kafka is
> not a huge fan of big messages. It would also let us parallelize
> processing on a per-partition basis.
>
>
> After that, we would use Kafka Streams for the actual processing and,
> further on, for aggregating some batch responses back together using a
> state store, as well as for pushing messages to other topics (e.g. for
> sending emails).
>
>
> But I wonder if it is a good idea to do the actual processing in Kafka
> Streams at all, as it involves external API calls?
>
>
> Also, I'm not sure what the best way is to handle the cases when this
> external API is down for any reason. That means a temporary failure for
> the current message and all subsequent ones. Is there any way to stop
> Kafka Streams processing for some time? I can see that there are pause
> and resume methods on the Consumer API; can they be utilized somehow in
> Streams?
>
>
> Is it better to use a regular Kafka consumer here, possibly adding
> Streams as a next step to merge those batch messages together? That
> sounds like an overcomplication.
>
>
> Is Kafka a good tool for these purposes at all?
>
> Cheers,
> TRP
>
>

-- 
-- Guozhang


Are Kafka and Kafka Streams right tools for my case?

2021-01-19 Thread The Real Preacher
I'm new to Kafka and will be grateful for any advice. We are updating a
legacy application and moving it from IBM MQ to something
different.



Application currently does the following:

 * Reads batch XML messages (up to 5 MB)
 * Parses them into something meaningful
 * Processes the data, manually parallelizing parts of the batch
   somehow; this involves some external legacy API calls that result
   in DB changes
 * Sends several kinds of email notifications
 * Sends a reply to some other queue
 * Profiles input messages to disk


We are considering using Kafka with Kafka Streams, as it would be nice to

 * Scale processing easily
 * Have messages persistently stored out of the box
 * Get built-in partitioning, replication, and fault tolerance
 * Use the Confluent Schema Registry to move to schema-on-write
 * Use it for service-to-service communication for other
   applications as well


But I have some concerns.


We are thinking about splitting those huge messages logically and
putting them into Kafka that way since, as I understand it, Kafka is
not a huge fan of big messages. It would also let us parallelize
processing on a per-partition basis.
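
Roughly like this sketch (the batch-parts topic, batchId, and splitBatch
are placeholders for our own logic):

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Split one big XML batch into per-part records; the key spreads the
    // parts across partitions, and headers let us stitch the batch back
    // together later.
    void publishBatch(String batchId, String batchXml) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            List<String> parts = splitBatch(batchXml); // our own splitter
            for (int i = 0; i < parts.size(); i++) {
                ProducerRecord<String, String> rec =
                        new ProducerRecord<>("batch-parts", batchId + "-" + i, parts.get(i));
                rec.headers().add("batchId", batchId.getBytes(StandardCharsets.UTF_8));
                rec.headers().add("partCount",
                        String.valueOf(parts.size()).getBytes(StandardCharsets.UTF_8));
                producer.send(rec);
            }
        }
    }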



After that, we would use Kafka Streams for the actual processing and,
further on, for aggregating some batch responses back together using a
state store, as well as for pushing messages to other topics (e.g. for
sending emails).
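
For the re-aggregation we picture something like this (it matches the
batchId-partIndex key convention from the sketch above; the completeness
check against the partCount header is omitted for brevity):

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> partResponses = builder.stream("batch-part-responses");

    // Re-key from "batchId-partIndex" back to plain batchId, then collect
    // the part responses in a state store until the batch is assembled.
    KTable<String, String> assembled = partResponses
            .groupBy((key, value) -> key.substring(0, key.lastIndexOf('-')))
            .aggregate(() -> "", (batchId, part, agg) -> agg + part);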



But I wonder if it is a good idea to do the actual processing in Kafka
Streams at all, as it involves external API calls?



Also, I'm not sure what the best way is to handle the cases when this
external API is down for any reason. That means a temporary failure for
the current message and all subsequent ones. Is there any way to stop
Kafka Streams processing for some time? I can see that there are pause
and resume methods on the Consumer API; can they be utilized somehow in
Streams?
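
With a plain consumer we could at least do something like this (a sketch;
running, consumer, externalApiIsDown, and process are placeholders):

    import java.time.Duration;
    import org.apache.kafka.clients.consumer.ConsumerRecords;

    // Pause fetching while the external API is down, but keep calling
    // poll() so the consumer stays alive in the group; paused partitions
    // simply return no records.
    while (running) {
        if (externalApiIsDown()) {
            consumer.pause(consumer.assignment());
        } else {
            consumer.resume(consumer.assignment());
        }
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        records.forEach(rec -> process(rec));
    }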



Is it better to use a regular Kafka consumer here, possibly adding
Streams as a next step to merge those batch messages together? That
sounds like an overcomplication.



Is Kafka a good tool for these purposes at all?

Cheers,
TRP



Re: does Kafka exactly-once guarantee also apply to kafka state stores?

2021-01-19 Thread Pushkar Deole
Is there also a way to avoid duplicates if the application consumes from a
kafka topic and writes the events to a database?
E.g. if the application restarts while processing a batch read from the
topic, with a few events already written to the database, those events are
consumed again by another instance and written to the database a second
time.

Could this behavior be avoided somehow without putting constraints on the
database table?

On Wed, Jan 6, 2021 at 11:18 PM Matthias J. Sax  wrote:

> Well, that is exactly what I mean by "it depends on the state store
> implementation".
>
> For this case, you cannot get exactly-once.
>
> There are actually ideas to improve the implementation to support the
> case you describe, but there is no timeline for this change yet. Not
> even sure if there is already a Jira ticket...
>
>
> -Matthias
>
> On 1/6/21 2:32 AM, Pushkar Deole wrote:
> > The question is: if we want to use a 3rd-party state store, e.g. Redis,
> > how can the store be kept consistent with the rest of the system, i.e.
> > the source and destination topics?
> >
> > E.g. a record is consumed from the source, processed, and the state store
> > is updated with some state, but there is a failure before writing to the
> > destination. Now, in this case, with a kafka state store, the stored
> > state will be wiped since the transaction failed.
> >
> > But with Redis, the state store has already been updated with the new
> > state and there is no way to revert it.
> >
> > On Tue, Jan 5, 2021 at 11:11 PM Matthias J. Sax  wrote:
> >
> >> It depends on the store implementation. Atm, EOS for state stores is
> >> achieved by re-creating the state store from the changelog topic in
> >> case of failure.
> >>
> >> For RocksDB stores, we wipe out the local state directories and create a
> >> new empty RocksDB (for in-memory stores the content is "lost" anyway
> >> when state is migrated), and we replay the changelog into the empty
> >> store before processing resumes.
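> >>
> >> For reference, this mode is enabled with a single Streams config
> >> (sketch):
> >>
> >>     props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
> >>               StreamsConfig.EXACTLY_ONCE);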
> >>
> >>
> >> -Matthias
> >>
> >> On 1/5/21 6:27 AM, Alex Craig wrote:
> >>> I don't think he's asking about data loss, but rather data consistency
> >>> (in the event of an exception or application crash, will EOS ensure
> >>> that the state store data is consistent). My understanding is that it
> >>> DOES apply to state stores as well, in the sense that a failure during
> >>> processing would mean that the commit wouldn't get flushed, and the
> >>> data therefore wouldn't get double-counted once processing resumes and
> >>> the message is re-processed.
> >>>
> >>> As far as using something other than RocksDB, I think as long as you
> >>> are implementing the state store API correctly you should be fine. I
> >>> did a POC recently using Mongo state stores with EOS enabled and it
> >>> worked correctly, even when I intentionally introduced failures and
> >>> crashes.
> >>>
> >>> -alex
> >>>
> >>> On Tue, Jan 5, 2021 at 1:09 AM Ning Zhang  wrote:
> >>>
> >>>> If there is a "change-log" topic to back up the state store, then it
> >>>> may not lose data.
> >>>>
> >>>> Also, if the third-party store is not "kafka community certified" (or
> >>>> not well-maintained), it may lose data (in different ways).
> >>>>
> >>>> On 2021/01/05 04:56:12, Pushkar Deole  wrote:
> >>>>> In case we opt to choose some third-party store instead of kafka's
> >>>>> stores for storing state (e.g. Redis cache or Ignite), then will we
> >>>>> lose the exactly-once guarantee provided by kafka, and can the state
> >>>>> stores be in an inconsistent state?
> >>>>>
> >>>>> On Sat, Jan 2, 2021 at 4:56 AM Ning Zhang  wrote:
> >>>>>> The physical store behind a "state store" is a change-log kafka
> >>>>>> topic. In Kafka Streams, if something fails in the middle, the
> >>>>>> "state store" is restored back to the state before the event
> >>>>>> happened, at the first step / beginning of the stream.
> >>>>>>
> >>>>>> On 2020/12/31 08:48:16, Pushkar Deole  wrote:
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>> We use Kafka Streams and may need to use the exactly-once
> >>>>>>> configuration for some of the use cases. Currently, the
> >>>>>>> application uses either a local or a global state store to store
> >>>>>>> state. So, the application will consume events from a source
> >>>>>>> kafka topic, process the events, use either a local or global
> >>>>>>> kafka state store for state, and then produce events onto the
> >>>>>>> destination topic.
> >>>>>>>
> >>>>>>> The question I have is: with the exactly-once setting, kafka
> >>>>>>> streams guarantees that all actions happen or nothing happens.
> >>>>>>> So, in this case, will any state stored in the local or global
> >>>>>>> state store also be counted under the 'all or nothing' guarantee?
> >>>>>>> E.g. if an event is consumed and the state store is updated, but
> >>>>>>> some issue occurs before the event is produced onto the
> >>>>>>> destination topic, then will 

Re: [EXTERNAL] SSL error while doing curl on kafka

2021-01-19 Thread Jose Manuel Vega Monroy
@Sachit

SEC_ERROR_UNTRUSTED_ISSUER --> a problem with the SSL certificate: the
issuer is untrusted.

So you would need to add the CA certificate that issued the broker's
certificate to the trust store used by curl (or point curl at it directly
with --cacert).

Depending on the OS, it could be in a different location.

But I am not sure what you are trying to do, and whether you are really
interested in a Kafka client connection rather than curl.
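
If it is the Kafka client connection you are after, here is a minimal
sketch of the client-side SSL settings (host, paths, and the password are
placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.CommonClientConfigs;
    import org.apache.kafka.common.config.SslConfigs;

    Properties props = new Properties();
    props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker-host:9093");
    props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
    // Trust store containing the CA that issued the broker certificate.
    props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/path/to/client.truststore.jks");
    props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");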

Thanks

 
 
Jose Manuel Vega Monroy
Java Developer / Software Developer Engineer in Test
Direct: +0035 0 2008038 (Ext. 8038)
Email: jose.mon...@williamhill.com
William Hill | 6/1 Waterport Place | Gibraltar | GX11 1AA




On 19/01/2021, 09:44, "Sachit Murarka"  wrote:

Hello All,

I am running curl against the host:port of kafka. It is throwing the
below error after applying SSL. Can you please check?

NSS error -8172 (SEC_ERROR_UNTRUSTED_ISSUER)
* Peer's certificate issuer has been marked as not trusted by the user.


Kind Regards,
Sachit Murarka




SSL error while doing curl on kafka

2021-01-19 Thread Sachit Murarka
Hello All,

I am running curl against the host:port of kafka. It is throwing the
below error after applying SSL. Can you please check?

NSS error -8172 (SEC_ERROR_UNTRUSTED_ISSUER)
* Peer's certificate issuer has been marked as not trusted by the user.


Kind Regards,
Sachit Murarka