Re: Kafka Scaling Ideas

2020-12-21 Thread Haruki Okada
About load test:
I think it'd be better to monitor per-message process latency and estimate
the required partition count from it, because that latency determines the max
throughput per single partition.
- Say you have to process 12 million messages/hour ≈ 3,333 messages/sec.
- If you have 7 partitions (thus 7 parallel consumers at maximum), a single
consumer must process 3,333 / 7 ≈ 476 messages/sec.
- That means process latency per single message must stay below 2.1
milliseconds (1000 / 476).
  => With 14 partitions, the budget becomes 4.2 milliseconds.

So the required partition count can be derived from per-message process
latency. (Spring-Kafka integrates easily with Prometheus, so you can use
that to measure it.)
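
As a rough sketch, the arithmetic above can be written out like this (Python
used purely for illustration; the figures are the ones from this thread):

```python
def per_consumer_target(msgs_per_hour: int, partitions: int) -> float:
    """Messages/sec each single-threaded consumer must sustain."""
    return msgs_per_hour / 3600 / partitions

def max_latency_ms(msgs_per_hour: int, partitions: int) -> float:
    """Per-message processing budget in milliseconds."""
    return 1000 / per_consumer_target(msgs_per_hour, partitions)

# 12M msgs/hour over 7 partitions: each consumer needs ~476 msg/s,
# so per-message latency must stay under ~2.1 ms; 14 partitions doubles it.
print(round(per_consumer_target(12_000_000, 7)))   # -> 476
print(round(max_latency_ms(12_000_000, 7), 1))     # -> 2.1
print(round(max_latency_ms(12_000_000, 14), 1))    # -> 4.2
```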

About increasing instance count:
- It depends on current system resource usage.
  * If system resources are not busy (likely, because the consumer mostly
just waits for DB writes to return), you don't need to increase consumer
instances.
  * But make sure a single consumer instance isn't assigned multiple
partitions, so that consumption is fully parallelized across partitions.
(If I remember correctly, ConcurrentMessageListenerContainer has a property
to configure the concurrency.)

On Mon, Dec 21, 2020 at 3:51 PM Yana K  wrote:

> So as the next step I see to increase the partition of the 2nd topic - do I
> increase the instances of the consumer from that or keep it at 7?
> Anything else (besides researching those libs)?
>
> Are there any good tools for load testing kafka?
>
> On Sun, Dec 20, 2020 at 7:23 PM Haruki Okada  wrote:
>
> > It depends on how you manually commit offsets.
> > Auto-commit commits offsets in an async manner, so as long as you do
> > manual commits in the same way, there should not be much difference.
> >
> > And, generally the offset-commit mode doesn't make much difference in
> > performance, regardless of manual/auto or async/sync, unless offset-commit
> > latency takes a significant share of processing time (e.g. you commit
> > offsets synchronously in every poll() loop).
> >
> > On Mon, Dec 21, 2020 at 11:08 AM Yana K  wrote:
> >
> > > Thank you so much Marina and Haruka.
> > >
> > > Marina's response:
> > > - When you say " if you are sure there is no room for perf optimization
> > of
> > > the processing itself :" - do you mean code level optimizations? Can
> you
> > > please explain?
> > > - On the second topic you say " I'd say at least 40" - is this based on
> > 12
> > > million records / hour?
> > > -  "if you can change the incoming topic" - I don't think it is
> possible
> > :(
> > > -  "you could artificially achieve the same by adding one more step
> > > (service) in your pipeline" - this is the next thing - but I want to be
> > > sure this will help, given we have to maintain one more layer
> > >
> > > Haruka's response:
> > > - "One possible solution is creating an intermediate topic" - I already
> > did
> > > it
> > > - I'll look at Decaton - thx
> > >
> > > Are there any thoughts on auto commit vs manual commit - whether it can
> > > improve performance while consuming?
> > >
> > > Yana
> > >
> > >
> > >
> > > On Sat, Dec 19, 2020 at 7:01 PM Haruki Okada 
> > wrote:
> > >
> > > > Hi.
> > > >
> > > > Yeah, Spring-Kafka processes messages sequentially, so the consumer
> > > > throughput would be capped by database latency per single process.
> > > > One possible solution is creating an intermediate topic (or altering
> > > source
> > > > topic) with much more partitions as Marina suggested.
> > > >
> > > > I'd like to suggest another solution, that is multi-threaded
> processing
> > > per
> > > > single partition.
> > > > Decaton (https://github.com/line/decaton) is a library to achieve
> it.
> > > >
> > > > Also, Confluent has published a blog post about parallel-consumer (
> > > > https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/
> > > > ) for that purpose, but it seems it's still in the BETA stage.
> > > >
> > > > On Sun, Dec 20, 2020 at 11:41 AM Marina Popova  wrote:
> > > >
> > > > > The way I see it - you can only do a few things - if you are sure
> > there
> > > > is
> > > > > no room for perf optimization of the processing itself :
> > > > > 1. speed up your processing per consumer thread: which you already
> > > tried
> > > > > by splitting your logic into a 2-step pipeline instead of 1-step,
> and
> > > > > delegating the work of writing to a DB to the second step ( make
> sure
> > > > your
> > > > > second intermediate Kafka topic is created with much more
> partitions
> > to
> > > > be
> > > > > able to parallelize your work much higher - I'd say at least 40)
> > > > > 2. if you can change the incoming topic - I would create it with
> many
> > > > more
> > > > > partitions as well - say at least 40 or so - to parallelize your
> > first
> > > > step
> > > > > service processing more
> > > > > 3. and if you can't increase partitions for the original topic - you
> > > > > could artificially achieve the same by adding one more step (service)
> > > > > in your pipeline

Re: Kafka Scaling Ideas

2020-12-21 Thread Joris Peeters
Do you know why your consumers are so slow? 12E6 msg/hour is ~3,333 msg/s,
which is not very high from a Kafka point of view. As you're doing database
inserts, I suspect that is where the bottleneck lies.

If, for example, you're doing a single-row insert in a SQL DB for every
message then this would incur a lot of overhead. Yes, you can somewhat
alleviate that by parallelising - i.e. increasing the partition count -
but it is also worth looking at batch inserts, if you aren't yet. Say, each
consumer waits for 1000 messages or 5 seconds to have passed (whichever
comes first) and then does a single bulk insert of the msgs it has
received, followed by a manual commit.

[A] you might already be doing this and [B] your DB of choice might not
support bulk inserts (although most do), but otherwise I'd expect this to
work a lot better than increasing the partition count.
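
A minimal sketch of that count-or-age flush trigger, independent of any
particular Kafka client (all names here are invented for illustration):

```python
import time

class BatchBuffer:
    """Flush when max_size messages accumulate or max_age_s seconds elapse,
    whichever comes first (the 1000-msgs / 5-seconds rule above)."""

    def __init__(self, max_size=1000, max_age_s=5.0, clock=time.monotonic):
        self.max_size, self.max_age_s, self.clock = max_size, max_age_s, clock
        self.items, self.started = [], None

    def add(self, msg):
        if self.started is None:          # age counts from the first message
            self.started = self.clock()
        self.items.append(msg)

    def should_flush(self):
        if not self.items:
            return False
        return (len(self.items) >= self.max_size
                or self.clock() - self.started >= self.max_age_s)

    def drain(self):
        batch, self.items, self.started = self.items, [], None
        return batch

# In the poll loop (sketch): buf.add(record) for each polled record;
# if buf.should_flush(): bulk_insert(buf.drain()); consumer.commit()
```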

On Mon, Dec 21, 2020 at 8:10 AM Haruki Okada  wrote:

> About load test:
> I think it'd be better to monitor per-message process latency and estimate
> required partition count based on it because it determines the max
> throughput per single partition.
> - Say you have to process 12 million messages/hour ≈ 3,333 messages/sec.
> - If you have 7 partitions (thus 7 parallel consumers at maximum), single
> consumer should process 3,333 / 7 ≈ 476 messages/sec
> - It means, process latency per single message should be lower than 2.1
> milliseconds (1000 / 476)
>   => If you have 14 partitions, it becomes 4.2 milliseconds
>
> So required partition count can be calculated by per-message process
> latency. (I think Spring-Kafka can be easily integrated with prometheus so
> you can use it to measure that)
>
> About increasing instance count:
> - It depends on current system resource usage.
>   * If the system resource is not so busy (likely because the consumer just
> almost waits DB-write to return), you don't need to increase consumer
> instances
>   * But I think you should make sure that single consumer instance isn't
> assigned multiple partitions to fully parallelize consumption across
> partitions. (If I remember correctly, ConcurrentMessageListenerContainer
> has a property to configure the concurrency)
>
> 2020年12月21日(月) 15:51 Yana K :
>
> > So as the next step I see to increase the partition of the 2nd topic -
> do I
> > increase the instances of the consumer from that or keep it at 7?
> > Anything else (besides researching those libs)?
> >
> > Are there any good tools for load testing kafka?
> >
> > On Sun, Dec 20, 2020 at 7:23 PM Haruki Okada 
> wrote:
> >
> > > It depends on how you manually commit offsets.
> > > Auto-commit does commits offsets in async manner basically, so as long
> as
> > > you do manual-commit in the same way,  there should be no much
> > difference.
> > >
> > > And, generally offset-commit mode doesn't make much difference in
> > > performance regardless manual/auto or async/sync unless offset-commit
> > > latency takes significant amount in processing time (e.g. you commit
> > > offsets synchronously in every poll() loop).
> > >
> > > 2020年12月21日(月) 11:08 Yana K :
> > >
> > > > Thank you so much Marina and Haruka.
> > > >
> > > > Marina's response:
> > > > - When you say " if you are sure there is no room for perf
> optimization
> > > of
> > > > the processing itself :" - do you mean code level optimizations? Can
> > you
> > > > please explain?
> > > > - On the second topic you say " I'd say at least 40" - is this based
> on
> > > 12
> > > > million records / hour?
> > > > -  "if you can change the incoming topic" - I don't think it is
> > possible
> > > :(
> > > > -  "you could artificially achieve the same by adding one more step
> > > > (service) in your pipeline" - this is the next thing - but I want to
> be
> > > > sure this will help, given we've to maintain one more layer
> > > >
> > > > Haruka's response:
> > > > - "One possible solution is creating an intermediate topic" - I
> already
> > > did
> > > > it
> > > > - I'll look at Decaton - thx
> > > >
> > > > Is there any thoughts on the auto commit vs manual commit - if it can
> > > > better the performance while consuming?
> > > >
> > > > Yana
> > > >
> > > >
> > > >
> > > > On Sat, Dec 19, 2020 at 7:01 PM Haruki Okada 
> > > wrote:
> > > >
> > > > > Hi.
> > > > >
> > > > > Yeah, Spring-Kafka does processing messages sequentially, so the
> > > consumer
> > > > > throughput would be capped by database latency per single process.
> > > > > One possible solution is creating an intermediate topic (or
> altering
> > > > source
> > > > > topic) with much more partitions as Marina suggested.
> > > > >
> > > > > I'd like to suggest another solution, that is multi-threaded
> > processing
> > > > per
> > > > > single partition.
> > > > > Decaton (https://github.com/line/decaton) is a library to achieve
> > it.
> > > > >
> > > > > Also confluent has published a blog post about parallel-consumer (
> > > > >
> > > > >
> > > >
> > >
> >
> https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/

Kafka in-sync replicas

2020-12-21 Thread Miroslav Tsvetanov
Hello everyone,

I'm running Kafka with 3 brokers, replication factor 3, and min.insync.replicas
set to 2.
If I set *acks=all* on the producer side, how many brokers should acknowledge
the record?

Thanks in advance.

Best regards,
Miroslav


Re: Kafka in-sync replicas

2020-12-21 Thread Tom Bentley
The leader sends the produce response (the acknowledgement) to the producer
once it has persisted the batch to its own log *and* it knows that one of the
followers has persisted the batch to its log as well.
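
As a simplified mental model of the acknowledgement count (a sketch only,
not the broker implementation; with acks=all the broker actually waits for
every replica currently in the ISR, and min.insync.replicas is the floor
below which the write is rejected outright):

```python
def guaranteed_persisted(acks, min_insync_replicas: int) -> int:
    """Brokers guaranteed to have persisted the record when the producer
    sees a success response, under this simplified model."""
    if acks == 0:
        return 0                    # fire and forget: no guarantee
    if acks == 1:
        return 1                    # leader only
    return min_insync_replicas      # 'all': leader + (min.insync - 1) followers

# RF=3, min.insync.replicas=2, acks=all -> 2 brokers: the leader
# plus at least one follower, as described above.
print(guaranteed_persisted("all", 2))  # -> 2
```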

On Mon, Dec 21, 2020 at 9:52 AM Miroslav Tsvetanov 
wrote:

> Hello everyone,
>
> I`m running Kafka with 3 brokers with replication factor 3 and in-sync
> replicas 2.
> If I set on producer side *acks=all* how many brokers should acknowledge
> the record?
>
> Thanks in advance.
>
> Best regards,
> Miroslav
>


Forwarding Kafka gc log to syslog servers

2020-12-21 Thread cool dharma06
Hi team,

I am trying to configure Kafka servers to forward their GC (garbage
collection) logs to syslog servers.

But I couldn't find an option in the syslog handler. I'd also rather not
run an external syslog service on the Kafka server just to ship these logs.

Going through the configuration I found the KAFKA_GC_LOG_OPTS option in
bin/kafka-run-class.sh, but no configuration or option there to forward the
GC logs to syslog servers.

Maybe I missed some configuration option. If there is any way to configure
Kafka to forward garbage collection logs to syslog servers, that would be
very helpful.

Any suggestions on this?

Thanks & Regards,
cooldharma06


[ANNOUNCE] Apache Kafka 2.7.0

2020-12-21 Thread Bill Bejeck
The Apache Kafka community is pleased to announce the release for Apache
Kafka 2.7.0

* Configurable TCP connection timeout and improved initial metadata fetch
* Enforce broker-wide and per-listener connection creation rate (KIP-612,
part 1)
* Throttle Create Topic, Create Partition and Delete Topic Operations
* Add TRACE-level end-to-end latency metrics to Streams
* Add Broker-side SCRAM Config API
* Support PEM format for SSL certificates and private key
* Add RocksDB Memory Consumption to RocksDB Metrics
* Add Sliding-Window support for Aggregations

This release also includes a few other features, 53 improvements, and 91
bug fixes.

All of the changes in this release can be found in the release notes:
https://www.apache.org/dist/kafka/2.7.0/RELEASE_NOTES.html

You can read about some of the more prominent changes in the Apache Kafka
blog:
https://blogs.apache.org/kafka/entry/what-s-new-in-apache4

You can download the source and binary release (Scala 2.12, 2.13) from:
https://kafka.apache.org/downloads#2.7.0

---


Apache Kafka is a distributed streaming platform with four core APIs:


** The Producer API allows an application to publish a stream of records to
one or more Kafka topics.

** The Consumer API allows an application to subscribe to one or more
topics and process the stream of records produced to them.

** The Streams API allows an application to act as a stream processor,
consuming an input stream from one or more topics and producing an
output stream to one or more output topics, effectively transforming the
input streams to output streams.

** The Connector API allows building and running reusable producers or
consumers that connect Kafka topics to existing applications or data
systems. For example, a connector to a relational database might
capture every change to a table.


With these APIs, Kafka can be used for two broad classes of application:

** Building real-time streaming data pipelines that reliably get data
between systems or applications.

** Building real-time streaming applications that transform or react
to the streams of data.


Apache Kafka is in use at large and small companies worldwide, including
Capital One, Goldman Sachs, ING, LinkedIn, Netflix, Pinterest, Rabobank,
Target, The New York Times, Uber, Yelp, and Zalando, among others.

A big thank you for the following 117 contributors to this release!

A. Sophie Blee-Goldman, Aakash Shah, Adam Bellemare, Adem Efe Gencer,
albert02lowis, Alex Diachenko, Andras Katona, Andre Araujo, Andrew Choi,
Andrew Egelhofer, Andy Coates, Ankit Kumar, Anna Povzner, Antony Stubbs,
Arjun Satish, Ashish Roy, Auston, Badai Aqrandista, Benoit Maggi, bill,
Bill Bejeck, Bob Barrett, Boyang Chen, Brian Byrne, Bruno Cadonna, Can
Cecen, Cheng Tan, Chia-Ping Tsai, Chris Egerton, Colin Patrick McCabe,
David Arthur, David Jacot, David Mao, Dhruvil Shah, Dima Reznik, Edoardo
Comar, Ego, Evelyn Bayes, feyman2016, Gal Margalit, gnkoshelev, Gokul
Srinivas, Gonzalo Muñoz, Greg Harris, Guozhang Wang, high.lee, huangyiming,
huxi, Igor Soarez, Ismael Juma, Ivan Yurchenko, Jason Gustafson, Jeff Kim,
jeff kim, Jesse Gorzinski, jiameixie, Jim Galasyn, JoelWee, John Roesler,
John Thomas, Jorge Esteban Quilcate Otoya, Julien Jean Paul Sirocchi,
Justine Olshan, khairy, Konstantine Karantasis, Kowshik Prakasam, leah, Lee
Dongjin, Leonard Ge, Levani Kokhreidze, Lucas Bradstreet, Lucent-Wong, Luke
Chen, Mandar Tillu, manijndl7, Manikumar Reddy, Mario Molina, Matthias J.
Sax, Micah Paul Ramos, Michael Bingham, Mickael Maison, Navina Ramesh,
Nikhil Bhatia, Nikolay, Nikolay Izhikov, Ning Zhang, Nitesh Mor, Noa
Resare, Rajini Sivaram, Raman Verma, Randall Hauch, Rens Groothuijsen,
Richard Fussenegger, Rob Meng, Rohan, Ron Dagostino, Sanjana Kaundinya,
Sasaki Toru, sbellapu, serjchebotarev, Shaik Zakir Hussain, Shailesh
Panwar, Sharath Bhat, showuon, Stanislav Kozlovski, Thorsten Hake, Tom
Bentley, tswstarplanet, vamossagar12, Vikas Singh, vinoth chandar, Vito
Jeng, voffcheg109, xakassi, Xavier Léauté, Yuriy Badalyantc, Zach Zhang

We welcome your help and feedback. For more information on how to
report problems, and to get involved, visit the project website at
https://kafka.apache.org/

Thank you!


Regards,
Bill Bejeck


Re: Kafka Scaling Ideas

2020-12-21 Thread Yana K
Thanks Haruki and Joris.

Haruki:
Thanks for the detailed calculations. Really appreciate it. What tool/lib
is used to load test kafka?
So we have one consumer group and are running 7 instances of the application -
that should be good enough, correct?

Joris:
Great point.
DB insert is a bottleneck (hence we moved it to its own layer) - and we are
batching, but wondering what the best way is to calculate the batch size.

Thanks,
Yana

On Mon, Dec 21, 2020 at 1:39 AM Joris Peeters 
wrote:

> Do you know why your consumers are so slow? 12E6 msg/hour is ~3,333 msg/s,
> which is not very high from a Kafka point-of-view. As you're doing database
> inserts, I suspect that is where the bottleneck lies.
>
> If, for example, you're doing a single-row insert in a SQL DB for every
> message then this would incur a lot of overhead. Yes, you can somewhat
> alleviate that by parallellising - i.e. increasing the partition count -
> but it is also worth looking at batch inserts, if you aren't yet. Say, each
> consumer waits for 1000 messages or 5 seconds to have passed (whichever
> comes first) and then does a single bulk insert of the msgs it has
> received, followed by a manual commit.
>
> [A] you might already be doing this and [B] your DB of choice might not
> support bulk inserts (although most do), but otherwise I'd expect this to
> work a lot better than increasing the partition count.
>
> On Mon, Dec 21, 2020 at 8:10 AM Haruki Okada  wrote:
>
> > About load test:
> > I think it'd be better to monitor per-message process latency and
> estimate
> > required partition count based on it because it determines the max
> > throughput per single partition.
> > - Say you have to process 12 million messages/hour ≈ 3,333 messages/sec.
> > - If you have 7 partitions (thus 7 parallel consumers at maximum), single
> > consumer should process 3,333 / 7 ≈ 476 messages/sec
> > - It means, process latency per single message should be lower than 2.1
> > milliseconds (1000 / 476)
> >   => If you have 14 partitions, it becomes 4.2 milliseconds
> >
> > So required partition count can be calculated by per-message process
> > latency. (I think Spring-Kafka can be easily integrated with prometheus
> so
> > you can use it to measure that)
> >
> > About increasing instance count:
> > - It depends on current system resource usage.
> >   * If the system resource is not so busy (likely because the consumer
> just
> > almost waits DB-write to return), you don't need to increase consumer
> > instances
> >   * But I think you should make sure that single consumer instance isn't
> > assigned multiple partitions to fully parallelize consumption across
> > partitions. (If I remember correctly, ConcurrentMessageListenerContainer
> > has a property to configure the concurrency)
> >
> > 2020年12月21日(月) 15:51 Yana K :
> >
> > > So as the next step I see to increase the partition of the 2nd topic -
> > do I
> > > increase the instances of the consumer from that or keep it at 7?
> > > Anything else (besides researching those libs)?
> > >
> > > Are there any good tools for load testing kafka?
> > >
> > > On Sun, Dec 20, 2020 at 7:23 PM Haruki Okada 
> > wrote:
> > >
> > > > It depends on how you manually commit offsets.
> > > > Auto-commit does commits offsets in async manner basically, so as
> long
> > as
> > > > you do manual-commit in the same way,  there should be no much
> > > difference.
> > > >
> > > > And, generally offset-commit mode doesn't make much difference in
> > > > performance regardless manual/auto or async/sync unless offset-commit
> > > > latency takes significant amount in processing time (e.g. you commit
> > > > offsets synchronously in every poll() loop).
> > > >
> > > > 2020年12月21日(月) 11:08 Yana K :
> > > >
> > > > > Thank you so much Marina and Haruka.
> > > > >
> > > > > Marina's response:
> > > > > - When you say " if you are sure there is no room for perf
> > optimization
> > > > of
> > > > > the processing itself :" - do you mean code level optimizations?
> Can
> > > you
> > > > > please explain?
> > > > > - On the second topic you say " I'd say at least 40" - is this
> based
> > on
> > > > 12
> > > > > million records / hour?
> > > > > -  "if you can change the incoming topic" - I don't think it is
> > > possible
> > > > :(
> > > > > -  "you could artificially achieve the same by adding one more step
> > > > > (service) in your pipeline" - this is the next thing - but I want
> to
> > be
> > > > > sure this will help, given we've to maintain one more layer
> > > > >
> > > > > Haruka's response:
> > > > > - "One possible solution is creating an intermediate topic" - I
> > already
> > > > did
> > > > > it
> > > > > - I'll look at Decaton - thx
> > > > >
> > > > > Is there any thoughts on the auto commit vs manual commit - if it
> can
> > > > > better the performance while consuming?
> > > > >
> > > > > Yana
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Dec 19, 2020 at 7:01 PM Haruki Okada 
> > > > wrote:
> > > > >
> > > > > > Hi.
> > > > >

Re: Kafka Scaling Ideas

2020-12-21 Thread Joris Peeters
I'd probably just do it by experiment for your concrete data.

Maybe generate a few million synthetic data rows, and for-each-batch insert
them into a dev DB, with an outer grid search over various candidate batch
sizes. You're looking to optimise for flat-out rows/s, so whichever batch
size wins (given a fixed number of total rows) is near-optimal.
You can repeat the exercise with N simultaneous threads to inspect how
batch sizes and multiple partitions P would interact (which might well be
sublinear in P in case of e.g. transactions etc).
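
That experiment could look something like this skeleton (hypothetical names;
`bulk_insert` stands in for your actual DB call):

```python
import time

def rows_per_sec(bulk_insert, rows, batch_size):
    """Insert all rows in chunks of batch_size; return throughput."""
    start = time.monotonic()
    for i in range(0, len(rows), batch_size):
        bulk_insert(rows[i:i + batch_size])
    elapsed = max(time.monotonic() - start, 1e-9)  # guard /0 on instant fakes
    return len(rows) / elapsed

def best_batch_size(bulk_insert, rows, candidates=(100, 500, 1000, 5000)):
    """Grid search: the batch size with the highest flat-out rows/s wins."""
    results = {b: rows_per_sec(bulk_insert, rows, b) for b in candidates}
    return max(results, key=results.get), results
```

Repeating the run with N simultaneous threads then shows how batch size and
partition count interact, per the note above.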

On Mon, Dec 21, 2020 at 4:48 PM Yana K  wrote:

> Thanks Haruki and Joris.
>
> Haruki:
> Thanks for the detailed calculations. Really appreciate it. What tool/lib
> is used to load test kafka?
> So we've one consumer group and running 7 instances of the application -
> that should be good enough - correct?
>
> Joris:
> Great point.
> DB insert is a bottleneck (and hence moved it to its own layer) - and we
> are batching but wondering what is the best way to calculate the batch
> size.
>
> Thanks,
> Yana
>
> On Mon, Dec 21, 2020 at 1:39 AM Joris Peeters 
> wrote:
>
> > Do you know why your consumers are so slow? 12E6 msg/hour is ~3,333 msg/s,
> > which is not very high from a Kafka point-of-view. As you're doing
> database
> > inserts, I suspect that is where the bottleneck lies.
> >
> > If, for example, you're doing a single-row insert in a SQL DB for every
> > message then this would incur a lot of overhead. Yes, you can somewhat
> > alleviate that by parallellising - i.e. increasing the partition count -
> > but it is also worth looking at batch inserts, if you aren't yet. Say,
> each
> > consumer waits for 1000 messages or 5 seconds to have passed (whichever
> > comes first) and then does a single bulk insert of the msgs it has
> > received, followed by a manual commit.
> >
> > [A] you might already be doing this and [B] your DB of choice might not
> > support bulk inserts (although most do), but otherwise I'd expect this to
> > work a lot better than increasing the partition count.
> >
> > On Mon, Dec 21, 2020 at 8:10 AM Haruki Okada 
> wrote:
> >
> > > About load test:
> > > I think it'd be better to monitor per-message process latency and
> > estimate
> > > required partition count based on it because it determines the max
> > > throughput per single partition.
> > > - Say you have to process 12 million messages/hour ≈ 3,333 messages/sec.
> > > - If you have 7 partitions (thus 7 parallel consumers at maximum), single
> > > consumer should process 3,333 / 7 ≈ 476 messages/sec
> > > - It means, process latency per single message should be lower than 2.1
> > > milliseconds (1000 / 476)
> > >   => If you have 14 partitions, it becomes 4.2 milliseconds
> > >
> > > So required partition count can be calculated by per-message process
> > > latency. (I think Spring-Kafka can be easily integrated with prometheus
> > so
> > > you can use it to measure that)
> > >
> > > About increasing instance count:
> > > - It depends on current system resource usage.
> > >   * If the system resource is not so busy (likely because the consumer
> > just
> > > almost waits DB-write to return), you don't need to increase consumer
> > > instances
> > >   * But I think you should make sure that single consumer instance
> isn't
> > > assigned multiple partitions to fully parallelize consumption across
> > > partitions. (If I remember correctly,
> ConcurrentMessageListenerContainer
> > > has a property to configure the concurrency)
> > >
> > > 2020年12月21日(月) 15:51 Yana K :
> > >
> > > > So as the next step I see to increase the partition of the 2nd topic
> -
> > > do I
> > > > increase the instances of the consumer from that or keep it at 7?
> > > > Anything else (besides researching those libs)?
> > > >
> > > > Are there any good tools for load testing kafka?
> > > >
> > > > On Sun, Dec 20, 2020 at 7:23 PM Haruki Okada 
> > > wrote:
> > > >
> > > > > It depends on how you manually commit offsets.
> > > > > Auto-commit does commits offsets in async manner basically, so as
> > long
> > > as
> > > > > you do manual-commit in the same way,  there should be no much
> > > > difference.
> > > > >
> > > > > And, generally offset-commit mode doesn't make much difference in
> > > > > performance regardless manual/auto or async/sync unless
> offset-commit
> > > > > latency takes significant amount in processing time (e.g. you
> commit
> > > > > offsets synchronously in every poll() loop).
> > > > >
> > > > > 2020年12月21日(月) 11:08 Yana K :
> > > > >
> > > > > > Thank you so much Marina and Haruka.
> > > > > >
> > > > > > Marina's response:
> > > > > > - When you say " if you are sure there is no room for perf
> > > optimization
> > > > > of
> > > > > > the processing itself :" - do you mean code level optimizations?
> > Can
> > > > you
> > > > > > please explain?
> > > > > > - On the second topic you say " I'd say at least 40" - is this
> > based
> > > on
> > > > > 12
> > > > > > million records / hour?
> >

Re: Kafka Scaling Ideas

2020-12-21 Thread Yana K
Thanks!

Also are there any producer optimizations anyone can think of in this
scenario?



On Mon, Dec 21, 2020 at 8:58 AM Joris Peeters 
wrote:

> I'd probably just do it by experiment for your concrete data.
>
> Maybe generate a few million synthetic data rows, and for-each-batch insert
> them into a dev DB, with an outer grid search over various candidate batch
> sizes. You're looking to optimise for flat-out rows/s, so whichever batch
> size wins (given a fixed number of total rows) is near-optimal.
> You can repeat the exercise with N simultaneous threads to inspect how
> batch sizes and multiple partitions P would interact (which might well be
> sublinear in P in case of e.g. transactions etc).
>
> On Mon, Dec 21, 2020 at 4:48 PM Yana K  wrote:
>
> > Thanks Haruki and Joris.
> >
> > Haruki:
> > Thanks for the detailed calculations. Really appreciate it. What tool/lib
> > is used to load test kafka?
> > So we've one consumer group and running 7 instances of the application -
> > that should be good enough - correct?
> >
> > Joris:
> > Great point.
> > DB insert is a bottleneck (and hence moved it to its own layer) - and we
> > are batching but wondering what is the best way to calculate the batch
> > size.
> >
> > Thanks,
> > Yana
> >
> > On Mon, Dec 21, 2020 at 1:39 AM Joris Peeters <
> joris.mg.peet...@gmail.com>
> > wrote:
> >
> > > Do you know why your consumers are so slow? 12E6 msg/hour is ~3,333 msg/s,
> > > which is not very high from a Kafka point-of-view. As you're doing
> > database
> > > inserts, I suspect that is where the bottleneck lies.
> > >
> > > If, for example, you're doing a single-row insert in a SQL DB for every
> > > message then this would incur a lot of overhead. Yes, you can somewhat
> > > alleviate that by parallellising - i.e. increasing the partition count
> -
> > > but it is also worth looking at batch inserts, if you aren't yet. Say,
> > each
> > > consumer waits for 1000 messages or 5 seconds to have passed (whichever
> > > comes first) and then does a single bulk insert of the msgs it has
> > > received, followed by a manual commit.
> > >
> > > [A] you might already be doing this and [B] your DB of choice might not
> > > support bulk inserts (although most do), but otherwise I'd expect this
> to
> > > work a lot better than increasing the partition count.
> > >
> > > On Mon, Dec 21, 2020 at 8:10 AM Haruki Okada 
> > wrote:
> > >
> > > > About load test:
> > > > I think it'd be better to monitor per-message process latency and
> > > estimate
> > > > required partition count based on it because it determines the max
> > > > throughput per single partition.
> > > > - Say you have to process 12 million messages/hour ≈ 3,333 messages/sec.
> > > > - If you have 7 partitions (thus 7 parallel consumers at maximum), single
> > > > consumer should process 3,333 / 7 ≈ 476 messages/sec
> > > > - It means, process latency per single message should be lower than
> 2.1
> > > > milliseconds (1000 / 476)
> > > >   => If you have 14 partitions, it becomes 4.2 milliseconds
> > > >
> > > > So required partition count can be calculated by per-message process
> > > > latency. (I think Spring-Kafka can be easily integrated with
> prometheus
> > > so
> > > > you can use it to measure that)
> > > >
> > > > About increasing instance count:
> > > > - It depends on current system resource usage.
> > > >   * If the system resource is not so busy (likely because the
> consumer
> > > just
> > > > almost waits DB-write to return), you don't need to increase consumer
> > > > instances
> > > >   * But I think you should make sure that single consumer instance
> > isn't
> > > > assigned multiple partitions to fully parallelize consumption across
> > > > partitions. (If I remember correctly,
> > ConcurrentMessageListenerContainer
> > > > has a property to configure the concurrency)
> > > >
> > > > 2020年12月21日(月) 15:51 Yana K :
> > > >
> > > > > So as the next step I see to increase the partition of the 2nd
> topic
> > -
> > > > do I
> > > > > increase the instances of the consumer from that or keep it at 7?
> > > > > Anything else (besides researching those libs)?
> > > > >
> > > > > Are there any good tools for load testing kafka?
> > > > >
> > > > > On Sun, Dec 20, 2020 at 7:23 PM Haruki Okada 
> > > > wrote:
> > > > >
> > > > > > It depends on how you manually commit offsets.
> > > > > > Auto-commit does commits offsets in async manner basically, so as
> > > long
> > > > as
> > > > > > you do manual-commit in the same way,  there should be no much
> > > > > difference.
> > > > > >
> > > > > > And, generally offset-commit mode doesn't make much difference in
> > > > > > performance regardless manual/auto or async/sync unless
> > offset-commit
> > > > > > latency takes significant amount in processing time (e.g. you
> > commit
> > > > > > offsets synchronously in every poll() loop).
> > > > > >
> > > > > > 2020年12月21日(月) 11:08 Yana K :
> > > > > >
> > > > > > > Thank you so much Marina and Haruka

--override option for bin/connect-distributed.sh

2020-12-21 Thread Aki Yoshida
Hi Kafka team,
I think the --override option of Kafka is very practical in starting
Kafka for various situations without changing the properties file. I
missed this feature in Kafka-Connect and I wanted to have it, so I
created a patch in this commit in my forked repo.
https://github.com/elakito/kafka/commit/1e54536598d1ce328d0aee10edb728270cc04af1

Could someone tell me whether this is a good idea or a bad one? If bad, is
there an alternative way to customise the properties? If good, can I
create a PR?
I would appreciate your suggestions.
Thanks.
regards, aki


Re: In Memory State Store

2020-12-21 Thread John Roesler
Hi Navneeth,

Yes, you are correct. I think there are some opportunities for improvement 
there, but there are also reasons for it to be serialized in the in-memory 
store. 

Off the top of my head, we need to serialize stored data anyway to send it to 
the changelog. Also, even though the store is in memory, there is still a cache 
(which helps reduce the update rate downstream); the cache stores serialized 
data because it’s the same cache covering all store types, and because it 
allows us to estimate and bound memory usage. Also, there are times when we 
need to determine if the current value is the same as the prior value; and 
comparing serialized forms is more reliable than using the equals() method. 
Round-tripping through serialization also helps to limit the scope of mutable 
object bugs, which I have seen in some user code. Finally, it simplifies the 
overall API and internals to have one kind of store to plug in, rather than to 
have some byte stores and some object stores.
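As a small stdlib-only illustration of the point about comparing serialized
forms (a sketch, not code from Streams itself): byte-wise comparison needs no
cooperation from the user's classes, whereas equals() does. The Payload class
here is hypothetical, standing in for a user value type that never overrides
equals().

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SerializedCompare {
    // Hypothetical user value type that (as often happens) never overrides
    // equals(), so comparison falls back to object identity.
    static final class Payload {
        final String body;
        Payload(String body) { this.body = body; }
        byte[] serialize() { return body.getBytes(StandardCharsets.UTF_8); }
    }

    public static void main(String[] args) {
        Payload prior = new Payload("v1");
        Payload current = new Payload("v1");
        // Identity comparison: reports "different" despite equal content.
        System.out.println(prior.equals(current));                                 // false
        // Byte-wise comparison of serialized forms: reports equal content.
        System.out.println(Arrays.equals(prior.serialize(), current.serialize())); // true
    }
}
```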

All that aside, I’ve always felt that it could be a significant performance 
advantage to do what you suggest. If you feel passionate about it and want to 
do some experiments, I’d be happy to provide some guidance. Just let me know!

Thanks,
John

On Sun, Dec 20, 2020, at 14:27, Navneeth Krishnan wrote:
> Hi All,
> 
> I have a question about the inMemoryKeyValue store. I was under the
> assumption that in-memory stores would not serialize the objects but when I
> looked into the implementation I see InMemoryKeyValueStore uses a
> NavigableMap of bytes which indicates the user objects have to be
> serialized and stored.
> 
> Am I missing something? Wouldn't this cause more serialization overhead for
> storing something in memory?
> 
> In my case I have a punctuator which reads all the entries in the state
> store and forwards the data. When there are around 10k entries it takes
> about 400ms to complete. I was trying to optimize this problem. I use kryo
> serde and the objects are bigger in size (approx 500 bytes).
> 
> Regards,
> Navneeth
>


Re: kafka-streams: interaction between max.poll.records and window expiration ?

2020-12-21 Thread John Roesler
Hi Mathieu,

I don’t think there would be any problem. Note that window expiry is computed
against an internal clock called “stream time”, which is the max timestamp yet
observed. This clock advances as each record is processed. There is a separate
clock for each partition, so the partitions will not affect each other.
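A stdlib-only sketch of the “stream time” rule described above (the real
implementation lives inside Streams; this just restates the max-timestamp
behavior, and the timestamps, window end, and grace values are made up):

```java
public class StreamTimeSketch {
    // Stream time only moves forward: it is the max timestamp yet observed.
    static long advance(long streamTime, long recordTs) {
        return Math.max(streamTime, recordTs);
    }

    public static void main(String[] args) {
        // Event timestamps as they might arrive in one partition's poll batch,
        // including an out-of-order record (15 arriving after 20).
        long[] eventTimestamps = {10, 5, 20, 15};
        long streamTime = Long.MIN_VALUE;
        for (long ts : eventTimestamps) {
            streamTime = advance(streamTime, ts);
        }
        System.out.println(streamTime); // prints 20: the late record did not move it back

        // A window is considered closed once stream time passes its end plus
        // the grace period (hypothetical values).
        long windowEndMs = 12, graceMs = 5;
        System.out.println(streamTime > windowEndMs + graceMs); // prints true
    }
}
```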

I hope this helps,
John

On Sun, Dec 20, 2020, at 08:22, Mathieu D wrote:
> Hello there,
> 
> One of our input topics does not have so much traffic.
> Divided by the number of partitions, and given the default 'max.poll.records'
> setting (1000, if I understand the doc correctly), it could happen that, when
> fetching 1000 records at once, the span of event timestamps between the first
> and last record in the "batch" is larger than some windows in my
> topology.
> 
> Could this have any impact on window expiration ?
> 
> Thanks
> Mathieu
>


Re: [ANNOUNCE] Apache Kafka 2.7.0

2020-12-21 Thread Gwen Shapira
woooh!!!

Great job on the release Bill and everyone!

On Mon, Dec 21, 2020 at 8:01 AM Bill Bejeck  wrote:
>
> The Apache Kafka community is pleased to announce the release for Apache
> Kafka 2.7.0
>
> * Configurable TCP connection timeout and improve the initial metadata fetch
> * Enforce broker-wide and per-listener connection creation rate (KIP-612,
> part 1)
> * Throttle Create Topic, Create Partition and Delete Topic Operations
> * Add TRACE-level end-to-end latency metrics to Streams
> * Add Broker-side SCRAM Config API
> * Support PEM format for SSL certificates and private key
> * Add RocksDB Memory Consumption to RocksDB Metrics
> * Add Sliding-Window support for Aggregations
>
> This release also includes a few other features, 53 improvements, and 91
> bug fixes.
>
> All of the changes in this release can be found in the release notes:
> https://www.apache.org/dist/kafka/2.7.0/RELEASE_NOTES.html
>
> You can read about some of the more prominent changes in the Apache Kafka
> blog:
> https://blogs.apache.org/kafka/entry/what-s-new-in-apache4
>
> You can download the source and binary release (Scala 2.12, 2.13) from:
> https://kafka.apache.org/downloads#2.7.0
>
> ---
>
>
> Apache Kafka is a distributed streaming platform with four core APIs:
>
>
> ** The Producer API allows an application to publish a stream of records to
> one or more Kafka topics.
>
> ** The Consumer API allows an application to subscribe to one or more
> topics and process the stream of records produced to them.
>
> ** The Streams API allows an application to act as a stream processor,
> consuming an input stream from one or more topics and producing an
> output stream to one or more output topics, effectively transforming the
> input streams to output streams.
>
> ** The Connector API allows building and running reusable producers or
> consumers that connect Kafka topics to existing applications or data
> systems. For example, a connector to a relational database might
> capture every change to a table.
>
>
> With these APIs, Kafka can be used for two broad classes of application:
>
> ** Building real-time streaming data pipelines that reliably get data
> between systems or applications.
>
> ** Building real-time streaming applications that transform or react
> to the streams of data.
>
>
> Apache Kafka is in use at large and small companies worldwide, including
> Capital One, Goldman Sachs, ING, LinkedIn, Netflix, Pinterest, Rabobank,
> Target, The New York Times, Uber, Yelp, and Zalando, among others.
>
> A big thank you for the following 117 contributors to this release!
>
> A. Sophie Blee-Goldman, Aakash Shah, Adam Bellemare, Adem Efe Gencer,
> albert02lowis, Alex Diachenko, Andras Katona, Andre Araujo, Andrew Choi,
> Andrew Egelhofer, Andy Coates, Ankit Kumar, Anna Povzner, Antony Stubbs,
> Arjun Satish, Ashish Roy, Auston, Badai Aqrandista, Benoit Maggi, bill,
> Bill Bejeck, Bob Barrett, Boyang Chen, Brian Byrne, Bruno Cadonna, Can
> Cecen, Cheng Tan, Chia-Ping Tsai, Chris Egerton, Colin Patrick McCabe,
> David Arthur, David Jacot, David Mao, Dhruvil Shah, Dima Reznik, Edoardo
> Comar, Ego, Evelyn Bayes, feyman2016, Gal Margalit, gnkoshelev, Gokul
> Srinivas, Gonzalo Muñoz, Greg Harris, Guozhang Wang, high.lee, huangyiming,
> huxi, Igor Soarez, Ismael Juma, Ivan Yurchenko, Jason Gustafson, Jeff Kim,
> jeff kim, Jesse Gorzinski, jiameixie, Jim Galasyn, JoelWee, John Roesler,
> John Thomas, Jorge Esteban Quilcate Otoya, Julien Jean Paul Sirocchi,
> Justine Olshan, khairy, Konstantine Karantasis, Kowshik Prakasam, leah, Lee
> Dongjin, Leonard Ge, Levani Kokhreidze, Lucas Bradstreet, Lucent-Wong, Luke
> Chen, Mandar Tillu, manijndl7, Manikumar Reddy, Mario Molina, Matthias J.
> Sax, Micah Paul Ramos, Michael Bingham, Mickael Maison, Navina Ramesh,
> Nikhil Bhatia, Nikolay, Nikolay Izhikov, Ning Zhang, Nitesh Mor, Noa
> Resare, Rajini Sivaram, Raman Verma, Randall Hauch, Rens Groothuijsen,
> Richard Fussenegger, Rob Meng, Rohan, Ron Dagostino, Sanjana Kaundinya,
> Sasaki Toru, sbellapu, serjchebotarev, Shaik Zakir Hussain, Shailesh
> Panwar, Sharath Bhat, showuon, Stanislav Kozlovski, Thorsten Hake, Tom
> Bentley, tswstarplanet, vamossagar12, Vikas Singh, vinoth chandar, Vito
> Jeng, voffcheg109, xakassi, Xavier Léauté, Yuriy Badalyantc, Zach Zhang
>
> We welcome your help and feedback. For more information on how to
> report problems, and to get involved, visit the project website at
> https://kafka.apache.org/
>
> Thank you!
>
>
> Regards,
> Bill Bejeck



-- 
Gwen Shapira
Engineering Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog


Re: [ANNOUNCE] Apache Kafka 2.7.0

2020-12-21 Thread Randall Hauch
Fantastic! Thanks for driving the release, Bill.

Congratulations to the whole Kafka community.

On Mon, Dec 21, 2020 at 5:55 PM Gwen Shapira  wrote:

> woooh!!!
>
> Great job on the release Bill and everyone!
>

Re: Kafka Scaling Ideas

2020-12-21 Thread Haruki Okada
About the "first layer", right?
Then it's better to make sure you don't get() the result of Producer#send()
for each message, because doing so defeats producer batching.
The Kafka producer batches messages by default and is very efficient, so if
you produce in an async way, it rarely becomes a bottleneck in general.
> Also are there any producer optimizations

By the way, if the "first layer" just filters and then produces messages
without interacting with any external DB, using KafkaStreams should be much
easier.
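Back-of-the-envelope arithmetic shows why blocking on each send hurts:
send(record).get() caps throughput at one broker round trip per record, while
async sends let the producer amortize one round trip over a whole batch. A
stdlib-only sketch (the round-trip time and batch size are hypothetical
numbers, not measurements):

```java
public class ProducerBatchingMath {
    // Records per second when one broker round trip carries the given
    // number of records.
    static double msgsPerSec(double rttMs, int recordsPerRoundTrip) {
        return (1000.0 / rttMs) * recordsPerRoundTrip;
    }

    public static void main(String[] args) {
        double rttMs = 5.0;        // hypothetical broker round-trip time
        int recordsPerBatch = 500; // hypothetical producer batch size

        // send(record).get() per record: one round trip per record.
        System.out.println((long) msgsPerSec(rttMs, 1));               // prints 200
        // Async send with batching: one round trip amortized over a batch.
        System.out.println((long) msgsPerSec(rttMs, recordsPerBatch)); // prints 100000
    }
}
```

This is why leaving the producer's default batching (batch.size, linger.ms)
in play and not blocking per message is usually enough on the produce side.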

On Tue, Dec 22, 2020 at 3:27 Yana K  wrote:

> Thanks!
>
> Also are there any producer optimizations anyone can think of in this
> scenario?
>
>
>
> On Mon, Dec 21, 2020 at 8:58 AM Joris Peeters 
> wrote:
>
> > I'd probably just do it by experiment for your concrete data.
> >
> > Maybe generate a few million synthetic data rows, and for-each-batch
> insert
> > them into a dev DB, with an outer grid search over various candidate
> batch
> > sizes. You're looking to optimise for flat-out rows/s, so whichever batch
> > size wins (given a fixed number of total rows) is near-optimal.
> > You can repeat the exercise with N simultaneous threads to inspect how
> > batch sizes and multiple partitions P would interact (which might well be
> > sublinear in P in case of e.g. transactions etc).
> >
> > On Mon, Dec 21, 2020 at 4:48 PM Yana K  wrote:
> >
> > > Thanks Haruki and Joris.
> > >
> > > Haruki:
> > > Thanks for the detailed calculations. Really appreciate it. What
> tool/lib
> > > is used to load test kafka?
> > > So we've one consumer group and running 7 instances of the application
> -
> > > that should be good enough - correct?
> > >
> > > Joris:
> > > Great point.
> > > DB insert is a bottleneck (and hence moved it to its own layer) - and
> we
> > > are batching but wondering what is the best way to calculate the batch
> > > size.
> > >
> > > Thanks,
> > > Yana
> > >
> > > On Mon, Dec 21, 2020 at 1:39 AM Joris Peeters <
> > joris.mg.peet...@gmail.com>
> > > wrote:
> > >
> > > > Do you know why your consumers are so slow? 12E6msg/hour is
> msg/s,
> > > > which is not very high from a Kafka point-of-view. As you're doing
> > > database
> > > > inserts, I suspect that is where the bottleneck lies.
> > > >
> > > > If, for example, you're doing a single-row insert in a SQL DB for
> every
> > > > message then this would incur a lot of overhead. Yes, you can
> somewhat
> > > > > alleviate that by parallelising - i.e. increasing the partition
> count
> > -
> > > > but it is also worth looking at batch inserts, if you aren't yet.
> Say,
> > > each
> > > > consumer waits for 1000 messages or 5 seconds to have passed
> (whichever
> > > > comes first) and then does a single bulk insert of the msgs it has
> > > > received, followed by a manual commit.
> > > >
> > > > [A] you might already be doing this and [B] your DB of choice might
> not
> > > > support bulk inserts (although most do), but otherwise I'd expect
> this
> > to
> > > > work a lot better than increasing the partition count.
> > > >

Re: [ANNOUNCE] Apache Kafka 2.7.0

2020-12-21 Thread Guozhang Wang
Thank you Bill !

Congratulations to the community.

On Mon, Dec 21, 2020 at 4:08 PM Randall Hauch  wrote:

> Fantastic! Thanks for driving the release, Bill.
>
> Congratulations to the whole Kafka community.
>
> On Mon, Dec 21, 2020 at 5:55 PM Gwen Shapira  wrote:
>
> > woooh!!!
> >
> > Great job on the release Bill and everyone!
> >

Re: Kafka Scaling Ideas

2020-12-21 Thread Yana K
I thought about it, but we don't have much time - will it improve
performance?

On Mon, Dec 21, 2020 at 4:16 PM Haruki Okada  wrote:

> About the "first layer", right?
> Then it's better to make sure you don't get() the result of Producer#send()
> for each message, because doing so defeats producer batching.
> The Kafka producer batches messages by default and is very efficient, so if
> you produce in an async way, it rarely becomes a bottleneck in general.
> > Also are there any producer optimizations
>
> By the way, if the "first layer" just filters and then produces messages
> without interacting with any external DB, using KafkaStreams should be much
> easier.
>
> On Tue, Dec 22, 2020 at 3:27 Yana K  wrote:
>
> > Thanks!
> >
> > Also are there any producer optimizations anyone can think of in this
> > scenario?
> >
> >
> >

Producer closed while allocating memory error

2020-12-21 Thread Dhirendra Singh
I am getting the following error in the Kafka producer:

org.apache.kafka.common.KafkaException: Producer closed while allocating memory
    at org.apache.kafka.clients.producer.internals.BufferPool.allocate(BufferPool.java:151) ~[kafka-clients-2.5.0.jar:?]
    at org.apache.kafka.clients.producer.internals.RecordAccumulator.append(RecordAccumulator.java:221) ~[kafka-clients-2.5.0.jar:?]
    at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:941) ~[kafka-clients-2.5.0.jar:?]
    at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:862) ~[kafka-clients-2.5.0.jar:?]


What could be the reason for this error message?

The Kafka version I am using is 2.5.0.