Thanks for your response, Brett. I was able to do something similar to resolve the issue, but I did not upgrade the cluster. I got lucky and did not run into the edge cases that exist in 0.9.
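For anyone wanting to do the same: the reassignment step Brett describes below is the standard kafka-reassign-partitions.sh dance. This is only a minimal sketch -- the broker ids (4 and 5), the partition number, the file name, and the zookeeper string are placeholders, so adjust them for your cluster.

move-offsets-33.json:

{
  "version": 1,
  "partitions": [
    { "topic": "__consumer_offsets", "partition": 33, "replicas": [4, 5] }
  ]
}

kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file move-offsets-33.json --execute

kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file move-offsets-33.json --verify

Once the moved replicas have compacted, another --execute with the original replica list in the JSON puts the partition back, as Brett describes.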
On Wed, Jan 17, 2018 at 5:16 PM, Brett Rann <br...@zendesk.com.invalid> wrote:

> There are several bugs in 0.9 around consumer offsets and compaction and
> log cleaning.
>
> The easiest path forward is to upgrade to the latest 0.11.x. We ended up
> going to somewhat extreme lengths to deal with 100GB+ consumer offsets.
>
> When we tested an upgrade we noticed that when it started compacting that
> down (in the order of 45 minutes) it correlated with consumer group
> coordinator errors, effectively causing consumers to be unable to consume
> until the compacting finished. I don't know if that is still a bug.
>
> We got around this by adding a couple of new upgraded brokers to the
> cluster and doing partition reassignments to them (as replicas) for the
> affected __consumer_offsets partitions, waiting for them to successfully
> compact, and then doing a cutover which made the two replicated ones the
> only copies, which had the "live" brokers delete their large bugged ones,
> then cutting back to the original assignments. It was also necessary to
> then do some further partition reassignments to work around a bug where
> the log cleaner thread died -- the move triggered a segment clean,
> bypassing the issue causing the log cleaner thread to die. That final one
> was reported as a bug but it hasn't been looked at. The workaround is easy.
>
> Once we had all the offsets small again (generally < 100/150MB),
> proceeding with the normal upgrade was painless.
>
> It took about 3 days to upgrade one of our clusters using this method,
> but it resulted in no consumer down time, which was all that mattered :)
>
> On Wed, Jan 17, 2018 at 10:07 AM, Shravan R <skr...@gmail.com> wrote:
>
> > BTW, I see log segments as old as last year and offsets.retention.minutes
> > is set to 4 days. Any reason why they may have not been deleted?
> >
> > -rw-r--r-- 1 kafka kafka 104857532 Apr 5 2017 00000000000000000000.log
> > -rw-r--r-- 1 kafka kafka 104857564 Apr 6 2017 00000000000001219197.log
> > -rw-r--r-- 1 kafka kafka 104856962 Apr 6 2017 00000000000002438471.log
> > -rw-r--r-- 1 kafka kafka 104857392 Apr 6 2017 00000000000003657738.log
> > -rw-r--r-- 1 kafka kafka 104857564 Apr 6 2017 00000000000004877010.log
> > -rw-r--r-- 1 kafka kafka 104857392 Apr 7 2017 00000000000006096284.log
> > -rw-r--r-- 1 kafka kafka 104857478 Apr 7 2017 00000000000007315556.log
> > -rw-r--r-- 1 kafka kafka 104857306 Apr 7 2017 00000000000008534829.log
> > -rw-r--r-- 1 kafka kafka 104857134 Apr 7 2017 00000000000009754100.log
> > -rw-r--r-- 1 kafka kafka 104857564 Apr 7 2017 00000000000010973369.log
> > -rw-r--r-- 1 kafka kafka 104857564 Apr 7 2017 00000000000012192643.log
> > -rw-r--r-- 1 kafka kafka 104857578 Apr 7 2017 00000000000013411917.log
> >
> > On Tue, Jan 16, 2018 at 1:04 PM, Shravan R <skr...@gmail.com> wrote:
> >
> > > I looked into it. I played with log.cleaner.dedupe.buffer.size between
> > > 256MB and 2GB while keeping log.cleaner.threads=1, but that did not
> > > help. It let me recover from __consumer_offsets-33, but I then hit a
> > > similar exception on another partition. There are no lags on our system
> > > and that is not a concern at this time. Is there any workaround or
> > > tuning that I can do?
> > >
> > > Thanks,
> > > SK
> > >
> > > On Tue, Jan 16, 2018 at 10:56 AM, naresh Goud
> > > <nareshgoud.du...@gmail.com> wrote:
> > >
> > >> Can you check if jira KAFKA-3894 helps?
> > >>
> > >> Thank you,
> > >> Naresh
> > >>
> > >> On Tue, Jan 16, 2018 at 10:28 AM Shravan R <skr...@gmail.com> wrote:
> > >>
> > >> > We are running Kafka 0.9 and I am seeing large __consumer_offsets on
> > >> > some of the partitions, of the order of 100GB or more. I see some of
> > >> > the log and index files are more than a year old. I see the following
> > >> > properties that are of interest.
> > >> >
> > >> > offsets.retention.minutes=5769 (4 Days)
> > >> > log.cleaner.dedupe.buffer.size=256000000 (256MB)
> > >> > num.recovery.threads.per.data.dir=4
> > >> > log.cleaner.enable=true
> > >> > log.cleaner.threads=1
> > >> >
> > >> > Upon restarting the broker, I see the below exception, which clearly
> > >> > indicates a problem with the dedupe buffer size. However, the dedupe
> > >> > buffer size is set to 256MB, which is far more than what the log
> > >> > complains about (37MB). What could be the problem here? How can I get
> > >> > the offsets topic down to a manageable size?
> > >> >
> > >> > 2018-01-15 21:26:51,434 ERROR kafka.log.LogCleaner:
> > >> > [kafka-log-cleaner-thread-0], Error due to
> > >> > java.lang.IllegalArgumentException: requirement failed: 990238234
> > >> > messages in segment __consumer_offsets-33/00000000000000000000.log
> > >> > but offset map can fit only 37499999. You can increase
> > >> > log.cleaner.dedupe.buffer.size or decrease log.cleaner.threads
> > >> >         at scala.Predef$.require(Predef.scala:219)
> > >> >         at kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:591)
> > >> >         at kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:587)
> > >> >         at scala.collection.immutable.Stream$StreamWithFilter.foreach(Stream.scala:570)
> > >> >         at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:587)
> > >> >         at kafka.log.Cleaner.clean(LogCleaner.scala:329)
> > >> >         at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:237)
> > >> >         at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:215)
> > >> >         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> > >> > 2018-01-15 21:26:51,436 INFO kafka.log.LogCleaner:
> > >> > [kafka-log-cleaner-thread-0], Stopped
> > >> >
> > >> > Thanks,
> > >> > -SK
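One more note in case someone finds this thread via the LogCleaner error quoted above: the 37499999 is a count of messages the dedupe map can hold, not megabytes. If I'm reading the 0.9 cleaner right (treat this as a back-of-the-envelope sketch, not a definitive reading of the code), each map entry costs 24 bytes (a 16-byte MD5 of the key plus an 8-byte offset), only log.cleaner.io.buffer.load.factor (0.9 by default) of the buffer is usable, and the buffer is shared across log.cleaner.threads. Under those assumptions 37,499,999 lines up with roughly a 1GB buffer and one thread, and even 2GB only buys ~75M keys -- nowhere near the 990,238,234 messages in that single segment, which is why buffer tuning alone never cleans it.

object OffsetMapCapacity {
  // Assumed constants, per my reading of the 0.9 SkimpyOffsetMap:
  // 16-byte MD5 digest + 8-byte offset per entry, 0.9 default load factor.
  val bytesPerEntry = 24
  val loadFactor = 0.9

  // Approximate number of unique keys one cleaner thread's map can hold.
  def capacity(dedupeBufferBytes: Long, cleanerThreads: Int = 1): Long =
    (dedupeBufferBytes / cleanerThreads * loadFactor / bytesPerEntry).toLong

  def main(args: Array[String]): Unit = {
    println(capacity(1000000000L))  // ~37,500,000 -- the figure in the error above
    println(capacity(256000000L))   // ~9,600,000  -- the configured 256MB
    println(capacity(2000000000L))  // ~75,000,000 -- still far short of 990,238,234
  }
}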