Re: Frequent "offset out of range" messages, partitions deserted by consumer

2018-06-14 Thread Lawrence Weikum
unsubscribe 



Consumer offsets resetting

2017-04-12 Thread Lawrence Weikum
We’re seeing a strange error on our Kafka cluster. Users have reported that their 
consumer offsets seem to reset periodically, which forces downstream processes to 
handle a massive number of duplicates.

Our Kafka cluster is currently running 0.10.1.1. Previously the cluster had 
been at 0.10.0.1. It was upgraded to 0.10.1.0, but we ran into multiple 
deadlocks, so we downgraded back to 0.10.0.1. We then upgraded from 0.10.0.1 to 0.10.1.1.

This is the error we see. It only comes up every few weeks and only on some 
topics.

[2017-04-12 08:41:14,313] ERROR [ReplicaFetcherThread-1-8], Error for partition 
[email_priority,10] to broker 
8:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)

Has anyone seen this behavior before? Is there any way to prevent this from 
happening?
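
One small diagnostic sketch, assuming the new consumer API (the broker address and 
group name below are illustrative, not from this thread): watching the group's 
committed offsets over time makes a reset easy to spot, and the 
offsets.retention.minutes broker setting is worth checking, since committed offsets 
older than that window can be removed, after which consumers fall back to their 
auto.offset.reset policy.

# Inspect committed offsets and lag for a group (host/group names illustrative)
kafka-consumer-groups.sh --new-consumer --bootstrap-server broker1:9092 \
  --describe --group my-consumer-group

# Broker-side retention for committed offsets (server.properties), default 1440
# offsets.retention.minutes=1440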

Thank you all!


Lawrence Weikum | Sr. Software Engineer | Pandora
1426 Pearl Street, Suite 100, Boulder CO 80302
m 720.203.1578 | lwei...@pandora.com



Re: Deadlock using latest 0.10.1 Kafka release

2016-11-03 Thread Lawrence Weikum
We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1.  
We’re now running on 0.10.1.0, and the FD increase we see there is due to a 
deadlock, not to new functionality or features.

Lawrence Weikum | Software Engineer | Pandora
1426 Pearl Street, Suite 100, Boulder CO 80302
m 720.203.1578 | lwei...@pandora.com

On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:

The 0.10.1 broker will use more file descriptors than previous releases
because of the new timestamp indexes. You should expect and plan for ~33%
more file descriptors to be open.

-hans

/**
 * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
 * h...@confluent.io (650)924-2670
 */
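
A minimal sketch for planning around that increase (the process lookup and the 
limit value are illustrative, not from this thread):

# Count file descriptors the broker currently holds open
lsof -p "$(pgrep -f kafka.Kafka)" | wc -l

# Raise the per-process limit in the broker's start-up environment if needed
ulimit -n 128000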

On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com> wrote:

> We're running into a recurrent deadlock issue in both our production and
> staging clusters, both using the latest 0.10.1 release.  The symptom we
> noticed was that, on servers where kafka producer connections are short
> lived, every other day or so we'd see file descriptors being exhausted until
> either the broker is restarted or it runs out of file descriptors and goes
> down.  None of the clients are on 0.10.1 kafka jars; they're all using
> previous versions.
>
> When diagnosing the issue, we found that when the system is in that state,
> using up file descriptors at a really fast rate, the JVM is actually in a
> deadlock.  Did a thread dump from both jstack and visualvm, and attached
> those to this email.
>
> This is the interesting bit from the jstack thread dump:
>
>
> Found one Java-level deadlock:
> =
> "executor-Heartbeat":
>   waiting to lock monitor 0x016c8138 (object 0x00062732a398, a
> kafka.coordinator.GroupMetadata),
>   which is held by "group-metadata-manager-0"
>
> "group-metadata-manager-0":
>   waiting to lock monitor 0x011ddaa8 (object 0x00063f1b0cc0, a
> java.util.LinkedList),
>   which is held by "kafka-request-handler-3"
>
> "kafka-request-handler-3":
>   waiting to lock monitor 0x016c8138 (object 0x00062732a398, a
> kafka.coordinator.GroupMetadata),
>   which is held by "group-metadata-manager-0"
>
>
> I also noticed the background heartbeat thread (I'm guessing the one
> called "executor-Heartbeat" above) is new for this release, under
> KAFKA-3888 ticket - https://issues.apache.org/jira/browse/KAFKA-3888
>
> We haven't noticed this problem with earlier Kafka broker versions, so I'm
> guessing maybe this new background heartbeat thread is what introduced the
> deadlock problem.
>
> That same broker is still in the deadlocked state; we haven't restarted it,
> so let me know if you'd like more info/logs/stats from the system before
> we restart it.
>
> Thanks,
>
> Marcos Juarez
>
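
For reference, a sketch of how such a dump might be captured (the process lookup 
and output path are illustrative):

# Thread dump with lock information, the kind used to spot the deadlock above
jstack -l "$(pgrep -f kafka.Kafka)" > /tmp/kafka-threads.txt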




Schema Registry in Staging Environment

2016-09-26 Thread Lawrence Weikum
Hello,

Has anyone used Confluent’s Schema Registry?  If so, I’m curious to hear about 
best practices for using it in a staging environment.

Do users typically copy schemas over to the staging environment from 
production?  Are developers allowed to create new schemas in the staging 
environment?

Thanks!

Lawrence Weikum



Flickering Kafka Topic

2016-09-09 Thread Lawrence Weikum
Hello everyone!

We seem to be experiencing some odd behavior in Kafka and were wondering if 
anyone has come across the same issue and if you’ve been able to fix it.  
Here’s the setup:

8 brokers in the cluster.  Kafka 0.10.0.0.

One topic, and only one topic on this cluster, is having issues where ISRs 
continuously shrink and expand but never stabilize.  This starts once roughly 
50,000 messages per second come in, and the problem is exacerbated when the 
rate increases to 110,000 messages per second.  Messages are small; total 
inbound is only about 50 MB/s.

There are no errors in the logs.  We just get countless messages like these:

[2016-09-09 12:54:07,147] INFO Partition [topic_a,11] on broker 4: Expanding 
ISR for partition [topic_a,11] from 4 to 4,2 (kafka.cluster.Partition)
[2016-09-09 12:54:23,070] INFO Partition [topic_a,11] on broker 4: Shrinking 
ISR for partition [topic_a,11] from 4,2 to 4 (kafka.cluster.Partition)

This topic has transient data that is unimportant after 20 minutes, so losing 
some due to a cluster shutdown isn’t that important, and we also don’t mind if 
messages are occasionally dropped.  With this in mind we have these settings:
Partitions = 16
Producer ACKs = 1
Replication factor = 2
min.insync.replicas = 1
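
For reference, a sketch of how a topic with these settings might be created 
(ZooKeeper address and topic name are illustrative); the acks setting lives on 
the producer side:

# Create the topic with 16 partitions and replication factor 2
kafka-topics.sh --zookeeper zk1:2181 --create --topic topic_a \
  --partitions 16 --replication-factor 2 \
  --config min.insync.replicas=1

# Producer-side config
acks=1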

CPU is sitting fairly idle at ~18%, and a thread dump and profile showed that 
most threads are sitting idle as well – very little contention if any.

We tried increasing the number of partitions from 16 to 24, but that only seems 
to have increased CPU usage (from 18% to 23%) and the number of under-replicated 
partitions.

Any advice or insight is appreciated. Thank you all!

Lawrence



0.10 Metrics Reporter Suggestions

2016-08-22 Thread Lawrence Weikum
Hello,

We’re looking into the functionality of Metrics Reporters for producers and 
consumers in Kafka 0.10.  Are there any promising projects you can recommend, 
specifically for sending metrics to either StatsD or Graphite?

As always, thank you for your help!

Lawrence Weikum



Re: Consumer Offsets and Open FDs

2016-07-18 Thread Lawrence Weikum
It seems that the log-cleaner is still failing no matter what settings I give 
it.  

Here is the full output from one of our brokers:
[2016-07-18 13:00:40,726] ERROR [kafka-log-cleaner-thread-0], Error due to  
(kafka.log.LogCleaner)
java.lang.IllegalArgumentException: requirement failed: 192053210 messages in 
segment __consumer_offsets-15/.log but offset map can fit 
only 7499. You can increase log.cleaner.dedupe.buffer.size or decrease 
log.cleaner.threads
at scala.Predef$.require(Predef.scala:219)
at 
kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:584)
at 
kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:580)
at 
scala.collection.immutable.Stream$StreamWithFilter.foreach(Stream.scala:570)
at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:580)
at kafka.log.Cleaner.clean(LogCleaner.scala:322)
at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:230)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:208)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2016-07-18 13:00:40,732] INFO [kafka-log-cleaner-thread-0], Stopped  
(kafka.log.LogCleaner)

Currently, I have heap allocation up to 64GB, only one log-cleaning thread is 
set to run, and log.cleaner.dedupe.buffer.size is 2GB.  I get this error if I 
try to increase it any further:

WARN [kafka-log-cleaner-thread-0], Cannot use more than 2G of cleaner buffer 
space per cleaner thread, ignoring excess buffer space... (kafka.log.LogCleaner)
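
For reference, a hedged sketch of the broker-side cleaner settings in play here 
(server.properties; the values are illustrative, and the dedupe buffer is split 
across cleaner threads, capped at 2 GB per thread as the warning above notes):

# server.properties
log.cleaner.threads=1
log.cleaner.dedupe.buffer.size=2000000000
log.cleaner.io.buffer.size=524288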

Is there something else I can do to help the broker compact the 
__consumer_offsets topic?

Thank you again for your help!

Lawrence Weikum 

On 7/13/16, 1:06 PM, "Rakesh Vidyadharan" <rvidyadha...@gracenote.com> wrote:

We ran into this as well, and I ended up with the following that works for us.

log.cleaner.dedupe.buffer.size=536870912
log.cleaner.io.buffer.size=2000





On 13/07/2016 14:01, "Lawrence Weikum" <lwei...@pandora.com> wrote:

>Apologies. Here is the full trace from a broker:
>
>[2016-06-24 09:57:39,881] ERROR [kafka-log-cleaner-thread-0], Error due to  
>(kafka.log.LogCleaner)
>java.lang.IllegalArgumentException: requirement failed: 9730197928 messages in 
>segment __consumer_offsets-36/.log but offset map can fit 
>only 5033164. You can increase log.cleaner.dedupe.buffer.size or decrease 
>log.cleaner.threads
>at scala.Predef$.require(Predef.scala:219)
>at 
> kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:584)
>at 
> kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:580)
>at 
> scala.collection.immutable.Stream$StreamWithFilter.foreach(Stream.scala:570)
>at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:580)
>at kafka.log.Cleaner.clean(LogCleaner.scala:322)
>at 
> kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:230)
>at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:208)
>at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>[2016-06-24 09:57:39,881] INFO [kafka-log-cleaner-thread-0], Stopped  
>(kafka.log.LogCleaner)
>
>
>Is log.cleaner.dedupe.buffer.size a broker setting?  What is a good number to 
>set it to?
>
>
>
>Lawrence Weikum 
>
>
>On 7/13/16, 11:18 AM, "Manikumar Reddy" <manikumar.re...@gmail.com> wrote:
>
>Can you post the complete error stack trace?   Yes, you need to
>restart the affected
>brokers.
>You can tweak log.cleaner.dedupe.buffer.size, log.cleaner.io.buffer.size
>configs.
>
>Some related JIRAs:
>
>https://issues.apache.org/jira/browse/KAFKA-3587
>https://issues.apache.org/jira/browse/KAFKA-3894
>https://issues.apache.org/jira/browse/KAFKA-3915
>
>On Wed, Jul 13, 2016 at 10:36 PM, Lawrence Weikum <lwei...@pandora.com>
>wrote:
>
>> Oh interesting. I didn’t know about that log file until now.
>>
>> The only error that has appeared across all brokers showing this
>> behavior is:
>>
>> ERROR [kafka-log-cleaner-thread-0], Error due to  (kafka.log.LogCleaner)
>>
>> Then we see many messages like this:
>>
>> INFO Compaction for partition [__consumer_offsets,30] is resumed
>> (kafka.log.LogCleaner)
>> INFO The cleaning for partition [__consumer_offsets,30] is aborted
>> (kafka.log.LogCleaner)
>>
>> Using Visual VM, I do not see any log-cleaner threads in those brokers.  I
>> do see it in the brokers not showing this behavior though.
>>
>> Any idea why the LogCleaner failed?
>>
>> As a temporary fix, should we restart the affected brokers?
>>
>> Thanks again!
>>
>>
>> Lawrence Weikum

Re: Consumer Offsets and Open FDs

2016-07-13 Thread Lawrence Weikum
Apologies. Here is the full trace from a broker:

[2016-06-24 09:57:39,881] ERROR [kafka-log-cleaner-thread-0], Error due to  
(kafka.log.LogCleaner)
java.lang.IllegalArgumentException: requirement failed: 9730197928 messages in 
segment __consumer_offsets-36/.log but offset map can fit 
only 5033164. You can increase log.cleaner.dedupe.buffer.size or decrease 
log.cleaner.threads
at scala.Predef$.require(Predef.scala:219)
at 
kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:584)
at 
kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:580)
at 
scala.collection.immutable.Stream$StreamWithFilter.foreach(Stream.scala:570)
at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:580)
at kafka.log.Cleaner.clean(LogCleaner.scala:322)
at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:230)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:208)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2016-06-24 09:57:39,881] INFO [kafka-log-cleaner-thread-0], Stopped  
(kafka.log.LogCleaner)


Is log.cleaner.dedupe.buffer.size a broker setting?  What is a good number to 
set it to?



Lawrence Weikum 


On 7/13/16, 11:18 AM, "Manikumar Reddy" <manikumar.re...@gmail.com> wrote:

Can you post the complete error stack trace?   Yes, you need to
restart the affected
brokers.
You can tweak log.cleaner.dedupe.buffer.size, log.cleaner.io.buffer.size
configs.

Some related JIRAs:

https://issues.apache.org/jira/browse/KAFKA-3587
https://issues.apache.org/jira/browse/KAFKA-3894
https://issues.apache.org/jira/browse/KAFKA-3915

On Wed, Jul 13, 2016 at 10:36 PM, Lawrence Weikum <lwei...@pandora.com>
wrote:

> Oh interesting. I didn’t know about that log file until now.
>
> The only error that has appeared across all brokers showing this
> behavior is:
>
> ERROR [kafka-log-cleaner-thread-0], Error due to  (kafka.log.LogCleaner)
>
> Then we see many messages like this:
>
> INFO Compaction for partition [__consumer_offsets,30] is resumed
> (kafka.log.LogCleaner)
> INFO The cleaning for partition [__consumer_offsets,30] is aborted
> (kafka.log.LogCleaner)
>
> Using Visual VM, I do not see any log-cleaner threads in those brokers.  I
> do see it in the brokers not showing this behavior though.
>
> Any idea why the LogCleaner failed?
>
> As a temporary fix, should we restart the affected brokers?
>
> Thanks again!
>
>
> Lawrence Weikum
>
> On 7/13/16, 10:34 AM, "Manikumar Reddy" <manikumar.re...@gmail.com> wrote:
>
> Hi,
>
> Are you seeing any errors in log-cleaner.log?  The log-cleaner thread can
> crash on certain errors.
>
> Thanks
> Manikumar
>
> On Wed, Jul 13, 2016 at 9:54 PM, Lawrence Weikum <lwei...@pandora.com>
> wrote:
>
> > Hello,
> >
> > We’re seeing a strange behavior in Kafka 0.9.0.1 which occurs about every
> > other week.  I’m curious if others have seen it and know of a solution.
> >
> > Setup and Scenario:
> >
> > -  Brokers initially setup with log compaction turned off
> >
> > -  After 30 days, log compaction was turned on
> >
> > -  At this time, the number of Open FDs was ~ 30K per broker.
> >
> > -  After 2 days, the __consumer_offsets topic was compacted
> > fully.  Open FDs reduced to ~5K per broker.
> >
> > -  Cluster has been under normal load for roughly 7 days.
> >
> > -  At the 7 day mark, __consumer_offsets topic seems to have
> > stopped compacting on two of the brokers, and on those brokers, the FD
> > count is up to ~25K.
> >
> >
> > We have tried rebalancing the partitions before.  The first time, the
> > destination broker had compacted the data fine and open FDs were low. The
> > second time, the destination broker kept the FDs open.
> >
> >
> > In all the broker logs, we’re seeing these messages:
> > INFO [Group Metadata Manager on Broker 8]: Removed 0 expired offsets in 0
> > milliseconds. (kafka.coordinator.GroupMetadataManager)
> >
> > There are only 4 consumers at the moment on the cluster; one topic with
> 92
> > partitions.
> >
> > Is there a reason why log compaction may stop working or why the
> > __consumer_offsets topic would start holding thousands of FDs?
> >
> > Thank you all for your help!
> >
> > Lawrence Weikum
> >
> >
>
>
>




Re: Consumer Offsets and Open FDs

2016-07-13 Thread Lawrence Weikum
Oh interesting. I didn’t know about that log file until now.

The only error that has appeared across all brokers showing this behavior is:

ERROR [kafka-log-cleaner-thread-0], Error due to  (kafka.log.LogCleaner)

Then we see many messages like this:

INFO Compaction for partition [__consumer_offsets,30] is resumed 
(kafka.log.LogCleaner)
INFO The cleaning for partition [__consumer_offsets,30] is aborted 
(kafka.log.LogCleaner)

Using Visual VM, I do not see any log-cleaner threads in those brokers.  I do 
see it in the brokers not showing this behavior though.

Any idea why the LogCleaner failed?

As a temporary fix, should we restart the affected brokers?

Thanks again!


Lawrence Weikum 

On 7/13/16, 10:34 AM, "Manikumar Reddy" <manikumar.re...@gmail.com> wrote:

Hi,

Are you seeing any errors in log-cleaner.log?  The log-cleaner thread can
crash on certain errors.

Thanks
Manikumar

On Wed, Jul 13, 2016 at 9:54 PM, Lawrence Weikum <lwei...@pandora.com>
wrote:

> Hello,
>
> We’re seeing a strange behavior in Kafka 0.9.0.1 which occurs about every
> other week.  I’m curious if others have seen it and know of a solution.
>
> Setup and Scenario:
>
> -  Brokers initially setup with log compaction turned off
>
> -  After 30 days, log compaction was turned on
>
> -  At this time, the number of Open FDs was ~ 30K per broker.
>
> -  After 2 days, the __consumer_offsets topic was compacted
> fully.  Open FDs reduced to ~5K per broker.
>
> -  Cluster has been under normal load for roughly 7 days.
>
> -  At the 7 day mark, __consumer_offsets topic seems to have
> stopped compacting on two of the brokers, and on those brokers, the FD
> count is up to ~25K.
>
>
> We have tried rebalancing the partitions before.  The first time, the
> destination broker had compacted the data fine and open FDs were low. The
> second time, the destination broker kept the FDs open.
>
>
> In all the broker logs, we’re seeing these messages:
> INFO [Group Metadata Manager on Broker 8]: Removed 0 expired offsets in 0
> milliseconds. (kafka.coordinator.GroupMetadataManager)
>
> There are only 4 consumers at the moment on the cluster; one topic with 92
> partitions.
>
> Is there a reason why log compaction may stop working or why the
> __consumer_offsets topic would start holding thousands of FDs?
>
> Thank you all for your help!
>
> Lawrence Weikum
>
>




Consumer Offsets and Open FDs

2016-07-13 Thread Lawrence Weikum
Hello,

We’re seeing a strange behavior in Kafka 0.9.0.1 which occurs about every other 
week.  I’m curious if others have seen it and know of a solution.

Setup and Scenario:

-  Brokers initially setup with log compaction turned off

-  After 30 days, log compaction was turned on

-  At this time, the number of Open FDs was ~ 30K per broker.

-  After 2 days, the __consumer_offsets topic was compacted fully.  
Open FDs reduced to ~5K per broker.

-  Cluster has been under normal load for roughly 7 days.

-  At the 7 day mark, __consumer_offsets topic seems to have stopped 
compacting on two of the brokers, and on those brokers, the FD count is up to 
~25K.


We have tried rebalancing the partitions before.  The first time, the 
destination broker had compacted the data fine and open FDs were low. The 
second time, the destination broker kept the FDs open.


In all the broker logs, we’re seeing these messages:
INFO [Group Metadata Manager on Broker 8]: Removed 0 expired offsets in 0 
milliseconds. (kafka.coordinator.GroupMetadataManager)

There are only 4 consumers at the moment on the cluster; one topic with 92 
partitions.

Is there a reason why log compaction may stop working or why the 
__consumer_offsets topic would start holding thousands of FDs?

Thank you all for your help!

Lawrence Weikum



Re: Log Compaction not Deleting on Kafka 0.9.0.1

2016-06-22 Thread Lawrence Weikum
Fascinating.  

We are seeing no errors or warnings in the logs after restart.  It appears that 
on this broker the compaction thread is working:

[2016-06-22 10:33:49,179] INFO Rolled new log segment for 
'__consumer_offsets-28' in 1 ms. (kafka.log.Log)
[2016-06-22 10:34:00,968] INFO Deleting segment 0 from log 
__consumer_offsets-28. (kafka.log.Log)
[2016-06-22 10:34:00,970] INFO Deleting index 
/kafka/data/__consumer_offsets-28/.index.deleted 
(kafka.log.OffsetIndex)
[2016-06-22 10:34:00,992] INFO Deleting segment 2148144095 from log 
__consumer_offsets-28. (kafka.log.Log)
[2016-06-22 10:34:00,994] INFO Deleting index 
/kafka/data/__consumer_offsets-28/002148144095.index.deleted 
(kafka.log.OffsetIndex)
[2016-06-22 10:34:01,002] INFO Deleting segment 3189277822 from log 
__consumer_offsets-28. (kafka.log.Log)
[2016-06-22 10:34:01,004] INFO Deleting index 
/kafka/data/__consumer_offsets-28/003189277822.index.deleted 
(kafka.log.OffsetIndex)
[2016-06-22 10:34:02,019] INFO Deleting segment 3190205744 from log 
__consumer_offsets-28. (kafka.log.Log)
[2016-06-22 10:34:02,039] INFO Deleting index 
/kafka/data/__consumer_offsets-28/003190205744.index.deleted 
(kafka.log.OffsetIndex)

We see the “kafka-log-cleaner-thread” in JMX.  It seems to run about every 
50 seconds.  From the log-cleaner.log file, we see plenty of output like this 
regarding the partition that’s hogging the FDs:

[2016-06-22 10:44:31,845] INFO Cleaner 0: Beginning cleaning of log 
__consumer_offsets-28. (kafka.log.LogCleaner)
[2016-06-22 10:44:31,846] INFO Cleaner 0: Building offset map for 
__consumer_offsets-28... (kafka.log.LogCleaner)
[2016-06-22 10:44:31,878] INFO Cleaner 0: Building offset map for log 
__consumer_offsets-28 for 1 segments in offset range [3204124461, 3205052375). 
(kafka.log.LogCleaner)
[2016-06-22 10:44:32,870] INFO Cleaner 0: Offset map for log 
__consumer_offsets-28 complete. (kafka.log.LogCleaner)
[2016-06-22 10:44:32,871] INFO Cleaner 0: Cleaning log __consumer_offsets-28 
(discarding tombstones prior to Tue Jun 21 10:43:19 PDT 2016)... 
(kafka.log.LogCleaner)
[2016-06-22 10:44:32,871] INFO Cleaner 0: Cleaning segment 0 in log 
__consumer_offsets-28 (last modified Tue Jun 21 22:39:18 PDT 2016) into 0, 
retaining deletes. (kafka.log.LogCleaner)
[2016-06-22 10:44:32,888] INFO Cleaner 0: Swapping in cleaned segment 0 for 
segment(s) 0 in log __consumer_offsets-28. (kafka.log.LogCleaner)
[2016-06-22 10:44:32,889] INFO Cleaner 0: Cleaning segment 2148144095 in log 
__consumer_offsets-28 (last modified Wed Jun 22 10:42:31 PDT 2016) into 
2148144095, retaining deletes. (kafka.log.LogCleaner)
[2016-06-22 10:44:32,889] INFO Cleaner 0: Cleaning segment 3203196540 in log 
__consumer_offsets-28 (last modified Wed Jun 22 10:43:19 PDT 2016) into 
2148144095, retaining deletes. (kafka.log.LogCleaner)
[2016-06-22 10:44:32,905] INFO Cleaner 0: Swapping in cleaned segment 
2148144095 for segment(s) 2148144095,3203196540 in log __consumer_offsets-28. 
(kafka.log.LogCleaner)
[2016-06-22 10:44:32,905] INFO Cleaner 0: Cleaning segment 3204124461 in log 
__consumer_offsets-28 (last modified Wed Jun 22 10:44:21 PDT 2016) into 
3204124461, retaining deletes. (kafka.log.LogCleaner)
[2016-06-22 10:44:33,834] INFO Cleaner 0: Swapping in cleaned segment 
3204124461 for segment(s) 3204124461 in log __consumer_offsets-28. 
(kafka.log.LogCleaner)
[2016-06-22 10:44:33,836] INFO [kafka-log-cleaner-thread-0],
Log cleaner thread 0 cleaned log __consumer_offsets-28 (dirty section = 
[3204124461, 3205052375])
100.0 MB of log processed in 2.0 seconds (50.3 MB/sec).
Indexed 100.0 MB in 1.0 seconds (97.6 Mb/sec, 51.5% of total time)
Buffer utilization: 0.0%
Cleaned 100.0 MB in 1.0 seconds (103.6 Mb/sec, 48.5% of total time)
Start size: 100.0 MB (928,011 messages)
End size: 0.0 MB (97 messages)
100.0% size reduction (100.0% fewer messages)
 (kafka.log.LogCleaner)

But there are no actual delete messages of the kind a properly working broker 
shows for a different partition.
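
A quick check we could run on the stuck broker (the /kafka/data path is taken 
from the log lines above; the process lookup is illustrative):

# Segment files for the partition that never shrinks
ls /kafka/data/__consumer_offsets-28 | wc -l

# File descriptors the broker holds open for that partition
lsof -p "$(pgrep -f kafka.Kafka)" | grep __consumer_offsets-28 | wc -l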

Lawrence Weikum 


On 6/22/16, 11:28 AM, "Tom Crayford" <tcrayf...@heroku.com> wrote:

kafka-log-cleaner-thread-0



Log Compaction not Deleting on Kafka 0.9.0.1

2016-06-22 Thread Lawrence Weikum
We seem to be having a strange issue with a cluster of ours; specifically with 
the __consumer_offsets topic.

When we brought the cluster online, log compaction was turned off.  Realizing 
our mistake, we turned it on, but only after the topic had over 31,018,699,972 
offsets committed to it.  Log compaction seems to have worked and to be working 
properly.  The logs show that every partition has been compacted, and many 
pieces have been marked for deletion.

The problem is that not all partitions are having their older logs deleted.  
Some partitions will grow to 19 log files, but a few seconds later will have 
only 13.  One partition in particular, though, still has all of its log files, 
all 19,000 of them, and that count never seems to shrink; it only grows as new 
offsets come in.

Restarting that broker doesn’t seem to help.


We’ve checked the broker settings on everything as well.

log.cleaner.enable = true
log.cleanup.policy = delete
cleanup.policy = compact
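
One way to double-check the effective topic-level override for the offsets topic 
(ZooKeeper address is illustrative); cleanup.policy=compact should show up under 
the topic's Configs:

kafka-topics.sh --zookeeper zk1:2181 --describe --topic __consumer_offsets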


Has anyone encountered this issue before?

Thank you all for the help!

Lawrence Weikum



Re: JVM Optimizations

2016-06-09 Thread Lawrence Weikum
Hi Tom,

Currently we’re using the default settings – no special tuning whatsoever.  I 
think the kafka-run-class.sh has this:


# Memory options
if [ -z "$KAFKA_HEAP_OPTS" ]; then
  KAFKA_HEAP_OPTS="-Xmx256M"
fi

# JVM performance options
if [ -z "$KAFKA_JVM_PERFORMANCE_OPTS" ]; then
  KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 
-XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC 
-Djava.awt.headless=true"
fi
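
For comparison, a minimal sketch of overriding that 256 MB default through the 
environment variable the script reads, along the lines of the 4 GB G1 heap 
mentioned in the quoted reply below (the value is illustrative):

# Set before starting the broker; kafka-run-class.sh picks this up
export KAFKA_HEAP_OPTS="-Xms4G -Xmx4G"
bin/kafka-server-start.sh config/server.properties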


Is this the confluent doc you were referring to?  
http://docs.confluent.io/2.0.0/kafka/deployment.html

Thanks!

Lawrence Weikum 


On 6/9/16, 1:32 PM, "Tom Crayford" <tcrayf...@heroku.com> wrote:

>Hi Lawrence,
>
>What JVM options were you using? There's a few pages in the confluent docs
>on JVM tuning iirc. We simply use the G1 and a 4GB Max heap and things work
>well (running many thousands of clusters).
>
>Thanks
>Tom Crayford
>Heroku Kafka
>
>On Thursday, 9 June 2016, Lawrence Weikum <lwei...@pandora.com> wrote:
>
>> Hello all,
>>
>> We’ve been running a benchmark test on a Kafka cluster of ours running
>> 0.9.0.1 – slamming it with messages to see when/if things might break.
>> During our test, we caused two brokers to throw OutOfMemory errors (looks
>> like from the Heap) even though each machine still has 43% of the total
>> memory unused.
>>
>> I’m curious what JVM optimizations are recommended for Kafka brokers?  Or
>> if there aren’t any that are recommended, what are some optimizations
>> others are using to keep the brokers running smoothly?
>>
>> Best,
>>
>> Lawrence Weikum
>>
>>



JVM Optimizations

2016-06-09 Thread Lawrence Weikum
Hello all,

We’ve been running a benchmark test on a Kafka cluster of ours running 0.9.0.1 
– slamming it with messages to see when/if things might break.  During our 
test, we caused two brokers to throw OutOfMemory errors (apparently from the 
heap) even though each machine still has 43% of its total memory unused.

I’m curious what JVM optimizations are recommended for Kafka brokers?  Or if 
there aren’t any that are recommended, what are some optimizations others are 
using to keep the brokers running smoothly?

Best,

Lawrence Weikum



SchemaException error on 0.10.0.0-snapshot

2016-05-18 Thread Lawrence Weikum
Setting up a basic word count job, similar to the one from 
confluent-2.1.0-alpha1 (WordCountJob).  Mine does the same thing essentially, 
just within our namespace.

I’m coming across this error, though, which is strange because for the moment 
the code is identical except for the topic names.

Exception in thread "StreamThread-1" 
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 
'topic_metadata': Error reading array of size 1731681, only 54 bytes available
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:75)
at org.apache.kafka.clients.NetworkClient.parseResponse(NetworkClient.java:380)
at 
org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:449)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:179)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:973)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937)
at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:295)
at 
org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:218)


Is there something I'm missing?

Thanks for the help.

Lawrence


Begin full output:


/Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home/bin/java 
-Didea.launcher.port=7539 "-Didea.launcher.bin.path=/Applications/IntelliJ 
IDEA.app/Contents/bin" -Dfile.encoding=UTF-8 -classpath 

Topic retention configuration clarification

2016-05-02 Thread Lawrence Weikum
Using 0.9.0.1.

I'm building a new topic that should keep data for much longer than the 
brokers' default, say at least a year, before deleting messages.  
http://kafka.apache.org/documentation.html says setting the "retention.ms" for 
the topic will adjust the time, but I cannot find out what unit of time Kafka 
uses for this.  "ms" would suggest "milliseconds", so a year would be about 
3.154e+10 milliseconds.  This seems like an uncomfortably high number to give.

Can anyone confirm that the "retention.ms" topic config is in milliseconds?  Is 
there also a "retention.minutes" that's just undocumented?
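
For reference, a sketch of setting a roughly one-year retention on an existing 
topic (ZooKeeper address and topic name are illustrative; the value is in 
milliseconds):

# 365 days * 24 h * 60 min * 60 s * 1000 ms = 31,536,000,000 ms
kafka-configs.sh --zookeeper zk1:2181 --alter --entity-type topics \
  --entity-name my-topic --add-config retention.ms=31536000000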

Thanks!


Lawrence Weikum


Default behavior for full broker

2016-04-20 Thread Lawrence Weikum
Hello,

I'm curious about the expected or default behavior that might occur if a broker 
in the system has filled up.  By that I mean when a broker has used all of its 
memory and disk space.  Is the node simply removed from the system until space 
is cleared?

As I'm thinking through this a little more: if that is the case and the node is 
removed, some partitions in the system may be marked as under-replicated, which 
could trigger a cascading effect where partitions are re-replicated and other 
nodes fill up in turn.  Has that ever happened?  Does Kafka have a contingency 
plan for such a scenario?

Thank you so much for your insight and all of your hard work!

Lawrence