I don't have much to add on this, but a question: what is version 0.8.2.3? I thought the latest in the 0.8 series was 0.8.2.2?
-Dana

On Dec 17, 2015 5:56 PM, "Rajiv Kurian" <ra...@signalfx.com> wrote:

> Yes, we are in the process of upgrading to the new producers. But the
> problem seems deeper than a compatibility issue. We have one environment
> where the old producers work with the new 0.9 broker. Further, when we
> reverted our messed-up 0.9 environment to 0.8.2.3, the problem with those
> topics didn't go away.
>
> Didn't see any ZK issues on the brokers. There were other topics on the
> very same brokers that didn't seem to be affected.
>
> On Thu, Dec 17, 2015 at 5:46 PM, Jun Rao <j...@confluent.io> wrote:
>
> > Yes, the new java producer is available in 0.8.2.x and we recommend
> > people use that.
> >
> > Also, when those producers had the issue, was anything else weird in
> > the broker (e.g., the broker's ZK session expiring)?
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Dec 17, 2015 at 2:37 PM, Rajiv Kurian <ra...@signalfx.com>
> > wrote:
> >
> > > I can't think of anything special about the topics besides the
> > > clients being very old (Java wrappers over Scala).
> > >
> > > I do think it was using ack=0. But my guess is that the logging was
> > > done by the Kafka producer thread. My application itself was not
> > > getting exceptions from Kafka.
> > >
> > > On Thu, Dec 17, 2015 at 2:31 PM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Hmm, anything special with those 3 topics? Also, the broker log
> > > > shows that the producer uses ack=0, which means the producer
> > > > shouldn't get errors like leader not found. Could you clarify the
> > > > ack setting used by the producer?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Dec 17, 2015 at 12:41 PM, Rajiv Kurian
> > > > <ra...@signalfx.com> wrote:
> > > >
> > > > > The topic which stopped working had clients that were only using
> > > > > the old Java producer that is a wrapper over the Scala producer.
> > > > > Again, it seemed to work perfectly in another of our realms where
> > > > > we have the same topics, same producers/consumers, etc., but with
> > > > > less traffic.
> > > > >
> > > > > On Thu, Dec 17, 2015 at 12:23 PM, Jun Rao <j...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Are you using the new java producer?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, Dec 17, 2015 at 9:58 AM, Rajiv Kurian
> > > > > > <ra...@signalfx.com> wrote:
> > > > > >
> > > > > > > Hi Jun,
> > > > > > > Answers inline:
> > > > > > >
> > > > > > > On Thu, Dec 17, 2015 at 9:41 AM, Jun Rao <j...@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Rajiv,
> > > > > > > >
> > > > > > > > Thanks for reporting this.
> > > > > > > >
> > > > > > > > 1. How did you verify that 3 of the topics are corrupted?
> > > > > > > > Did you use the DumpLogSegments tool? Also, is there a
> > > > > > > > simple way to reproduce the corruption?
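> > > > > > > > For reference, something along these lines should dump a
> > > > > > > > segment for inspection (the log dir and segment file below
> > > > > > > > are placeholders; adjust them for your setup):
> > > > > > > >
> > > > > > > >   bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
> > > > > > > >     --files /var/kafka-logs/your-topic-0/00000000000000000000.log \
> > > > > > > >     --print-data-log --deep-iteration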
> > > > > > >
> > > > > > > No, I did not. The only reason I had to believe that was that
> > > > > > > no writers could write to the topic. I actually have no idea
> > > > > > > what the problem was. I saw very frequent (much more than
> > > > > > > usual) messages of the form:
> > > > > > >
> > > > > > >   INFO [kafka-request-handler-2] [kafka.server.KafkaApis]:
> > > > > > >   [KafkaApi-6] Close connection due to error handling produce
> > > > > > >   request with correlation id 294218 from client id with ack=0
> > > > > > >
> > > > > > > and also messages of the form:
> > > > > > >
> > > > > > >   INFO [kafka-network-thread-9092-0] [kafka.network.Processor]:
> > > > > > >   Closing socket connection to /some ip
> > > > > > >
> > > > > > > The cluster was actually a critical one, so I had no recourse
> > > > > > > but to revert the change (which, as noted, didn't fix things).
> > > > > > > I didn't have enough time to debug further. The only way I
> > > > > > > could fix it with my limited Kafka knowledge was (after
> > > > > > > reverting) deleting the topic and recreating it.
> > > > > > >
> > > > > > > I had updated a low-priority cluster before that worked just
> > > > > > > fine. That gave me the confidence to upgrade this
> > > > > > > higher-priority cluster, which did NOT work out. So the only
> > > > > > > way for me to try to reproduce it is to try this on our larger
> > > > > > > clusters again. But it is critical that we don't mess up this
> > > > > > > high-priority cluster, so I am afraid to try again.
> > > > > > >
> > > > > > > > 2. As Lance mentioned, if you are using snappy, make sure
> > > > > > > > that you include the right snappy jar (1.1.1.7).
> > > > > > >
> > > > > > > Wonder why I don't see Lance's email in this thread. Either
> > > > > > > way, we are not using compression of any kind on this topic.
> > > > > > >
> > > > > > > > 3. For the CPU issue, could you do a bit of profiling to see
> > > > > > > > which thread is busy and where it's spending time?
> > > > > > >
> > > > > > > Since I had to revert, I didn't have the time to profile.
> > > > > > > Intuitively, it would seem like the high number of client
> > > > > > > disconnects/errors and the increased network usage probably
> > > > > > > has something to do with the high CPU (total guess). Again,
> > > > > > > our other (lower-traffic) cluster that was upgraded was
> > > > > > > totally fine, so it doesn't seem like it happens all the time.
> > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Tue, Dec 15, 2015 at 12:52 PM, Rajiv Kurian
> > > > > > > > <ra...@signalfx.com> wrote:
> > > > > > > >
> > > > > > > > > We had to revert to 0.8.2.3 because three of our topics
> > > > > > > > > seem to have gotten corrupted during the upgrade. As soon
> > > > > > > > > as we did the upgrade, producers to the three topics I
> > > > > > > > > mentioned stopped being able to do writes. The clients
> > > > > > > > > complained (occasionally) about leader-not-found
> > > > > > > > > exceptions. We restarted our clients and brokers, but
> > > > > > > > > that didn't seem to help. Actually, even after reverting
> > > > > > > > > to 0.8.2.3 these three topics were broken. To fix it we
> > > > > > > > > had to stop all clients, delete the topics, create them
> > > > > > > > > again and then restart the clients.
> > > > > > > > >
> > > > > > > > > I realize this is not a lot of info. I couldn't wait to
> > > > > > > > > get more debug info because the cluster was actually
> > > > > > > > > being used. Has anyone run into something like this? Are
> > > > > > > > > there any known issues with old consumers/producers? The
> > > > > > > > > topics that got busted had clients writing to them using
> > > > > > > > > the old Java wrapper over the Scala producer.
> > > > > > > > >
> > > > > > > > > Here are the steps I took to upgrade.
> > > > > > > > >
> > > > > > > > > For each broker:
> > > > > > > > >
> > > > > > > > > 1. Stop the broker.
> > > > > > > > > 2. Restart with the 0.9 broker running with
> > > > > > > > >    inter.broker.protocol.version=0.8.2.X
> > > > > > > > > 3. Wait for under-replicated partitions to go down to 0.
> > > > > > > > > 4. Go to step 1.
> > > > > > > > >
> > > > > > > > > Once all the brokers were running the 0.9 code with
> > > > > > > > > inter.broker.protocol.version=0.8.2.X, we restarted them
> > > > > > > > > one by one with inter.broker.protocol.version=0.9.0.0 (a
> > > > > > > > > sketch of the config flip is below).
> > > > > > > > >
> > > > > > > > > When reverting, I did the following.
> > > > > > > > >
> > > > > > > > > For each broker:
> > > > > > > > >
> > > > > > > > > 1. Stop the broker.
> > > > > > > > > 2. Restart with the 0.9 broker running with
> > > > > > > > >    inter.broker.protocol.version=0.8.2.X
> > > > > > > > > 3. Wait for under-replicated partitions to go down to 0.
> > > > > > > > > 4. Go to step 1.
> > > > > > > > >
> > > > > > > > > Once all the brokers were running 0.9 code with
> > > > > > > > > inter.broker.protocol.version=0.8.2.X, I restarted them
> > > > > > > > > one by one with the 0.8.2.3 broker code. This, however,
> > > > > > > > > as I mentioned, did not fix the three broken topics.
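> > > > > > > > >
> > > > > > > > > Concretely, the server.properties change on each broker
> > > > > > > > > looked roughly like this (a sketch from memory, not the
> > > > > > > > > exact file):
> > > > > > > > >
> > > > > > > > >   # Phase 1: run the 0.9 code but keep speaking the old
> > > > > > > > >   # inter-broker protocol
> > > > > > > > >   inter.broker.protocol.version=0.8.2.X
> > > > > > > > >
> > > > > > > > >   # Phase 2 (only once every broker is on 0.9 code):
> > > > > > > > >   # flip the protocol, again one broker at a time
> > > > > > > > >   inter.broker.protocol.version=0.9.0.0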
> > > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 3:13 PM, Rajiv Kurian
> > > > > > > > > <ra...@signalfx.com> wrote:
> > > > > > > > >
> > > > > > > > > > Now that it has been a bit longer, the spikes I was
> > > > > > > > > > seeing are gone, but the CPU and network in/out on the
> > > > > > > > > > three brokers that were showing the spikes are still
> > > > > > > > > > much higher than before the upgrade. Their CPUs have
> > > > > > > > > > increased from around 1-2% to 12-20%. The network in on
> > > > > > > > > > the same brokers has gone up from under 2 Mb/sec to
> > > > > > > > > > 19-33 Mb/sec. The network out has gone up from under 2
> > > > > > > > > > Mb/sec to 29-42 Mb/sec. I don't see a corresponding
> > > > > > > > > > increase in the Kafka messages-in or bytes-in per-second
> > > > > > > > > > JMX metrics.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Rajiv
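> > > > > > > > > >
> > > > > > > > > > For reference, the broker MBeans I was watching (I
> > > > > > > > > > believe these are the names as of 0.8.2+, e.g. as seen
> > > > > > > > > > in jconsole) were along the lines of:
> > > > > > > > > >
> > > > > > > > > >   kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
> > > > > > > > > >   kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
> > > > > > > > > >   kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec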