Thanks for the response. Reading through that thread, it appears this issue was addressed with KAFKA-3810 <https://issues.apache.org/jira/browse/KAFKA-3810>, which relaxes the fetch size restriction for replication of internal topics so that an oversized group metadata message can still be replicated. However, should the outcome instead be a more comprehensive change to the serialization format of the request? The size of the group metadata message currently grows linearly with the number of topic-partitions in the assignment, which is difficult to tune for in a deployment that uses topic auto creation, since the eventual partition count isn't known up front.
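To make that growth concrete, here is a rough back-of-envelope sketch of how the single group metadata message written to __consumer_offsets scales with group size and assignment size. The per-field byte counts are my own assumptions for illustration, not Kafka's actual serialization format; the member and partition counts come from the configuration in the quoted thread below.

```java
// Back-of-envelope estimate of the group metadata message size appended to
// __consumer_offsets on each rebalance. The per-field byte counts are
// illustrative assumptions, not Kafka's actual wire format.
public class GroupMetadataSizeEstimate {

    static long estimateBytes(int members, int partitionsPerMember,
                              int avgTopicNameLen, int avgClientIdLen) {
        // Per member: member id, client id, host, timeouts (assumed ~64 bytes of overhead)
        long perMemberFixed = 2L * avgClientIdLen + 64;
        // Per member: subscription metadata (topic names plus protocol overhead, assumed)
        long perMemberSubscription = avgTopicNameLen + 16;
        // Per member: assignment, roughly topic name + 4 bytes per assigned partition index (assumed)
        long perMemberAssignment = avgTopicNameLen + 4L * partitionsPerMember + 16;
        return members * (perMemberFixed + perMemberSubscription + perMemberAssignment);
    }

    public static void main(String[] args) {
        // Numbers from the thread below: 4000 consumers in "my_group" sharing 800 partitions,
        // so most members own at most one partition.
        long bytes = estimateBytes(4000, 1, "my_topic".length(), 32);
        System.out.printf("roughly %d KB per generation, against a ~1 MB default message limit%n",
                bytes / 1024);
    }
}
```

Even with these conservative assumptions the per-member bookkeeping alone approaches the default limit at 4000 members, which lines up with the RecordTooLargeException in the coordinator logs below.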
On Fri, Mar 17, 2017 at 3:17 AM, James Cheng <wushuja...@gmail.com> wrote:

> I think it's due to the high number of partitions and the high number of
> consumers in the group. The group coordination info to keep track of the
> assignments actually happens via a message that travels through the
> __consumer_offsets topic. So with so many partitions and consumers, the
> message gets too big to go through the topic.
>
> There is a long thread here that discusses it. I don't remember what
> specific actions came out of that discussion.
> http://search-hadoop.com/m/Kafka/uyzND1yd26N1rFtRd1?subj=+DISCUSS+scalability+limits+in+the+coordinator
>
> -James
>
> Sent from my iPhone
>
> > On Mar 15, 2017, at 9:40 AM, Robert Quinlivan <rquinli...@signal.co> wrote:
> >
> > I should also mention that this error was seen on broker version 0.10.1.1.
> > I found that this condition sounds somewhat similar to KAFKA-4362
> > <https://issues.apache.org/jira/browse/KAFKA-4362>, but that issue was
> > submitted in 0.10.1.1 so they appear to be different issues.
> >
> > On Wed, Mar 15, 2017 at 11:11 AM, Robert Quinlivan <rquinli...@signal.co>
> > wrote:
> >
> >> Good morning,
> >>
> >> I'm hoping for some help understanding the expected behavior for an
> >> offset commit request and why this request might fail on the broker.
> >>
> >> *Context:*
> >>
> >> For context, my configuration looks like this:
> >>
> >>    - Three brokers
> >>    - Consumer offsets topic replication factor set to 3
> >>    - Auto commit enabled
> >>    - The user application topic, which I will call "my_topic", has a
> >>    replication factor of 3 as well and 800 partitions
> >>    - 4000 consumers attached in consumer group "my_group"
> >>
> >> *Issue:*
> >>
> >> When I attach the consumers, the coordinator logs the following error
> >> message repeatedly for each generation:
> >>
> >> ERROR [Group Metadata Manager on Broker 0]: Appending metadata message
> >> for group my_group generation 2066 failed due to
> >> org.apache.kafka.common.errors.RecordTooLargeException, returning
> >> UNKNOWN error code to the client (kafka.coordinator.GroupMetadataManager)
> >>
> >> *Observed behavior:*
> >>
> >> The consumer group does not stay connected long enough to consume
> >> messages. It is effectively stuck in a rebalance loop and the "my_topic"
> >> data has become unavailable.
> >>
> >> *Investigation:*
> >>
> >> Following the Group Metadata Manager code, it looks like the broker is
> >> writing to a cache after it writes an Offset Commit Request to the log
> >> file. If this cache write fails, the broker then logs this error and
> >> returns an error code in the response. In this case, the error from the
> >> cache is MESSAGE_TOO_LARGE, which is logged as a RecordTooLargeException.
> >> However, the broker then sets the error code to UNKNOWN on the Offset
> >> Commit Response.
> >>
> >> It seems that the issue is the size of the metadata in the Offset Commit
> >> Request. I have the following questions:
> >>
> >> 1. What is the size limit for this request? Are we exceeding the size
> >> which is causing this request to fail?
> >> 2. If this is an issue with metadata size, what would cause abnormally
> >> large metadata?
> >> 3. How is this cache used within the broker?
> >>
> >> Thanks in advance for any insights you can provide.
> >>
> >> Regards,
> >> Robert Quinlivan
> >> Software Engineer, Signal
> >
> >
> > --
> > Robert Quinlivan
> > Software Engineer, Signal

--
Robert Quinlivan
Software Engineer, Signal
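P.S. Until there is a better answer on the serialization format, the workaround I'm planning to try is raising the maximum message size on the __consumer_offsets topic itself so the group metadata message fits. That is my own reading of the append path, not something confirmed upthread. On 0.10.x this would be a kafka-configs.sh topic-level override; the sketch below uses the newer Java AdminClient (not available on 0.10.1.1) just to show the idea, with a placeholder broker address and an arbitrary 2 MB limit. Note that before the KAFKA-3810 change, replica.fetch.max.bytes would also need to be at least as large for the offsets topic to keep replicating.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RaiseOffsetsTopicMessageLimit {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; substitute one of the actual brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource offsetsTopic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "__consumer_offsets");
            // 2 MB is an arbitrary example value; size it to the observed group metadata message.
            AlterConfigOp raiseLimit = new AlterConfigOp(
                    new ConfigEntry("max.message.bytes", "2097152"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(offsetsTopic, Collections.singletonList(raiseLimit)))
                 .all()
                 .get();
        }
    }
}
```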