Partitions are what enable scalability. Consumers and producers know which partition a record belongs in based on its key (or on manual assignment), which makes it very easy to scale up your Kafka cluster or a consuming cluster.
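To make the key-to-partition idea concrete, here is a minimal simulation in plain Java. It does not use the Kafka client at all: the class name, the `partitionFor` and `route` helpers, and the hash function are all my own stand-ins (Kafka's real partitioner uses a different hash). The only property it is meant to show is that records with the same key always land in the same partition, and so keep their relative order there.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation only: no Kafka client is involved. All names are hypothetical.
class KeyPartitionSketch {

    // Stand-in for a key-based partitioner: hash the key bytes and take the
    // result modulo the partition count. This is NOT Kafka's hash function;
    // only the "same key -> same partition" property matters here.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    // Appends each record's value to the log of the partition chosen by its
    // key. Each record is a {key, value} pair.
    static List<List<String>> route(List<String[]> records, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] record : records) {
            partitions.get(partitionFor(record[0], numPartitions)).add(record[1]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Two producers interleave their sends; the key is the producer id.
        List<String[]> records = List.of(
            new String[] {"producer-A", "A1"},
            new String[] {"producer-B", "B1"},
            new String[] {"producer-A", "A2"},
            new String[] {"producer-B", "B2"},
            new String[] {"producer-A", "A3"});
        List<List<String>> partitions = route(records, 3);
        for (int p = 0; p < partitions.size(); p++) {
            System.out.println("partition " + p + ": " + partitions.get(p));
        }
    }
}
```

Whichever partition a key hashes to, its records appear there in send order (A1, A2, A3), while the relative order of records with different keys is only preserved if they happen to share a partition.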
The ordering problem is one that I have faced, and I have a workaround. Just keep in mind that it limits your ability to scale past a certain point.

First, let's assume that ordering only matters per producer. This is a fairly standard assumption in most distributed, scalable, and performant systems. Given that assumption, you can either set the key of each message to a producer identifier or use manual partition assignment to ensure per-producer ordering. The choice depends on message volume and the nature of your producers. The number of producers must be >= the number of partitions.

On the consumption side, since a consumer subscribes to a full partition, it's guaranteed to read messages in the same order in which any specific producer produced them. But there is an implicit coupling between scaling the consumers and scaling the producer count, so if the consumer is slower than the producer you might get into trouble.

On 3/11/16, 1:38 AM, "Gerard Klijs" <gerard.kl...@dizzit.com> wrote:

>I noticed a similar effect with a test tool, which checked if the order
>the records were produced in was the same as the order in which they
>were consumed. Using only one partition it works fine, but using
>multiple partitions the order gets messed up. If I'm right this is by
>design, but I would like to hear some feedback about this. Because
>messages with the same key end up in the same partition, if you have
>multiple partitions, only the order within a partition is the same as
>the order they were produced in. But when consuming from multiple
>partitions the order could be different.
>
>If this is true it would be interesting what you should do when you
>have a topic where the order needs to be kept the same, and needs to be
>consumed by more than one consumer at a time?
>
>On Fri, Mar 11, 2016 at 5:50 AM Ewen Cheslack-Postava <e...@confluent.io>
>wrote:
>
>> You definitely *might* see data from multiple partitions, and that
>> won't be uncommon once you start processing data. However, there is no
>> guarantee.
>>
>> In practice, it may be unlikely to see data for both partitions on the
>> first call to poll() for a simple reason: poll() will return as soon as
>> any data for any partition is available. Unless things are timed just
>> right, you're probably making requests to different brokers for data in
>> the different partitions. These requests won't be perfectly aligned --
>> one of them will get a response first and the poll() will be able to
>> return with some data. Since only the one response will have been
>> received, only one partition will get data.
>>
>> After the first poll, you probably spend some time processing that data
>> before you call poll again. However, another request has been sent out
>> to the broker that returned data faster and the other request also gets
>> returned. So on the next poll, you might be more likely to see data from
>> both partitions.
>>
>> So you're right: there's no hard guarantee, and you shouldn't write your
>> consumer code to assume that data will be returned for all partitions.
>> (And you can't assume that anyway; what if no new data had been
>> published to one of the partitions?). However, many times you will see
>> data from multiple partitions.
>>
>> -Ewen
>>
>> On Thu, Mar 10, 2016 at 11:21 AM, Shrijeet Paliwal
>> <shrijeet.pali...@gmail.com> wrote:
>>
>> > Version: 0.9.0.1
>> >
>> > I have a test which creates two partitions in a topic, writes data to
>> > both partitions. Then a single consumer subscribes to the topic,
>> > verifies that it has got the assignment of both partitions in that
>> > topic & finally issues a poll. The first poll always comes back with
>> > records of only one partition.
>> > I need to poll one more time to get records for the second
>> > partition. The poll timeout has no effect on this.
>> >
>> > Unless I've misunderstood the contract - the first poll *could* have
>> > returned records for both the partitions. After all, poll returns
>> > ConsumerRecords<K,V>, which is a map of topic_partitions --> records
>> >
>> > I acknowledge that the API does not make any hard guarantees that
>> > align with my expectation, but it looks like the API was crafted to
>> > support multiple partitions & topics in a single call. Is there an
>> > implementation detail which restricts this? Is there a configuration
>> > which is controlling what gets fetched?
>> >
>> > --
>> > Shrijeet
>> >
>>
>>
>>
>> --
>> Thanks,
>> Ewen