Hi, The other variable to think about here is the task to container mapping. Each job will indeed have 1 task per input partition in the underlying topic but you can then spread those 500 instances across multiple containers in your Yarn grid:
http://samza.apache.org/learn/documentation/0.9/container/samza-container.html I'd also suggest thinking about throughput requirements in terms of both the Kafka and Samza perspectives. Great blog post from one of the Confluent guys here: http://blog.confluent.io/2015/03/12/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ For me I have Kafka brokers with JBOD disks so my starting partition count was number of brokers * number of disks. I went with that, discovered it wasn't hitting my needed throughput and had to up that several times. Currently I have around 160 partitions for the high throughput topics (700K msgs/sec) on a 5 broker cluster. So over-partitioning is a good thing and gives additional throughput and more flexible growth but if you are ramping that too soon you are likely requiring growth of your job container count. Look at your throughput requirements and hardware assets, pick a starting point and test your assumptions. If you are like me you'll find most of them are wrong. :) Garry -----Original Message----- From: Lukas Steiblys [mailto:lu...@doubledutch.me] Sent: 21 May 2015 20:20 To: dev@samza.apache.org Subject: Re: Number of partitions Each job will get all the partitions and each task (500 of them) within the job will get 1 partition. So there will be 500 processes working through the log. I'd try to figure out what your scaling needs are for the next 2-3 years and then calculate your resource requirements accordingly (how many parallel executing tasks you would need). If you need to split, it is not trivial, but doable. Lukas -----Original Message----- From: Michael Ravits Sent: Thursday, May 21, 2015 11:17 AM To: dev@samza.apache.org Subject: Re: Number of partitions Well, since the number of partitions can't be changed after the system starts running I wanted to have the flexibility to grow a lot without stopping for upgrade. Just wonder what would be a tolerable number for Samza. For example if I'd start with 5 jobs, each will get 100 partitions. Is this reasonable? Or too much for a single job instance? On Thu, May 21, 2015 at 7:46 PM, Lukas Steiblys <lu...@doubledutch.me> wrote: > 500 is a bit extreme unless you're planning on running the job on some > 200 machines and try to exploit their full power. I personally run 4 > in production for our system processing 100 messages/s and there's > plenty of room to grow. > > Lukas > > On Thursday, May 21, 2015, Michael Ravits <michaelr...@gmail.com> wrote: > > > Hi, > > > > I wonder what are the considerations I need to account for in regard > > to > the > > number of partitions in input topics for Samza. > > When testing with a 500 partitions topic with one Samza job I > > noticed the start up time to be very long. > > Are there any problems that might occur when dealing with this > > number of partitions? > > > > Thanks, > > Michael > > > ----- No virus found in this message. Checked by AVG - www.avg.com Version: 2014.0.4800 / Virus Database: 4311/9817 - Release Date: 05/19/15