Hey Garry,

Thanks for the good advice. I'm definitely going to read the Confluent blog post. I actually went back and re-read the section on containers before reading your reply and realized I was mixing up jobs and containers.
Thanks,
Michael

On Thu, May 21, 2015 at 10:50 PM, Garry Turkington <g.turking...@improvedigital.com> wrote:

> Hi,
>
> The other variable to think about here is the task-to-container mapping.
> Each job will indeed have one task per input partition in the underlying
> topic, but you can then spread those 500 instances across multiple
> containers in your YARN grid:
>
> http://samza.apache.org/learn/documentation/0.9/container/samza-container.html
>
> I'd also suggest thinking about throughput requirements from both the
> Kafka and the Samza perspective. There's a great blog post from one of
> the Confluent guys here:
>
> http://blog.confluent.io/2015/03/12/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
>
> In my case I have Kafka brokers with JBOD disks, so my starting partition
> count was number of brokers * number of disks. I went with that,
> discovered it wasn't hitting my needed throughput, and had to raise it
> several times. Currently I have around 160 partitions for the
> high-throughput topics (700K msgs/sec) on a 5-broker cluster.
>
> So over-partitioning is a good thing and gives additional throughput and
> more flexible growth, but if you ramp up too soon you will likely also
> need to grow your job's container count. Look at your throughput
> requirements and hardware assets, pick a starting point, and test your
> assumptions. If you are like me, you'll find most of them are wrong. :)
>
> Garry
>
> -----Original Message-----
> From: Lukas Steiblys [mailto:lu...@doubledutch.me]
> Sent: 21 May 2015 20:20
> To: dev@samza.apache.org
> Subject: Re: Number of partitions
>
> Each job will get all the partitions, and each task (500 of them) within
> the job will get one partition. So there will be 500 processes working
> through the log.
>
> I'd try to figure out what your scaling needs are for the next 2-3 years
> and then calculate your resource requirements accordingly (how many
> parallel executing tasks you would need).
> If you need to split, it is not trivial, but doable.
>
> Lukas
>
> -----Original Message-----
> From: Michael Ravits
> Sent: Thursday, May 21, 2015 11:17 AM
> To: dev@samza.apache.org
> Subject: Re: Number of partitions
>
> Well, since the number of partitions can't be changed after the system
> starts running, I wanted the flexibility to grow a lot without stopping
> for an upgrade. I'm just wondering what a tolerable number would be for
> Samza. For example, if I started with 5 jobs, each would get 100
> partitions. Is this reasonable? Or too much for a single job instance?
>
> On Thu, May 21, 2015 at 7:46 PM, Lukas Steiblys <lu...@doubledutch.me> wrote:
>
> > 500 is a bit extreme unless you're planning on running the job on some
> > 200 machines and trying to exploit their full power. I personally run 4
> > in production for our system processing 100 messages/s, and there's
> > plenty of room to grow.
> >
> > Lukas
> >
> > On Thursday, May 21, 2015, Michael Ravits <michaelr...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I wonder what considerations I need to account for regarding the
> > > number of partitions in input topics for Samza. When testing with a
> > > 500-partition topic and one Samza job, I noticed the start-up time
> > > was very long. Are there any problems that might occur when dealing
> > > with this number of partitions?
> > >
> > > Thanks,
> > > Michael
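The sizing arithmetic discussed in the thread (a starting partition count from broker and disk counts, and Samza's one-task-per-partition mapping spread across YARN containers) can be sketched as below. This is only an illustration of the heuristics from the emails; the function names are made up and are not part of any Samza or Kafka API.

```python
import math


def starting_partitions(brokers, disks_per_broker):
    # Garry's JBOD starting point: one partition per physical
    # disk across the cluster (you may still need to raise this
    # after measuring actual throughput, as he did).
    return brokers * disks_per_broker


def tasks_per_container(partitions, containers):
    # Samza creates one task per input partition; those tasks
    # are then spread across the job's containers, so each
    # container runs roughly partitions / containers tasks.
    return math.ceil(partitions / containers)
```

For example, a 500-partition topic processed by a job running in 100 containers puts about 5 tasks in each container, which is how a large partition count stays manageable without 500 separate processes per machine.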