Hi,

The other variable to think about here is the task-to-container mapping. Each 
job will indeed have 1 task per input partition in the underlying topic, but you 
can then spread those 500 task instances across multiple containers in your 
YARN grid:

http://samza.apache.org/learn/documentation/0.9/container/samza-container.html
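
As a rough sketch (the property names are from the 0.9 docs; the job, class and 
topic names below are just placeholders), a job config along these lines would 
pack those 500 tasks into 10 YARN containers, roughly 50 tasks each:

  # Placeholder job definition; adjust names for your own job.
  job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
  job.name=my-partitioned-job
  task.class=com.example.MyStreamTask
  task.inputs=kafka.my-500-partition-topic
  # Spread the job's 500 tasks across 10 YARN containers (~50 tasks each).
  yarn.container.count=10

Tuning yarn.container.count is essentially how you trade parallelism against 
cluster footprint without touching the partition count.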

I'd also suggest thinking about throughput requirements from both the Kafka and 
the Samza perspective. There's a great blog post from one of the Confluent folks 
here:

http://blog.confluent.io/2015/03/12/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/

In my case the Kafka brokers have JBOD disks, so my starting partition count was 
number of brokers * number of disks per broker. I went with that, discovered it 
wasn't hitting my required throughput, and had to raise it several times. 
Currently I have around 160 partitions for the high-throughput topics 
(700K msgs/sec) on a 5-broker cluster.
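
Working backwards from those numbers (just rough, after-the-fact arithmetic, not 
a measured per-partition benchmark):

  starting point:  partitions = number of brokers * number of disks per broker
  in practice:     ~700,000 msgs/sec / 160 partitions ~= 4,400 msgs/sec
                   sustained per partition on this hardware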

So over-partitioning is a good thing: it gives you additional throughput and 
more flexible growth. But if you ramp it up too soon, you will likely need to 
grow your job's container count as well. Look at your throughput requirements 
and hardware assets, pick a starting point, and test your assumptions. If you 
are like me, you'll find most of them are wrong. :)

Garry

-----Original Message-----
From: Lukas Steiblys [mailto:lu...@doubledutch.me] 
Sent: 21 May 2015 20:20
To: dev@samza.apache.org
Subject: Re: Number of partitions

Each job will get all the partitions, and each task within the job (500 of them) 
will get 1 partition. So there will be 500 tasks working through the log.
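
To make that concrete, here is a bare-bones task sketch (the class name, package 
and method body are placeholders; the interface is the standard Samza 
StreamTask); the framework creates one instance of it per input partition:

  import org.apache.samza.system.IncomingMessageEnvelope;
  import org.apache.samza.task.MessageCollector;
  import org.apache.samza.task.StreamTask;
  import org.apache.samza.task.TaskCoordinator;

  // Placeholder task: one instance is created per input partition, so a
  // 500-partition topic yields 500 task instances within the job.
  public class MyStreamTask implements StreamTask {
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
          // envelope.getSystemStreamPartition() identifies the partition this
          // instance owns; envelope.getMessage() is the deserialized payload.
          Object message = envelope.getMessage();
          // ... per-message processing goes here ...
      }
  }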

I'd try to figure out what your scaling needs are for the next 2-3 years and 
then calculate your resource requirements accordingly (how many concurrently 
executing tasks you would need). If you need to repartition later, it is not 
trivial, but it is doable.
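
For what it's worth, Kafka's tooling can add partitions to an existing topic, 
something along these lines (the ZooKeeper host and topic name are 
placeholders):

  bin/kafka-topics.sh --zookeeper zkhost:2181 --alter \
      --topic my-topic --partitions 1000

The catch is that keyed messages then map to different partitions, and a 
stateful Samza job's per-task state and checkpoints are tied to the old 
partition count, which is a large part of what makes it non-trivial.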

Lukas

-----Original Message-----
From: Michael Ravits
Sent: Thursday, May 21, 2015 11:17 AM
To: dev@samza.apache.org
Subject: Re: Number of partitions

Well, since the number of partitions can't easily be changed after the system 
starts running, I wanted the flexibility to grow a lot without stopping for an 
upgrade.
I just wonder what would be a tolerable number for Samza.
For example, if I start with 5 jobs, each will get 100 partitions. Is that 
reasonable? Or too much for a single job instance?

On Thu, May 21, 2015 at 7:46 PM, Lukas Steiblys <lu...@doubledutch.me>
wrote:

> 500 is a bit extreme unless you're planning on running the job on some 
> 200 machines and trying to exploit their full power. I personally run 4 
> partitions in production for our system processing 100 messages/s, and 
> there's plenty of room to grow.
>
> Lukas
>
> On Thursday, May 21, 2015, Michael Ravits <michaelr...@gmail.com> wrote:
>
> > Hi,
> >
> > I wonder what considerations I need to account for in regard to the 
> > number of partitions in input topics for Samza.
> > When testing a 500-partition topic with one Samza job, I noticed 
> > that the start-up time was very long.
> > Are there any problems that might occur when dealing with this 
> > number of partitions?
> >
> > Thanks,
> > Michael
> >
> 

