As a test, why not use a disk provisioned at 4,000 IOPS and see if 
throughput improves?
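
If you go that route, creating a bigger io1 volume is a one-liner with the
AWS CLI (purely illustrative - the size and availability zone below are
placeholders):

    aws ec2 create-volume --volume-type io1 --iops 4000 --size 100 \
        --availability-zone us-east-1a

Attach it to one broker, point the Kafka log directory at it, and rerun the
same load against that broker.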

Also, you haven't supplied any metrics on the VMs' performance. Is the CPU 
busy? Is disk I/O maxed out? The network? Run a tool like atop, and tell us 
what you find.
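
For example (intervals are arbitrary; iostat and sar come from the sysstat
package):

    atop 5          # whole-box view: CPU, disk, network, per-process
    iostat -x 5     # per-device utilization and await times
    sar -n DEV 5    # NIC throughput

Even a few minutes of those numbers captured under load would tell us a lot.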

Philip

On May 20, 2013, at 6:43 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:

> Hi Jason,
> 
> On May 20, 2013, at 10:01am, Jason Weiss wrote:
> 
>> Hi Scott.
>> 
>> I'm using Kafka 0.7.2. I am using the default replication factor, since I
>> don't recall changing that configuration at all.
>> 
>> I'm using provisioned IOPS, which was presented at the AWS event in NYC a
>> few weeks ago as the "fastest storage option" for EC2. A number of
>> partners presented success stories in terms of throughput with
>> provisioned IOPS. I've tried to follow that model.
> 
> In my experience, directly hitting an ephemeral drive on an m1.large is 
> faster than using EBS.
> 
> I've seen articles showing that RAIDing multiple EBS volumes can exceed 
> the performance of ephemeral drives, but with high variability.
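> 
> The striping itself is just mdadm - e.g. a RAID-0 across four attached 
> volumes (a sketch; device names are whatever EC2 assigns you):
> 
>     mdadm --create /dev/md0 --level=0 --raid-devices=4 \
>         /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
> 
> then mkfs and mount /dev/md0 as the Kafka log directory.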
> 
> If you want to maximize performance, set up a (smaller) cluster of 
> SSD-backed instances with 10Gb Ethernet in the same cluster placement 
> group.
> 
> E.g. test with three cr1.8xlarge instances.
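> 
> (The placement group is its own API call - e.g., with the AWS CLI, where 
> the group name is just a placeholder:
> 
>     aws ec2 create-placement-group --group-name kafka-perf --strategy cluster
> 
> then launch the instances with --placement GroupName=kafka-perf.)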
> 
> -- Ken
> 
> 
>> On 5/20/13 12:56 PM, "Scott Clasen" <sc...@heroku.com> wrote:
>> 
>>> My guess: EBS is likely your bottleneck. Try running on instance-local
>>> disks and compare your results. Is this 0.8? What replication factor are
>>> you using?
>>> 
>>> 
>>> On Mon, May 20, 2013 at 8:11 AM, Jason Weiss <jason_we...@rapid7.com>
>>> wrote:
>>> 
>>>> I'm trying to maximize my throughput and seem to have hit a ceiling.
>>>> Everything described below is running in AWS.
>>>> 
>>>> I have configured a Kafka cluster of 5 machines, M1.Large, with 600
>>>> provisioned IOPS of storage for each EC2 instance. I have a single
>>>> Zookeeper server (we aren't in production yet, so I didn't take the
>>>> time to set up a ZK cluster). Publishing to a single topic from 7
>>>> different clients, I seem to max out at around 20,000 eps with a fixed
>>>> 2K message size. Each broker defines 10 file segments, with a
>>>> 25,000-message / 5-second flush configuration in server.properties. I
>>>> have stuck with 8 threads. My producers (Java) are configured with
>>>> batch.num.messages at 50, and queue.buffering.max.messages at 100.
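>>>> 
>>>> For reference, the producer setup is roughly this sketch (0.7 javaapi;
>>>> the ZK host, topic name, and payload are placeholders for our real
>>>> values):
>>>> 
>>>>     import java.util.Properties;
>>>>     import kafka.javaapi.producer.Producer;
>>>>     import kafka.javaapi.producer.ProducerData;
>>>>     import kafka.producer.ProducerConfig;
>>>> 
>>>>     Properties props = new Properties();
>>>>     props.put("zk.connect", "zk-host:2181");
>>>>     props.put("serializer.class", "kafka.serializer.StringEncoder");
>>>>     props.put("producer.type", "async"); // batching only applies to async
>>>>     props.put("batch.num.messages", "50");
>>>>     props.put("queue.buffering.max.messages", "100");
>>>> 
>>>>     Producer<String, String> producer =
>>>>         new Producer<String, String>(new ProducerConfig(props));
>>>>     String payload = new String(new byte[2048]); // stand-in 2K message
>>>>     producer.send(new ProducerData<String, String>("events", payload));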
>>>> 
>>>> When I went from 4 servers in the cluster to 5 servers, I only saw an
>>>> increase of about 500 events per second in throughput. In sharp
>>>> contrast, when I run a complete environment on my MacBook Pro, tuned as
>>>> described above but with a single ZK and a single Kafka broker, I see
>>>> 61,000 events per second. I don't think I'm network-constrained in the
>>>> AWS environment (producer side), because when I add one more client -
>>>> my MacBook Pro - I see a proportionate decrease in EC2 client
>>>> throughput, and the net result is an identical 20,000 eps. Stated
>>>> differently, my EC2 instances give up throughput when my MacBook Pro
>>>> joins the array of producers, such that the total throughput stays
>>>> exactly the same.
>>>> 
>>>> Does anyone have any additional suggestions on what else I could tune
>>>> to try to hit our goal of 50,000 eps with a 5-machine cluster? Based on
>>>> published whitepapers, LinkedIn describes a peak of 170,000 events per
>>>> second across their cluster. My 20,000 seems so far away from their
>>>> production figures.
>>>> 
>>>> What is the relationship, in terms of performance, between ZK and
>>>> Kafka? Do I need a more performant ZK cluster, or does it really not
>>>> matter for maximizing throughput?
>>>> 
>>>> Thanks for any suggestions - I've been turning knobs and pulling levers
>>>> on this for several days now.
>>>> 
>>>> 
>>>> Jason
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
