Re: Cassandra x MySQL Sharded - Insert Comparison

Gustavo Gustavo Tue, 24 Jan 2012 19:42:31 -0800

I was able to make Cassandra beat MySQL MyISAM (~10k inserts/s against 6k
inserts/s) using two physical machines (laptops) - one the client, and the
other one the server, with 50 inserting threads.
I don't know exactly why yet, but the high-level C# client that I was using
(Aquiles) was taking a lot of CPU. I switched to fluent-cassandra and
things started to go pretty fast. I suspect this was the real problem.
Yep, dual boot is a good idea. I'll give it a try and see if I can push
both datastores forward. But I think the client won't have enough CPU to
handle much more than 50 threads.


/Gustavo

2012/1/24 Maxim Potekhin <potek...@bnl.gov>

>  a) I hate to break it to you, but 6GB x 4 cores != 'high-end machine'.
> It's pretty much middle of the road consumer level these days.
>
> b) Hosting the client and Cassandra on the same node is a Bad Idea. It
> will depend on what exactly the client will do, but in my experience it
> won't work too well in general.
>
> c) Have you considered dual boot, so you can have a "good operating
> system" (as per Cassandra folks) in addition to Windows?
>
> Maxim
>
>
>
> On 1/22/2012 8:22 PM, Gustavo Gustavo wrote:
>
> Ok guys, thank you for the valuable hints you gave me.
> For sure, things will perform much better on real hardware. But my
> objective isn't really to find the maximum throughput that the
> datastores can reach. It is more to see which one performs better under
> equal conditions.
> But I'll do it this way: I'm going to use a high-end machine (6GB RAM, 4
> cores) and run Cassandra, MySQL and the client test application on the same
> machine. Unfortunately, I'll have to use Windows 7 as the host for the
> datastores.
> From your experience, do you think that Cassandra can beat an RDBMS on
> inserts, even on a single node? I've seen that InnoDB (which is comparable
> to the engines of other relational databases) is pretty slow. But when
> it comes to MyISAM, things are much faster.
>
> /Gustavo
>
> 2012/1/22 Chris Gerken <chrisger...@mindspring.com>
>
>> Edward (and Maxim),
>>
>>  I agree.  I was just recalling previous performance bake-offs (for
>> other technologies, long time ago, galaxy far far away) in which the
>> customer had put together a mockup of the high throughput expected in
>> production and wanted to make a decision against that one set of numbers.
>>  We always found that both/all competing products could be made to run
>> faster due to unexpected factors in the non-production test build.  For our
>> side, we always started simple and built up the throughput until we found a
>> bottleneck.  We fixed the bottleneck. Rinse and repeat.
>>
>>   Chris Gerken
>>
>>  chrisger...@mindspring.com
>> 512.587.5261
>> http://www.linkedin.com/in/chgerken
>>
>>
>>
>>  On Jan 22, 2012, at 8:51 AM, Edward Capriolo wrote:
>>
>> In some sense one-for-one performance "almost" does not matter. Though I bet
>> you can get Cassandra to do better (I remember old-school YCSB white paper
>> benchmarks against a sharded MySQL).
>>
>> One of the main bullet points of Cassandra is that if you want to grow from 4
>> nodes, to 8 nodes, to 14 nodes, and so on, Cassandra is elastic and
>> supports adding and removing nodes online. A do-it-yourself hash-mod
>> algorithm like this really has no upgrade path.
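To make the "no upgrade path" point concrete, here is a minimal sketch, assuming keys are plain integers routed with `key % N` as in the MySQL setup described in this thread. The function names and the ID range are hypothetical, chosen only for illustration. It counts how many keys would land on a different shard after the shard count changes, which is the data that would have to be moved by hand.

```python
def shard_for(key: int, shard_count: int) -> int:
    """Route a key to a shard with the do-it-yourself `key % N` rule."""
    return key % shard_count


def fraction_moved(keys, old_count: int, new_count: int) -> float:
    """Share of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


if __name__ == "__main__":
    vehicle_ids = range(1, 1001)  # hypothetical vehicle IDs 1..1000
    for old, new in [(2, 3), (2, 4), (4, 8)]:
        share = fraction_moved(vehicle_ids, old, new)
        print(f"{old} -> {new} shards: {share:.0%} of keys change shard")
```

With uniformly spread integer keys, going from 2 to 3 shards relocates about two thirds of the keys, and even doubling the shard count relocates about half of them.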
>>
>> Edward
>>
>> On Sun, Jan 22, 2012 at 9:26 AM, Chris Gerken <chrisger...@mindspring.com
>> > wrote:
>>
>>> Howdy Gustavo,
>>>
>>>  One thing that jumped out at me is your having put two cassandra
>>> images on the same box.  There may be enough CPU and memory for the two
>>> images combined but you may be seeing some other resource not being shared
>>> so nicely - network card bandwidth, for example.
>>>
>>>  More generally, the real question is what the bottleneck is (for both
>>> db's, actually).  Start with Cassandra running in that configuration and
>>> start with one client thread sending one request a second.  Look at the
>>> CPU, network and memory metrics for all boxes (including the client).
>>>  Nothing should be even close to maxing out at that throughput.  Now
>>> incrementally increase one of the test parameters (number of clients or
>>> number of inserts per second) just a bit (say from one transaction to 5)
>>> and note the above metrics.  Keep slowly increasing the test parameters,
>>> one at a time, until one of the metrics maxes out.  That's the bottleneck
>>> you're wondering about.  Fix that and the db (be it Cassandra or MySQL)
>>> will move ahead of the other performance-wise.  Turn your attention to the
>>> other db and repeat.
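The ramp-up procedure above can be sketched roughly as follows. This is only an illustration under stated assumptions: `fake_insert` is a hypothetical stand-in for a real datastore call, the step sizes are arbitrary, and the script measures client-side throughput only. CPU, network and memory on each box still have to be watched separately at every step.

```python
import threading
import time


def fake_insert() -> None:
    """Stand-in for one datastore insert (assumption: swap in a real client call)."""
    time.sleep(0.001)


def measure_throughput(insert, threads: int, inserts_per_thread: int) -> float:
    """Run `threads` workers doing `inserts_per_thread` inserts each; return inserts/s."""

    def worker() -> None:
        for _ in range(inserts_per_thread):
            insert()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    start = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    elapsed = time.perf_counter() - start
    return threads * inserts_per_thread / elapsed


def ramp(insert, thread_steps, inserts_per_thread: int = 50):
    """Raise one parameter (thread count) step by step and record the throughput."""
    return [
        (threads, measure_throughput(insert, threads, inserts_per_thread))
        for threads in thread_steps
    ]


if __name__ == "__main__":
    for threads, rate in ramp(fake_insert, [1, 2, 4, 8]):
        # Check the metrics on every box at this step before moving to the next one.
        print(f"{threads:>2} threads: {rate:8.0f} inserts/s")
```

When throughput stops rising from one step to the next, the resource that is maxed out at that step is the candidate bottleneck.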
>>>
>>>   - Chris Gerken
>>>
>>>   On Jan 22, 2012, at 7:10 AM, Gustavo Gustavo wrote:
>>>
>>> Hello,
>>>
>>> I've set up a testing environment for Cassandra and MySQL to compare
>>> the two with regard to *performance only*. And I must admit that I was expecting
>>> Cassandra to beat MySQL. But I've not seen this happening up to now.
>>> My application/use case is INSERT intensive, since I'm not updating
>>> anything, just inserting all the time.
>>> To compare both I created virtual machines with Ubuntu 11.10, and
>>> installed the latest versions of each datastore. Each VM has 1GB of RAM.
>>> I've used VMs as a way to give both datastores an equal sandbox.
>>> MySQL is set up to work as sharded, with 2 databases, which means that
>>> records are inserted into a specific instance based on key % 2. The engine is
>>> MyISAM (InnoDB was really slow and not really needed for my case). There's a
>>> compound primary key (integer and datetime columns) in this test table.
>>> Let's name the "nodes" MySQL1 and MySQL2.
>>> Cassandra is set up to work with 4 nodes, with keys (tokens) set up to
>>> distribute records evenly across the 4 nodes (nodetool ring reports 25% to
>>> each node), replication factor 1 and RandomPartitioner, the other configs
>>> are left to default. Let's name the nodes Cassandra1, Cassandra2,
>>> Cassandra3 and Cassandra4.
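For reference, evenly spaced tokens of the kind described above are usually computed as `i * 2**127 / N` for RandomPartitioner. A minimal sketch, assuming that formula and the four node names used in this message (the helper name `initial_tokens` is made up for illustration):

```python
RING_SIZE = 2 ** 127  # RandomPartitioner tokens fall in the range 0 .. 2**127


def initial_tokens(node_count: int) -> list:
    """Evenly spaced tokens, one per node, so each node owns 1/node_count of the ring."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    for node, token in enumerate(initial_tokens(4), start=1):
        print(f"Cassandra{node}: initial_token = {token}")
```

Each printed value would go into the `initial_token` setting of the corresponding node, which is what makes `nodetool ring` report 25% per node.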
>>>
>>> I'm using 2 physical machines (Windows7) to host the 4 (Cassandra) or 2
>>> (MySQL) virtual machines, this way:
>>> Machine1: MySQL1, Cassandra1, Cassandra3
>>> Machine2: MySQL2, Cassandra2, Cassandra4
>>> The machines have enough CPU and RAM to host either the Cassandra cluster
>>> or the MySQL "cluster", one at a time.
>>>
>>> The client test application is running on a third physical machine, with
>>> 8 threads doing inserts. The test application is written in C# (Windows7)
>>> using the Aquiles high-level client.
>>>
>>> My use case is a vehicle tracking system. So, let's suppose, from minute
>>> to minute, the vehicle sends its position together with some other GPS data
>>> and vehicle status information. The columns in my Cassandra cluster are
>>> just the DateTime (long value) of a position for a specific vehicle, and
>>> the value is all the other data serialized to binary format. Therefore, my
>>> CF really grows in number of columns. So all data is inserted into only one
>>> CF/table named Positions. The key for Cassandra is the VehicleID and for
>>> MySQL it is VehicleID + PositionDateTime (MySQL creates an index on this
>>> automatically). It is important to note that MySQL threw tons of connection
>>> exceptions; even so, each insert was retried until it got through.
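The wide-row layout described above can be pictured with a small in-memory sketch. Everything specific here is an assumption made for illustration: the payload fields (latitude, longitude, speed), the `struct` format, the use of milliseconds for the long value, and the helper names. The thread only says that the row key is the VehicleID, the column name is the position DateTime as a long, and the value is the remaining data serialized to binary.

```python
import struct
from datetime import datetime, timezone

# Hypothetical payload layout: latitude and longitude as doubles, speed as a float.
POSITION_FORMAT = ">ddf"

# In-memory stand-in for the Positions column family: row key -> {column name: value}.
positions = {}


def column_name(moment: datetime) -> int:
    """Column name: the position timestamp as a long (milliseconds since the epoch)."""
    return int(moment.timestamp() * 1000)


def column_value(lat: float, lon: float, speed: float) -> bytes:
    """Column value: the remaining position data serialized to binary."""
    return struct.pack(POSITION_FORMAT, lat, lon, speed)


def insert_position(vehicle_id: int, moment: datetime,
                    lat: float, lon: float, speed: float) -> None:
    """Add one column to the vehicle's row; the row grows by one column per position."""
    row = positions.setdefault(vehicle_id, {})
    row[column_name(moment)] = column_value(lat, lon, speed)


if __name__ == "__main__":
    insert_position(42, datetime(2012, 1, 21, 11, 0, tzinfo=timezone.utc), -23.5, -46.6, 60.0)
    insert_position(42, datetime(2012, 1, 21, 11, 1, tzinfo=timezone.utc), -23.6, -46.7, 62.0)
    print(len(positions[42]), "columns in the row for vehicle 42")
```

In the MySQL table the same two pieces of information, VehicleID and PositionDateTime, form the compound primary key of one row per position instead.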
>>>
>>> My test case was to insert 1k positions per day for each of 1k vehicles
>>> over 10 days, which gives 10,000,000 inserts.
>>>
>>> The final throughput that my application had for this scenario was:
>>>
>>> Cassandra x 4
>>> 2012-01-21 11:45:38,044 #6 [Logger.Log] INFO - >> Inserted 10000 positions for 1000 vehicles (10000000 inserts):
>>> 2012-01-21 11:45:38,082 #6 [Logger.Log] INFO - >> Total Time: 2:37:03,359
>>> 2012-01-21 11:45:38,085 #6 [Logger.Log] INFO - >> Throughput: 1061 inserts/s
>>>
>>> And for MySQL x 2
>>> 2012-01-21 14:26:25,197 #6 [Logger.Log] INFO - >> Inserted 10000 positions for 1000 vehicles (10000000 inserts):
>>> 2012-01-21 14:26:25,250 #6 [Logger.Log] INFO - >> Total Time: 2:06:25,914
>>> 2012-01-21 14:26:25,263 #6 [Logger.Log] INFO - >> Throughput: 1318 inserts/s
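As a sanity check, the throughput figures in both logs follow from dividing the 10,000,000 inserts by the reported total time. A short sketch of that arithmetic; the helper names are made up for illustration, and the durations are copied from the log lines above:

```python
def parse_elapsed(text: str) -> float:
    """Convert an 'H:MM:SS,mmm' duration (comma as decimal separator) to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds.replace(",", "."))


def throughput(inserts: int, elapsed: str) -> int:
    """Whole inserts per second for a run that took `elapsed`."""
    return int(inserts / parse_elapsed(elapsed))


if __name__ == "__main__":
    total_inserts = 10_000_000
    print("Cassandra x 4:", throughput(total_inserts, "2:37:03,359"), "inserts/s")  # 1061
    print("MySQL x 2:", throughput(total_inserts, "2:06:25,914"), "inserts/s")      # 1318
```

In this run MySQL finished roughly 24% ahead of Cassandra (1318 against 1061 inserts/s).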
>>>
>>> Is there something that I'm missing here? Is this expected? Or is the
>>> problem somewhere else, and is that hard to say from this
>>> description?
>>>
>>> Cheers,
>>> Gustavo
>>>
>>>
>>>
>>
>>
>
>
