You are blowing away the mostly saner default JVM_OPTS by running it
that way.  Edit cassandra.in.sh (or wherever the config lives on your
system) to increase -Xmx to 4G (not 6G, for now), leave everything
else untouched, and do not specify JVM_OPTS on the command line.  See
if you get the same behavior.
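
For example, in cassandra.in.sh the only change would be the -Xmx
value; something along these lines (the surrounding flags here are
illustrative -- keep whatever your copy of the script already ships
with):

    JVM_OPTS=" \
            -ea \
            -Xms128M \
            -Xmx4G \
            -XX:+UseParNewGC \
            -XX:+UseConcMarkSweepGC \
            -XX:+HeapDumpOnOutOfMemoryError"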


b

On Mon, Apr 5, 2010 at 11:48 PM, Ilya Maykov <ivmay...@gmail.com> wrote:
> No, the disks on all nodes have about 750GB free space. Also as
> mentioned in my follow-up email, writing with ConsistencyLevel.ALL
> makes the slowdowns / crashes go away.
>
> -- Ilya
>
> On Mon, Apr 5, 2010 at 11:46 PM, Ran Tavory <ran...@gmail.com> wrote:
>> Do you see one of the disks used by cassandra filled up when a node crashes?
>>
>> On Tue, Apr 6, 2010 at 9:39 AM, Ilya Maykov <ivmay...@gmail.com> wrote:
>>>
>>> I'm running the nodes with a JVM heap size of 6GB, and here are the
>>> related options from my storage-conf.xml. As mentioned in the first
>>> email, I left everything at the default values. I briefly googled
>>> around for "Cassandra performance tuning" etc. but haven't found a
>>> definitive guide ... any help with tuning these parameters is greatly
>>> appreciated!
>>>
>>>  <DiskAccessMode>auto</DiskAccessMode>
>>>  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
>>>  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
>>>  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
>>>  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
>>>  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
>>>  <MemtableThroughputInMB>64</MemtableThroughputInMB>
>>>  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
>>>  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
>>>  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
>>>  <ConcurrentReads>8</ConcurrentReads>
>>>  <ConcurrentWrites>64</ConcurrentWrites>
>>>  <CommitLogSync>periodic</CommitLogSync>
>>>  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
>>>  <GCGraceSeconds>864000</GCGraceSeconds>
>>>
>>> -- Ilya
>>>
>>> On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman <shulm...@gmail.com> wrote:
>>> > You are running out of memory on your nodes. Before the final crash,
>>> > your nodes are probably slow due to GC. What is your memtable size?
>>> > What cache options did you configure?
>>> >
>>> > On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov <ivmay...@gmail.com> wrote:
>>> >> Hi all,
>>> >>
>>> >> I've just started experimenting with Cassandra to get a feel for the
>>> >> system. I've set up a test cluster and to get a ballpark idea of its
>>> >> performance I wrote a simple tool to load some toy data into the
>>> >> system. Surprisingly, I am able to "overwhelm" my 4-node cluster with
>>> >> writes from a single client. I'm trying to figure out if this is a
>>> >> problem with my setup, if I'm hitting bugs in the Cassandra codebase,
>>> >> or if this is intended behavior. Sorry this email is kind of long;
>>> >> here is the TL;DR version:
>>> >>
>>> >> While writing to Cassandra from a single node, I am able to get the
>>> >> cluster into a bad state, where nodes are randomly disconnecting from
>>> >> each other, write performance plummets, and sometimes nodes even
>>> >> crash. Further, the nodes do not recover as long as the writes
>>> >> continue (even at a much lower rate), and sometimes do not recover at
>>> >> all unless I restart them. I can get this to happen simply by throwing
>>> >> data at the cluster fast enough, and I'm wondering if this is a known
>>> >> issue or if I need to tweak my setup.
>>> >>
>>> >> Now, the details.
>>> >>
>>> >> First, a little bit about the setup:
>>> >>
>>> >> 4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
>>> >> the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
>>> >> in. Node specs:
>>> >> 8-core Intel Xeon e5...@2.00ghz
>>> >> 8GB RAM
>>> >> 1Gbit ethernet
>>> >> Red Hat Linux 2.6.18
>>> >> JVM 1.6.0_19 64-bit
>>> >> 1TB spinning disk houses both commitlog and data directories (which I
>>> >> know is not ideal).
>>> >> The client machine is on the same local network and has very similar
>>> >> specs.
>>> >>
>>> >> The cassandra nodes are started with the following JVM options:
>>> >>
>>> >> ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
>>> >> -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"
>>> >>
>>> >> I'm using default settings for all of the tunable stuff at the bottom
>>> >> of storage-conf.xml. I also selected my initial tokens to evenly
>>> >> partition the key space when the cluster was bootstrapped. I am using
>>> >> the RandomPartitioner.
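>>> >>
>>> >> (For the RandomPartitioner, "evenly partitioned" just means tokens at
>>> >> i * 2^127 / N for i = 0..N-1. A throwaway sketch of that computation
>>> >> for the 4-node case -- class and variable names are made up here:)
>>> >>
>>> >> import java.math.BigInteger;
>>> >>
>>> >> public class InitialTokens {
>>> >>     public static void main(String[] args) {
>>> >>         int nodes = 4;                                    // size of the ring
>>> >>         BigInteger ringSize = BigInteger.valueOf(2).pow(127);
>>> >>         for (int i = 0; i < nodes; i++) {
>>> >>             // evenly spaced token for node i: i * 2^127 / nodes
>>> >>             BigInteger token = ringSize
>>> >>                     .multiply(BigInteger.valueOf(i))
>>> >>                     .divide(BigInteger.valueOf(nodes));
>>> >>             System.out.println("node " + i + " InitialToken: " + token);
>>> >>         }
>>> >>     }
>>> >> }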
>>> >>
>>> >> Now, about the test. Basically I am trying to get an idea of just how
>>> >> fast I can make this thing go. I am writing ~250M data records into
>>> >> the cluster, replicated at 3x, using Ran Tavory's Hector client
>>> >> (Java), writing with ConsistencyLevel.ZERO and
>>> >> FailoverPolicy.FAIL_FAST. The client is using 32 threads with 8
>>> >> threads talking to each of the 4 nodes in the cluster. Records are
>>> >> identified by a numeric id, and I'm writing them in batches of up to
>>> >> 10k records per row, with each record in its own column. The row key
>>> >> identifies the bucket into which records fall. So, records with ids 0
>>> >> - 9999 are written to row "0", 10000 - 19999 are written to row
>>> >> "10000", etc. Each record is a JSON object with ~10-20 fields.
>>> >>
>>> >> Records: {                          // Column Family
>>> >>   0 : {                             // row key for the start of the bucket;
>>> >>                                     // buckets span up to 10000 records
>>> >>     1 : "{ /* some JSON */ }",      // column for record with id=1
>>> >>     3 : "{ /* some more JSON */ }", // column for record with id=3
>>> >>     ...
>>> >>     9999 : "{ /* ... */ }"
>>> >>   },
>>> >>   10000 : {                         // row key for the start of the next bucket
>>> >>     10001 : "{ /* ... */ }",
>>> >>     10004 : "{ /* ... */ }",
>>> >>     ...
>>> >>   },
>>> >>   ...
>>> >> }
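>>> >>
>>> >> (So the row key for a record is just its id rounded down to the
>>> >> nearest multiple of 10000; in the loader that boils down to a
>>> >> one-liner like this, with the method name invented for the example:)
>>> >>
>>> >>     // bucket key = start of the 10000-record range the id falls into
>>> >>     static String bucketKey(long recordId) {
>>> >>         return Long.toString(recordId - (recordId % 10000));
>>> >>     }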
>>> >>
>>> >> I am reading the data out of a local, sorted file on the client, so I
>>> >> only write a row to Cassandra once all records for that row have been
>>> >> read, and each row is written to exactly once. I'm using a
>>> >> producer-consumer queue to pump data from the input reader thread to
>>> >> the output writer threads. I found that I have to throttle the reader
>>> >> thread heavily in order to get good behavior. So, if I make the reader
>>> >> sleep for 7 seconds every 1M records, everything is fine - the data
>>> >> loads in about an hour, half of which is spent by the reader thread
>>> >> sleeping. While the reader is not sleeping, I see ~40-50 MB/s of
>>> >> throughput on the client's network interface, and it takes ~7-8
>>> >> seconds to write each batch of 1M records.
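>>> >>
>>> >> (Skeleton of that reader/writer layout, with the Hector call elided
>>> >> and all class/method names invented for the sketch:)
>>> >>
>>> >> import java.util.ArrayList;
>>> >> import java.util.List;
>>> >> import java.util.concurrent.ArrayBlockingQueue;
>>> >> import java.util.concurrent.BlockingQueue;
>>> >>
>>> >> public class ThrottledLoader {
>>> >>     static final BlockingQueue<List<String>> QUEUE =
>>> >>             new ArrayBlockingQueue<List<String>>(64);
>>> >>
>>> >>     public static void main(String[] args) throws InterruptedException {
>>> >>         for (int i = 0; i < 32; i++) {            // 32 writers, 8 per node
>>> >>             new Thread(new Runnable() {
>>> >>                 public void run() {
>>> >>                     try {
>>> >>                         while (true) {
>>> >>                             List<String> batch = QUEUE.take();
>>> >>                             // batch-insert 'batch' into one node via
>>> >>                             // Hector at ConsistencyLevel.ZERO (omitted)
>>> >>                         }
>>> >>                     } catch (InterruptedException e) { /* shutting down */ }
>>> >>                 }
>>> >>             }).start();
>>> >>         }
>>> >>
>>> >>         long sinceSleep = 0;
>>> >>         for (List<String> batch : readBatchesFromSortedFile()) {
>>> >>             QUEUE.put(batch);                     // blocks if writers lag
>>> >>             sinceSleep += batch.size();
>>> >>             if (sinceSleep >= 1000000) {          // the 7s-per-1M throttle
>>> >>                 Thread.sleep(7000);
>>> >>                 sinceSleep = 0;
>>> >>             }
>>> >>         }
>>> >>     }
>>> >>
>>> >>     // stand-in for the real input reader
>>> >>     static Iterable<List<String>> readBatchesFromSortedFile() {
>>> >>         return new ArrayList<List<String>>();
>>> >>     }
>>> >> }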
>>> >>
>>> >> Now, if I remove the 7-second sleeps on the client side, things get
>>> >> bad after the first ~8M records are written to the cluster. Write
>>> >> throughput drops to <5 MB/s. I start seeing messages about nodes
>>> >> disconnecting and reconnecting in Cassandra's system.log, as well as
>>> >> lots of GC messages:
>>> >>
>>> >> ...
>>> >>  INFO [Timer-1] 2010-04-06 04:03:27,178 Gossiper.java (line 179)
>>> >> InetAddress /10.15.38.88 is now dead.
>>> >>  INFO [GC inspection] 2010-04-06 04:03:30,259 GCInspector.java (line
>>> >> 110) GC for ConcurrentMarkSweep: 2989 ms, 55326320 reclaimed leaving
>>> >> 1035998648 used; max is 1211170816
>>> >>  INFO [GC inspection] 2010-04-06 04:03:41,838 GCInspector.java (line
>>> >> 110) GC for ConcurrentMarkSweep: 3004 ms, 24377240 reclaimed leaving
>>> >> 1066120952 used; max is 1211170816
>>> >>  INFO [Timer-1] 2010-04-06 04:03:44,136 Gossiper.java (line 179)
>>> >> InetAddress /10.15.38.55 is now dead.
>>> >>  INFO [GMFD:1] 2010-04-06 04:03:44,138 Gossiper.java (line 568)
>>> >> InetAddress /10.15.38.55 is now UP
>>> >>  INFO [GC inspection] 2010-04-06 04:03:52,957 GCInspector.java (line
>>> >> 110) GC for ConcurrentMarkSweep: 2319 ms, 4504888 reclaimed leaving
>>> >> 1086023832 used; max is 1211170816
>>> >>  INFO [Timer-1] 2010-04-06 04:04:19,508 Gossiper.java (line 179)
>>> >> InetAddress /10.15.38.242 is now dead.
>>> >>  INFO [Timer-1] 2010-04-06 04:05:03,039 Gossiper.java (line 179)
>>> >> InetAddress /10.15.38.55 is now dead.
>>> >>  INFO [GMFD:1] 2010-04-06 04:05:03,041 Gossiper.java (line 568)
>>> >> InetAddress /10.15.38.55 is now UP
>>> >>  INFO [GC inspection] 2010-04-06 04:05:08,539 GCInspector.java (line
>>> >> 110) GC for ConcurrentMarkSweep: 2375 ms, 39534920 reclaimed leaving
>>> >> 1051620856 used; max is 1211170816
>>> >> ...
>>> >>
>>> >> Finally, this is followed by the error below, and some or all nodes go down:
>>> >>
>>> >> ERROR [COMPACTION-POOL:1] 2010-04-06 04:05:05,475
>>> >> DebuggableThreadPoolExecutor.java (line 94) Error in executor
>>> >> futuretask
>>> >> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError:
>>> >> Java heap space
>>> >>        at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
>>> >>        at java.util.concurrent.FutureTask.get(Unknown Source)
>>> >>        at
>>> >> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
>>> >>        at
>>> >> org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582)
>>> >>        at
>>> >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>> >>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
>>> >> Source)
>>> >>        at java.lang.Thread.run(Unknown Source)
>>> >> Caused by: java.lang.OutOfMemoryError: Java heap space
>>> >>        at java.util.Arrays.copyOf(Unknown Source)
>>> >>        at java.io.ByteArrayOutputStream.write(Unknown Source)
>>> >>        at java.io.DataOutputStream.write(Unknown Source)
>>> >>        at
>>> >> org.apache.cassandra.io.IteratingRow.echoData(IteratingRow.java:69)
>>> >>        at
>>> >> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:138)
>>> >>        at
>>> >> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:1)
>>> >>        at
>>> >> org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>>> >>        at
>>> >> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>> >>        at
>>> >> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>> >>        at
>>> >> org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>>> >>        at
>>> >> org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>>> >>        at
>>> >> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:299)
>>> >>        at
>>> >> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
>>> >>        at
>>> >> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:1)
>>> >>        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>> >>        at java.util.concurrent.FutureTask.run(Unknown Source)
>>> >>        ... 3 more
>>> >>
>>> >> At first I thought that with ConsistencyLevel.ZERO I must be doing
>>> >> async writes, so Cassandra can't push back on the client threads (by
>>> >> blocking them) and the servers get overwhelmed. But I would expect
>>> >> it to start dropping data rather than crash in that case (after all,
>>> >> I did say ZERO, so I can't expect any reliability, right?).
>>> >> However, I see similar slowdown / node dropout behavior when I set the
>>> >> consistency level to ONE. Does Cassandra push back on writers under
>>> >> heavy load? Is there some magic setting I need to tune to have it not
>>> >> fall over? Do I just need a bigger cluster? Thanks in advance,
>>> >>
>>> >> -- Ilya
>>> >>
>>> >> P.S. I realize that this is still a LOT of data for just 4 nodes to
>>> >> handle, and in practice nobody would run a system that gets 125k writes
>>> >> per second on top of a 4-node cluster. I was just surprised that I
>>> >> could make Cassandra fall over at all using a single client that's
>>> >> pumping data at 40-50 MB/s.
>>> >>
>>> >
>>
>>
>
