Logged batch.
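For context on the "exceeding specified threshold of 5120" warnings quoted below: 5120 bytes is the default batch size warn threshold (batch_size_warn_threshold_in_kb: 5 in cassandra.yaml), so a single 100 KB blob in a batch already trips it. A minimal sketch of one way to keep batch payloads under that limit by greedily packing mutations by size — the helper and the example sizes are hypothetical, not from this thread:

```python
# Hypothetical helper: greedily pack mutation payload sizes (bytes) into
# batches that stay under Cassandra's 5 KB batch-size warn threshold.
# An oversized value (e.g. a 100 KB blob) still goes out, but alone, so
# it is the value itself and not the batching that exceeds the limit.

WARN_THRESHOLD_BYTES = 5120  # batch_size_warn_threshold_in_kb default (5 KB)

def chunk_mutations(payload_sizes, threshold=WARN_THRESHOLD_BYTES):
    """Group payload sizes into batches whose sums stay under threshold."""
    batches, current, current_size = [], [], 0
    for size in payload_sizes:
        # Start a new batch when adding this payload would cross the limit.
        if current and current_size + size > threshold:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

print(chunk_mutations([2000, 2000, 2000, 100000, 500]))
# → [[2000, 2000], [2000], [100000], [500]]
```

Note this only silences the size warning; a logged batch still writes a batchlog copy to replica nodes first, so for writes to unrelated partitions, separate asynchronous inserts are usually cheaper than one large logged batch.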
On Fri, Jun 20, 2014 at 2:13 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> I think some figures from "nodetool tpstats" and "nodetool compactionstats"
> may help to see things more clearly.
>
> And Pavel, when you said batch, did you mean a LOGGED batch or an UNLOGGED
> batch?
>
> On Fri, Jun 20, 2014 at 8:02 PM, Marcelo Elias Del Valle
> <marc...@s1mbi0se.com.br> wrote:
>
>> If you have 32 GB RAM, the heap is probably 8 GB.
>> 200 writes of 100 KB per second would be 20 MB/s in the worst case,
>> supposing all writes of a replica go to a single node.
>> I really don't see any reason why it should be filling up the heap.
>> Anyone else?
>>
>> But did you check the logs for the GCInspector?
>> In my case, nodes are falling because of the heap; in your case, maybe
>> it's something else.
>> Do you see increased times when looking for GCInspector in the logs?
>>
>> []s
>>
>> 2014-06-20 14:51 GMT-03:00 Pavel Kogan <pavel.ko...@cortica.com>:
>>
>>> Hi Marcelo,
>>>
>>> No pending write tasks. I am writing a lot: about 100-200 writes, each
>>> up to 100 KB, every 15 s.
>>> It is running on a decent cluster of 5 identical nodes: quad-core i7
>>> machines with 32 GB RAM and 480 GB SSDs.
>>>
>>> Regards,
>>> Pavel
>>>
>>> On Fri, Jun 20, 2014 at 12:31 PM, Marcelo Elias Del Valle
>>> <marc...@s1mbi0se.com.br> wrote:
>>>
>>>> Pavel,
>>>>
>>>> In my case, the heap was filling up faster than it was draining. I am
>>>> still looking for the cause, as I could drain really fast with SSD.
>>>>
>>>> However, in your case you could check (AFAIK) nodetool tpstats and see
>>>> whether there are too many pending write tasks, for instance. Maybe
>>>> you really are writing more than the nodes are able to flush to disk.
>>>>
>>>> How many writes per second are you achieving?
>>>>
>>>> Also, I would look for GCInspector in the log:
>>>>
>>>> cat system.log* | grep GCInspector | wc -l
>>>> tail -1000 system.log | grep GCInspector
>>>>
>>>> Do you see it running a lot?
>>>> Is it taking much more time each time it runs?
>>>>
>>>> I am no Cassandra expert, but I would try these things first and post
>>>> the results here. Maybe other people on the list have more ideas.
>>>>
>>>> Best regards,
>>>> Marcelo.
>>>>
>>>> 2014-06-20 8:50 GMT-03:00 Pavel Kogan <pavel.ko...@cortica.com>:
>>>>
>>>>> The cluster is new, so no updates were done. Version 2.0.8.
>>>>> It happened when I did many writes (no reads). Writes are done in
>>>>> small batches of 2 inserts (writing to 2 column families). The values
>>>>> are big blobs (up to 100 KB).
>>>>>
>>>>> Any clues?
>>>>>
>>>>> Pavel
>>>>>
>>>>> On Thu, Jun 19, 2014 at 8:07 PM, Marcelo Elias Del Valle
>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>
>>>>>> Pavel,
>>>>>>
>>>>>> Out of curiosity, did it start to happen after some update? Which
>>>>>> version of Cassandra are you using?
>>>>>>
>>>>>> []s
>>>>>>
>>>>>> 2014-06-19 16:10 GMT-03:00 Pavel Kogan <pavel.ko...@cortica.com>:
>>>>>>
>>>>>>> What a coincidence! It happened in my cluster of 7 nodes today as
>>>>>>> well.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Pavel
>>>>>>>
>>>>>>> On Wed, Jun 18, 2014 at 11:13 AM, Marcelo Elias Del Valle
>>>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>
>>>>>>>> I have a 10-node cluster with Cassandra 2.0.8.
>>>>>>>>
>>>>>>>> I am getting these warnings in the log when I run my code. What my
>>>>>>>> code does is just read data from a CF, and in some cases it writes
>>>>>>>> new data.
>>>>>>>>
>>>>>>>> WARN [Native-Transport-Requests:553] 2014-06-18 11:04:51,391
>>>>>>>> BatchStatement.java (line 228) Batch of prepared statements for
>>>>>>>> [identification1.entity, identification1.entity_lookup] is of size
>>>>>>>> 6165, exceeding specified threshold of 5120 by 1045.
>>>>>>>> WARN [Native-Transport-Requests:583] 2014-06-18 11:05:01,152
>>>>>>>> BatchStatement.java (line 228) Batch of prepared statements for
>>>>>>>> [identification1.entity, identification1.entity_lookup] is of size
>>>>>>>> 21266, exceeding specified threshold of 5120 by 16146.
>>>>>>>> WARN [Native-Transport-Requests:581] 2014-06-18 11:05:20,229
>>>>>>>> BatchStatement.java (line 228) Batch of prepared statements for
>>>>>>>> [identification1.entity, identification1.entity_lookup] is of size
>>>>>>>> 22978, exceeding specified threshold of 5120 by 17858.
>>>>>>>> INFO [MemoryMeter:1] 2014-06-18 11:05:32,682 Memtable.java (line
>>>>>>>> 481) CFS(Keyspace='OpsCenter', ColumnFamily='rollups300') liveRatio
>>>>>>>> is 14.249755859375 (just-counted was 9.85302734375). calculation
>>>>>>>> took 3ms for 1024 cells
>>>>>>>>
>>>>>>>> After some time, one node of the cluster goes down. It comes back
>>>>>>>> after some seconds and then another node goes down. It keeps
>>>>>>>> happening, so there is always a node down in the cluster; when one
>>>>>>>> comes back, another one falls.
>>>>>>>>
>>>>>>>> The only exception I see in the log is "connection reset by peer",
>>>>>>>> which seems to be related to the gossip protocol, when a node goes
>>>>>>>> down.
>>>>>>>>
>>>>>>>> Any hint of what I could do to investigate this problem further?
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Marcelo Valle.
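As a footnote to Marcelo's GCInspector greps above: it helps to extract the actual pause durations, not just count lines. A sketch run against synthetic sample lines — the log format is approximated from Cassandra 2.0, and on a real node you would read system.log instead of the embedded string:

```python
import re

# Two hypothetical GCInspector lines, format approximated from a
# Cassandra 2.0 system.log; replace `sample` with the real file contents.
sample = """\
 INFO [ScheduledTasks:1] 2014-06-20 14:00:01,123 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2121 ms for 1 collections, 1087346368 used; max is 8375238656
 INFO [ScheduledTasks:1] 2014-06-20 14:00:31,456 GCInspector.java (line 116) GC for ParNew: 312 ms for 2 collections, 987346368 used; max is 8375238656
"""

# Capture the collector name and the pause duration in milliseconds.
pattern = re.compile(r"GC for (\w+): (\d+) ms")

pauses = []
for line in sample.splitlines():
    m = pattern.search(line)
    if m:
        pauses.append((m.group(1), int(m.group(2))))

print(pauses)
# → [('ConcurrentMarkSweep', 2121), ('ParNew', 312)]
```

If the durations trend upward over time, that points at growing heap pressure rather than a one-off pause.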