Re: GC freeze just after repair session
Our young generation is 800 MB with SurvivorRatio=8, so eden is 640 MB. All objects/bytes generated during compaction are garbage, right?

During compaction, with in_memory_compaction_limit=64MB and concurrent_compactors=8, there is a lot of pressure on ParNew sweeps. I was thinking of decreasing concurrent_compactors and in_memory_compaction_limit to go easy on GC.

I am not familiar with the inner workings of Cassandra, but I hope I have diagnosed the problem to some extent.

On Fri, Jul 6, 2012 at 11:27 AM, rohit bhatia wrote:
> @ravi, u can increase young gen size, keep a high tenuring rate or
> increase survivor ratio..
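A note on the young generation numbers in this message: in HotSpot, SurvivorRatio is the ratio of eden to one survivor space, so an 800 MB young generation with SurvivorRatio=8 splits into ten equal parts. The sketch below only redoes that arithmetic. The function name is made up for illustration, and the final figure is a rough upper bound that assumes every compactor holds one row at the in-memory limit at the same time, which may not match what Cassandra actually does.

```python
def young_gen_layout(young_mb, survivor_ratio):
    """Split the young generation into eden and two survivor spaces.

    SurvivorRatio is eden : one survivor space, so the young generation
    is divided into (survivor_ratio + 2) equal parts.
    """
    part = young_mb / (survivor_ratio + 2)
    return {"eden_mb": survivor_ratio * part, "survivor_mb_each": part}


layout = young_gen_layout(800, 8)
print(layout)

# Assumption for illustration only: 8 concurrent compactors, each holding one
# row at the 64 MB in-memory limit, compared with the eden size.
worst_case_rows_mb = 8 * 64
print(worst_case_rows_mb, "MB of in-memory rows vs", layout["eden_mb"], "MB of eden")
```

Under that assumption the in-memory rows alone could approach the size of eden, which is consistent with the idea of lowering concurrent_compactors or in_memory_compaction_limit.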
Re: GC freeze just after repair session
@ravi, you can increase the young gen size, keep a high tenuring rate or increase the survivor ratio.

On Fri, Jul 6, 2012 at 4:03 AM, aaron morton wrote:
> Ideally we would like to collect maximum garbage from ParNew itself, during
> compactions. What are the steps to take towards to achieving this?
>
> I'm not sure what you are asking.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 5/07/2012, at 6:56 PM, Ravikumar Govindarajan wrote:
>
> We have modified maxTenuringThreshold from 1 to 5. Maybe it is causing
> problems. Will change it back to 1 and see how the system is.
>
> concurrent_compactors=8. We will reduce this, as anyway our system won't be
> able to handle this number of compactions at the same time. Think it will
> ease GC also to some extent.
>
> Ideally we would like to collect maximum garbage from ParNew itself, during
> compactions. What are the steps to take towards to achieving this?
>
> On Wed, Jul 4, 2012 at 4:07 PM, aaron morton wrote:
>>
>> It *may* have been compaction from the repair, but it's not a big CF.
>>
>> I would look at the logs to see how much data was transferred to the node.
>> Was there a compaction going on while the GC storm was happening? Do you
>> have a lot of secondary indexes?
>>
>> If you think it correlated to compaction you can try reducing the
>> concurrent_compactors
>>
>> Cheers
>>
>> -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 3/07/2012, at 6:33 PM, Ravikumar Govindarajan wrote:
>>
>> Recently, we faced a severe freeze [around 30-40 mins] on one of our
>> servers. There were many mutations/reads dropped. The issue happened just
>> after a routine nodetool repair for the below CF completed [1.0.7, NTS,
>> DC1:3,DC2:2]
>>
>> Column Family: MsgIrtConv
>> SSTable count: 12
>> Space used (live): 17426379140
>> Space used (total): 17426379140
>> Number of Keys (estimate): 122624
>> Memtable Columns Count: 31180
>> Memtable Data Size: 81950175
>> Memtable Switch Count: 31
>> Read Count: 8074156
>> Read Latency: 15.743 ms.
>> Write Count: 2172404
>> Write Latency: 0.037 ms.
>> Pending Tasks: 0
>> Bloom Filter False Postives: 1258
>> Bloom Filter False Ratio: 0.03598
>> Bloom Filter Space Used: 498672
>> Key cache capacity: 20
>> Key cache size: 20
>> Key cache hit rate: 0.9965579513062582
>> Row cache: disabled
>> Compacted row minimum size: 51
>> Compacted row maximum size: 89970660
>> Compacted row mean size: 226626
>>
>> Our heap config is as follows
>>
>> -Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC
>> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
>> -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=75
>> -XX:+UseCMSInitiatingOccupancyOnly
>>
>> from yaml
>> in_memory_compaction_limit=64
>> compaction_throughput_mb_sec=8
>> multi_threaded_compaction=false
>>
>> INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 762) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] MsgIrtConv is fully synced
>> INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 698) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] session completed successfully
>> INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219 CompactionTask.java (line 221) Compacted to [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,]. 47,907,012 to 40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s. Time: 6,186ms.
>>
>> After this, the logs were fully filled with GC [ParNew/CMS]. ParNew ran
>> every 3 seconds, while CMS ran approximately every 30 seconds, continuously
>> for 40 minutes.
>>
>> INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line 122) GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is 8506048512
>> INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line 122) GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is 8506048512
>>
>> .
>>
>> INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line 122) GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is 8506048512
>> INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line 122) GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is 8506048512
>> INFO [ScheduledTasks:1] 2012-06-29 10:07:57,773 GCInspector.java (line 122) GC for ParNew: 1444 ms for 3 collections, 4234372296 used; max is 8506048512
>> INFO [ScheduledTasks:1] 2012-06-29 10:07:59,387 GCInspector.java (line 122) GC for ParNew: 1153 ms for 2 collections, 4910279080 used; max is 8506048512
>> INFO [ScheduledTasks:1] 2012-06-29 10:08:00,389 GCInspector.java (line 122) GC for ParNew: 697 ms for 2 collections, 4873857072 used; max is 8506048512
>> INFO [ScheduledTasks:1] 2012-06-29 1
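To put numbers on a GC storm like the one in these logs, the GCInspector lines can be parsed and compared against wall-clock time. This is a minimal sketch that uses two lines copied from the excerpt in this thread; the helper name and regular expressions are illustrative, not part of Cassandra. It assumes that the pause reported on a line happened in the interval since the previous GCInspector line, which is an approximation.

```python
import re
from datetime import datetime

# Two GCInspector lines copied from the log excerpt quoted in this thread.
LOG_LINES = [
    "INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line 122) "
    "GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is 8506048512",
    "INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line 122) "
    "GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is 8506048512",
]

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<count>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)


def parse_gc_line(line):
    """Extract timestamp, pause time and heap usage from one GCInspector line."""
    ts = TIMESTAMP.search(line)
    gc = GC_LINE.search(line)
    if ts is None or gc is None:
        return None
    return {
        "time": datetime.strptime(ts.group(0), "%Y-%m-%d %H:%M:%S,%f"),
        "collector": gc.group("collector"),
        "pause_ms": int(gc.group("ms")),
        "collections": int(gc.group("count")),
        "heap_used_fraction": int(gc.group("used")) / int(gc.group("max")),
    }


first, second = (parse_gc_line(line) for line in LOG_LINES)
interval_s = (second["time"] - first["time"]).total_seconds()
# Share of the interval between the two log lines spent paused in ParNew.
pause_fraction = second["pause_ms"] / (interval_s * 1000)
print(f"interval {interval_s:.3f}s, ParNew pause share {pause_fraction:.0%}")
print(f"heap used: {first['heap_used_fraction']:.0%} -> {second['heap_used_fraction']:.0%}")
```

Run over a whole system.log, the same parsing gives a quick view of how much of each interval was spent in stop-the-world collections and how fast the heap was filling.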
Multiple keyspace question
Good evening,

I have read in a few discussions that multiple keyspaces are bad, but to what extent?

We have some reasonably powerful machines and are looking to host 2 additional keyspaces (we currently have 1) within our Cassandra cluster (3 nodes, RF 3).

At what point does adding extra keyspaces start becoming an issue? Is there anything special we should be considering or watching out for as we implement this?

I cannot imagine that all Cassandra users out there are running one massive keyspace, and at the same time I cannot imagine that all Cassandra users have multiple clusters just to host different keyspaces.

Regards.

--
-Ben
Re: Finding bottleneck of a cluster
On Fri, Jul 6, 2012 at 9:44 AM, rohit bhatia wrote:
> On Fri, Jul 6, 2012 at 4:47 AM, aaron morton wrote:
>> The os load is around 16-20 and the average write latency is 3ms.
>> tpstats do not show any significant pending tasks.
>>
>> The node is overloaded. What is the write latency for a single thread doing
>> a single increment against a node that has no other traffic? The latency
>> for a request is the time spent working and the time spent waiting; once you
>> reach the max throughput the time spent waiting increases. The SEDA
>> architecture is designed to limit the time spent working.

The write latency I reported is the total latency of a client's request as reported by DataStax OpsCenter. Its minimum is 0.5 ms. In contrast, the "local write request latency" reported by cfstats is around 50 microseconds, but it jumps to 150 microseconds during the crash.
Re: Finding bottleneck of a cluster
On Fri, Jul 6, 2012 at 4:47 AM, aaron morton wrote:
> 12G Heap,
> 1600Mb Young gen,
>
> Is a bit higher than the normal recommendation. 1600MB young gen can cause
> some extra ParNew pauses.

Thanks for the heads up, I'll try tinkering with this.

> 128 Concurrent writer
> threads
>
> Unless you are on SSD this is too many.

I mean concurrent_writes,
http://www.datastax.com/docs/0.8/configuration/node_configuration#concurrent-writes
; this is not the memtable flush queue writers. The suggested value is 8 * number of cores (16) = 128 itself.

> Work out the latency for a single client single node, then start adding
> replication, nodes and load. When the latency increases you are getting to
> the max throughput for that config.

Also, as mentioned in my second mail, I am seeing messages like "Total time for which application threads were stopped: 16.7663710 seconds". If something pauses for this long, it might be overwhelmed by the hints stored at other nodes. This can further cause the node to wait on/drop a lot of client connection threads. I'll look into what is causing these non-GC pauses. Thanks for the help.
Re: Composite Slice Query returning non-sliced data
Hi Aaron,

It is

create column family CF
  with comparator = 'CompositeType(org.apache.cassandra.db.marshal.Int32Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)'
  and key_validation_class = UTF8Type
  and default_validation_class = UTF8Type;

This is allowing me to insert column names of different types.

Thanks,
Sunit.

On Thu, Jul 5, 2012 at 4:24 PM, aaron morton wrote:
> #2 has the Composite Column and #1 does not.
>
> They are both strings.
>
> All column names *must* be of the same type. What was your CF definition ?
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 6/07/2012, at 7:26 AM, Sunit Randhawa wrote:
>
> Hello,
>
> I have 2 Columns for a 'RowKey' as below:
>
> #1 : set CF['RowKey']['1000']='A=1,B=2';
> #2: set CF['RowKey']['1000:C1']='A=2,B=3';
>
> #2 has the Composite Column and #1 does not.
>
> Now when I execute the Composite Slice query by 1000 and C1, I do get
> both the columns above.
>
> I am hoping get #2 only since I am specifically providing "C1" as
> Start and Finish Composite Range with
> Composite.ComponentEquality.EQUAL.
>
> I am not sure if this is by design.
>
> Thanks,
> Sunit.
Re: Composite Slice Query returning non-sliced data
> #2 has the Composite Column and #1 does not.

They are both strings.

All column names *must* be of the same type. What was your CF definition?

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/07/2012, at 7:26 AM, Sunit Randhawa wrote:

> Hello,
>
> I have 2 Columns for a 'RowKey' as below:
>
> #1 : set CF['RowKey']['1000']='A=1,B=2';
> #2: set CF['RowKey']['1000:C1']='A=2,B=3';
>
> #2 has the Composite Column and #1 does not.
>
> Now when I execute the Composite Slice query by 1000 and C1, I do get
> both the columns above.
>
> I am hoping get #2 only since I am specifically providing "C1" as
> Start and Finish Composite Range with
> Composite.ComponentEquality.EQUAL.
>
> I am not sure if this is by design.
>
> Thanks,
> Sunit.
Re: batch_mutate
> Does it mean that the popular use case is when we need to update multiple
> column families using the same key?

Yes.

> Shouldn’t we design our space in such a way that those columns live in the
> same column family?

Design a model where the data for common queries is stored in one row+cf. You can also take the workload into consideration, e.g. things that are updated frequently often live together, and things that are updated infrequently often live together.

cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/07/2012, at 3:16 AM, Leonid Ilyevsky wrote:

> I actually found an answer to my first question at
> http://wiki.apache.org/cassandra/API. So I got it wrong: actually the outer
> key is the key in the table, and the inner key is the table name (this was
> somewhat counter-intuitive). Does it mean that the popular use case is when
> we need to update multiple column families using the same key? Shouldn’t we
> design our space in such a way that those columns live in the same column
> family?
>
> From: Leonid Ilyevsky [mailto:lilyev...@mooncapital.com]
> Sent: Thursday, July 05, 2012 10:39 AM
> To: 'user@cassandra.apache.org'
> Subject: batch_mutate
>
> My current way of inserting rows one by one is too slow (I use cql3 prepared
> statements), so I want to try batch_mutate.
>
> Could anybody give me more details about the interface? In the javadoc it
> says:
>
> public void batch_mutate(java.util.Map<java.nio.ByteBuffer, java.util.Map<java.lang.String, java.util.List<Mutation>>> mutation_map,
>                          ConsistencyLevel consistency_level)
>     throws InvalidRequestException,
>            UnavailableException,
>            TimedOutException,
>            org.apache.thrift.TException
> Description copied from interface: Cassandra.Iface
> Mutate many columns or super columns for many row keys. See also: Mutation.
> mutation_map maps key to column family to a list of Mutation objects to take
> place at that scope.
>
> I need to understand the meaning of the elements of the mutation_map parameter.
> My guess is, the key in the outer map is the columnfamily name, is this correct?
> The key in the inner map is, probably, a key to the columnfamily (it is
> somewhat confusing that it is String while the outer key is ByteBuffer, I
> wonder what the rationale is). If this is correct, how should I do it if my
> key is a composite one? Does anybody have an example?
>
> Thanks,
>
> Leonid
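The nesting described in this thread (row key, then column family name, then a list of mutations) can be sketched with plain Python dictionaries. This only illustrates the shape of mutation_map and does not talk to Cassandra. The Mutation tuple is a hypothetical stand-in for the Thrift struct, and the row key, column family names and values are invented for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for the Thrift Mutation struct; only the shape matters.
Mutation = namedtuple("Mutation", ["column_name", "value"])


def add_mutation(mutation_map, row_key, column_family, mutation):
    """File one mutation under row key -> column family name -> list."""
    by_column_family = mutation_map.setdefault(row_key, {})
    by_column_family.setdefault(column_family, []).append(mutation)
    return mutation_map


mutation_map = {}
# Outer key: the row key as bytes (ByteBuffer in the Java API).
# Inner key: the column family name as a string.
add_mutation(mutation_map, b"user42", "Profile", Mutation(b"name", b"Ann"))
add_mutation(mutation_map, b"user42", "Profile", Mutation(b"city", b"Oslo"))
add_mutation(mutation_map, b"user42", "Activity", Mutation(b"last_login", b"2012-07-05"))

for row_key, by_cf in mutation_map.items():
    for column_family, mutations in by_cf.items():
        print(row_key, column_family, len(mutations))
```

Seen this way, one call can carry changes for the same row key in several column families, which matches the use case confirmed in the reply above.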
Re: Finding bottleneck of a cluster
> 12G Heap,
> 1600Mb Young gen,

Is a bit higher than the normal recommendation. 1600MB young gen can cause some extra ParNew pauses.

> 128 Concurrent writer
> threads

Unless you are on SSD this is too many.

> 1) Is using JDK 1.7 any way detrimental to cassandra?

As far as I know it's not fully certified, thanks for trying it :)

> 2) What is the max write operation qps that should be expected. Is the
> netflix benchmark also applicable for counter incrmenting tasks?

Counters use a different write path than normal writes and are a bit slower.

To benchmark, get a single node and work out the max throughput. Then multiply by the number of nodes and divide by the RF to get a rough idea.

> the cpu
> idle time is around 30%, cassandra is not disk bound(insignificant
> read operations and cpu's iowait is around 0.05%)

Wait until compaction kicks in and handle all your inserts.

> The os load is around 16-20 and the average write latency is 3ms.
> tpstats do not show any significant pending tasks.

The node is overloaded. What is the write latency for a single thread doing a single increment against a node that has no other traffic? The latency for a request is the time spent working and the time spent waiting; once you reach the max throughput the time spent waiting increases. The SEDA architecture is designed to limit the time spent working.

> At this point suddenly, Several nodes start dropping several
> "Mutation" messages. There are also lots of pending

The cluster is overwhelmed.

> Almost all the new threads seem to be named
> "pool-2-thread-*".

These are client connection threads.

> My guess is that this might be due to the 128 Writer threads not being
> able to perform more writes.(

Yes.
https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L214

Work out the latency for a single client single node, then start adding replication, nodes and load. When the latency increases you are getting to the max throughput for that config.

Hope that helps

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/07/2012, at 6:49 PM, rohit bhatia wrote:

> Our Cassandra cluster consists of 8 nodes(16 core, 32G ram, 12G Heap,
> 1600Mb Young gen, cassandra1.0.5, JDK 1.7, 128 Concurrent writer
> threads). The replication factor is 2 with 10 column families and we
> service Counter incrementing write intensive tasks(CL=ONE).
>
> I am trying to figure out the bottleneck,
>
> 1) Is using JDK 1.7 any way detrimental to cassandra?
>
> 2) What is the max write operation qps that should be expected. Is the
> netflix benchmark also applicable for counter incrmenting tasks?
>
> http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
>
> 3) At around 50,000qps for the cluster (~12500 qps per node), the cpu
> idle time is around 30%, cassandra is not disk bound(insignificant
> read operations and cpu's iowait is around 0.05%) and is not swapping
> its memory(around 15 gb RAM is free or inactive). The average gc pause
> time for parnew are 100ms occuring every second. So cassandra spends
> 10% of its time stuck in "Stop the world" collector.
> The os load is around 16-20 and the average write latency is 3ms.
> tpstats do not show any significant pending tasks.
>
> At this point suddenly, Several nodes start dropping several
> "Mutation" messages. There are also lots of pending
> MutationStage,replicateOnWriteStage tasks in tpstats.
> The number of threads in the java process increase to around 25,000
> from the usual 300-400. Almost all the new threads seem to be named
> "pool-2-thread-*".
> The OS load jumps to around 30-40, the "write request latency" starts
> spiking to more than 500ms (even to several tens of seconds sometime).
> Even the "Local write latency" increases fourfolds to 200 microseconds
> from 50 microseconds. This happens across all the nodes and in around
> 2-3 minutes.
> My guess is that this might be due to the 128 Writer threads not being
> able to perform more writes.(though with average local write latency
> of 100-150 micro seconds, each thread should be able to serve 10,000
> qps and with 128 writer threads, should be able to serve 1,280,000 qps
> per node)
> Could there be any other reason for this? What else should I monitor
> since system.log do not seem to say anything conclusive before
> dropping messages.
>
> Thanks
> Rohit
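The back-of-the-envelope figures in this thread can be reproduced in a few lines. This is only the arithmetic from the messages above; the function names are made up, and the last estimate is deliberately naive: it counts working time only and ignores queueing, GC pauses and the different write path that counters use, which is exactly the gap pointed out in the reply.

```python
def per_node_write_rate(cluster_qps, replication_factor, nodes):
    """Writes each node performs when every client write lands on RF replicas."""
    return cluster_qps * replication_factor / nodes


def rough_cluster_capacity(single_node_max_qps, nodes, replication_factor):
    """The rule of thumb from the reply: node max * number of nodes / RF."""
    return single_node_max_qps * nodes / replication_factor


def gc_pause_fraction(pause_ms, interval_ms):
    """Share of wall-clock time spent in stop-the-world pauses."""
    return pause_ms / interval_ms


def naive_thread_capacity(local_latency_us, threads):
    """Upper bound if threads did nothing but local writes (no waiting)."""
    return threads * (1_000_000 / local_latency_us)


print(per_node_write_rate(50_000, 2, 8))      # per-node writes at 50,000 qps, RF 2, 8 nodes
print(rough_cluster_capacity(12_500, 8, 2))   # back to the cluster-level figure
print(gc_pause_fraction(100, 1000))           # 100 ms ParNew pause every second
print(naive_thread_capacity(100, 128))        # 128 writer threads at 100 microseconds each
```

The distance between the naive 1,280,000 writes per second and the observed 12,500 per node suggests the limit is somewhere other than local write time, so measuring single-client, single-node latency first, as suggested above, is the more reliable starting point.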
Re: Composite key in thrift java api
> I would really prefer to do it in Cassandra itself, See https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/marshal/CompositeType.java Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 6/07/2012, at 10:40 AM, Leonid Ilyevsky wrote: > I need to create a ByteBuffer instance containing the proper composite key, > based on the values of the components of the key. I am going to use it for > update operation. > I tried to simply concatenate the buffers corresponding to the components, > but I am not sure this is correct, because I am getting exception that comes > from the server : > > InvalidRequestException(why:Not enough bytes to read value of component 0) > > In the server log I see this: > > org.apache.thrift.transport.TTransportException: Cannot read. Remote side has > closed. Tried to read 4 bytes, but only got 0 bytes. (This is often > indicative of an internal error on the server side. Please check your server > logs.) > > (I believe here when it says “server side” it actually means client, because > it is the server’s log). > > Seems like the buffer that my client sends is too short. I suspect there is > a way in thrift to do it properly, but I don’t know how. > Looks like Hector has a Composite class that maybe can help, but at this > point I would really prefer to do it in Cassandra itself, without Hector. > > Thanks! > > Leonid > > > This email, along with any attachments, is confidential and may be legally > privileged or otherwise protected from disclosure. Any unauthorized > dissemination, copying or use of the contents of this email is strictly > prohibited and may be in violation of law. If you are not the intended > recipient, any disclosure, copying, forwarding or distribution of this email > is strictly prohibited and this email and any attachments should be deleted > immediately. 
This email and any attachments do not constitute an offer to > sell or a solicitation of an offer to purchase any interest in any investment > vehicle sponsored by Moon Capital Management LP (“Moon Capital”). Moon > Capital does not provide legal, accounting or tax advice. Any statement > regarding legal, accounting or tax matters was not intended or written to be > relied upon by any person as advice. Moon Capital does not waive > confidentiality or privilege as a result of this email.
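For readers following the CompositeType.java pointer in the reply above: the class comment there describes each component as a 2-byte length, then the value bytes, then a single end-of-component byte. Plain concatenation leaves out the length prefix, which would be consistent with the "Not enough bytes to read value of component 0" error. Below is a minimal sketch under that assumption, using only java.nio. The class and helper names (CompositeKeySketch, pack, utf8, int32) are made up for illustration and are not Cassandra or Thrift API, and it assumes the column family really is declared with a CompositeType of matching component types.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch only: hypothetical helper, not part of Cassandra or Thrift.
class CompositeKeySketch {

    // Packs components as <2-byte length><value bytes><end-of-component byte>,
    // repeated once per component, following the CompositeType.java comment.
    static ByteBuffer pack(ByteBuffer... components) {
        int size = 0;
        for (ByteBuffer c : components) {
            if (c.remaining() > 0xFFFF) {
                throw new IllegalArgumentException("component longer than 65535 bytes");
            }
            size += 2 + c.remaining() + 1;
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        for (ByteBuffer c : components) {
            out.putShort((short) c.remaining()); // length, read back as unsigned 16-bit
            out.put(c.duplicate());              // duplicate() keeps the caller's position intact
            out.put((byte) 0);                   // end-of-component: 0 for a plain value
        }
        out.flip();                              // make the buffer readable from the start
        return out;
    }

    // UTF-8 text component.
    static ByteBuffer utf8(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // 4-byte big-endian integer component.
    static ByteBuffer int32(int v) {
        ByteBuffer b = ByteBuffer.allocate(4);
        b.putInt(v);
        b.flip();
        return b;
    }

    public static void main(String[] args) {
        ByteBuffer key = pack(utf8("abc"), int32(42));
        System.out.println(key.remaining() + " bytes"); // prints "13 bytes"
    }
}
```

The resulting buffer is what would be passed wherever Thrift expects the composite value as a ByteBuffer. Non-zero end-of-component bytes are only relevant for slice bounds and are not covered here.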
Re: Upgrade for Cassandra 0.8.4 to 1.+
Consult the NEWS.txt file for help on upgrading:
https://github.com/apache/cassandra/blob/trunk/NEWS.txt

Cheers
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/07/2012, at 2:52 AM, rohit bhatia wrote:
> http://cassandra.apache.org/ says 1.1.2
>
> On Thu, Jul 5, 2012 at 7:46 PM, Raj N wrote:
>> Hi experts,
>> I am planning to upgrade from 0.8.4 to 1.+. What's the latest stable
>> version?
>>
>> Thanks
>> -Rajesh
Re: cassandra on re-Start
Sounds like this problem in 1.1.0: https://issues.apache.org/jira/browse/CASSANDRA-4219

Upgrade if you are on 1.1.0. If not, please paste the entire exception.

Cheers
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/07/2012, at 1:32 AM, puneet loya wrote:
> -- Forwarded message --
> From: Rob Coli
> Date: Mon, Jul 2, 2012 at 11:19 PM
> Subject: Re: cassandra on re-Start
> To: user@cassandra.apache.org
>
> On Mon, Jul 2, 2012 at 5:43 AM, puneet loya wrote:
> > When I restarted the system, it is showing the keyspace does not exist.
> >
> > Not even letting me create the keyspace with the same name again.
>
> Paste the error you get.
>
> =Rob
>
> --
> =Robert Coli
> AIM&GTALK - rc...@palominodb.com
> YAHOO - rcoli.palominob
> SKYPE - rcoli_palominodb
>
> The name of the keyspace is DA.
> On trying to create the keyspace it is giving an exception.
> I am getting a TTransport exception when creating the keyspace.
>
> The previous keyspace 'DA' that was created still exists: when I checked
> the folders, the folder named 'DA' is still there, but I cannot access it.
>
> Cheers,
> Puneet
Re: CorruptedBlockException
> But I don't understand, how was all the available space taken away. Take a look on disk at /var/lib/cassandra/data/ and /var/lib/cassandra/commitlog to see what is taking up a lot of space. Cassandra stores the column names as well as the values, so that can take up some space. > it says that while compaction a CorruptedBlockException has occured. Are you able to reproduce this error ? Thanks - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 6/07/2012, at 12:04 AM, Nury Redjepow wrote: > Hello to all, > > I have cassandra instance I'm trying to use to store millions of file with > size ~ 3MB. Data structure is simple, 1 row for 1 file, with row key being > the id of file. > I'm loaded 1GB of data, and total available space is 10GB. And after a few > hour, all the available space was taken. In log, it says that while > compaction a CorruptedBlockException has occured. But I don't understand, how > was all the available space taken away. > > Data structure > CREATE KEYSPACE largeobjectsWITH placement_strategy = 'SimpleStrategy' > AND strategy_options={replication_factor:1}; > > create column family content > with column_type = 'Standard' > and comparator = 'UTF8Type' > and default_validation_class = 'BytesType' > and key_validation_class = 'TimeUUIDType' > and read_repair_chance = 0.1 > and dclocal_read_repair_chance = 0.0 > and gc_grace = 864000 > and min_compaction_threshold = 4 > and max_compaction_threshold = 32 > and replicate_on_write = true > and compaction_strategy = 'SizeTieredCompactionStrategy' > and caching = 'keys_only'; > > > Log messages > > INFO [FlushWriter:9] 2012-07-04 19:56:00,783 Memtable.java (line 266) Writing > Memtable-content@240294142(3955135/49439187 serialized/live bytes, 91 ops) > INFO [FlushWriter:9] 2012-07-04 19:56:00,814 Memtable.java (line 307) > Completed flushing > /var/lib/cassandra/data/largeobjects/content/largeobjects-content-h > d-1608-Data.db (1991862 bytes) for commitlog position > 
ReplayPosition(segmentId=24245436475633, position=78253718) > INFO [OptionalTasks:1] 2012-07-04 19:56:02,784 MeteredFlusher.java (line 62) > flushing high-traffic column family CFS(Keyspace='largeobjects', > ColumnFamily=' > content') (estimated 46971537 bytes) > INFO [OptionalTasks:1] 2012-07-04 19:56:02,784 ColumnFamilyStore.java (line > 633) Enqueuing flush of Memtable-content@1755783901(3757723/46971537 > serialized/ > live bytes, 121 ops) > INFO [FlushWriter:9] 2012-07-04 19:56:02,785 Memtable.java (line 266) Writing > Memtable-content@1755783901(3757723/46971537 serialized/live bytes, 121 ops) > INFO [FlushWriter:9] 2012-07-04 19:56:02,835 Memtable.java (line 307) > Completed flushing > /var/lib/cassandra/data/largeobjects/content/largeobjects-content-h > d-1609-Data.db (1894897 bytes) for commitlog position > ReplayPosition(segmentId=24245436475633, position=82028986) > INFO [OptionalTasks:1] 2012-07-04 19:56:04,785 MeteredFlusher.java (line 62) > flushing high-traffic column family CFS(Keyspace='largeobjects', > ColumnFamily=' > content') (estimated 56971025 bytes) > INFO [OptionalTasks:1] 2012-07-04 19:56:04,785 ColumnFamilyStore.java (line > 633) Enqueuing flush of Memtable-content@1441175031(4557682/56971025 > serialized/ > live bytes, 124 ops) > INFO [FlushWriter:9] 2012-07-04 19:56:04,786 Memtable.java (line 266) Writing > Memtable-content@1441175031(4557682/56971025 serialized/live bytes, 124 ops) > INFO [FlushWriter:9] 2012-07-04 19:56:04,814 Memtable.java (line 307) > Completed flushing > /var/lib/cassandra/data/largeobjects/content/largeobjects-content-h > d-1610-Data.db (2287280 bytes) for commitlog position > ReplayPosition(segmentId=24245436475633, position=86604648) > INFO [CompactionExecutor:39] 2012-07-04 19:56:04,815 CompactionTask.java > (line 109) Compacting > [SSTableReader(path='/var/lib/cassandra/data/largeobjects/con > tent/largeobjects-content-hd-1610-Data.db'), > 
SSTableReader(path='/var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1608-Data.db'), > SSTable > Reader(path='/var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1609-Data.db'), > SSTableReader(path='/var/lib/cassandra/data/largeobjects/co > ntent/largeobjects-content-hd-1607-Data.db')] > INFO [OptionalTasks:1] 2012-07-04 19:56:05,786 MeteredFlusher.java (line 62) > flushing high-traffic column family CFS(Keyspace='largeobjects', > ColumnFamily=' > content') (estimated 28300225 bytes) > INFO [OptionalTasks:1] 2012-07-04 19:56:05,786 ColumnFamilyStore.java (line > 633) Enqueuing flush of Memtable-content@1828084851(2264018/28300225 > serialized/ > live bytes, 38 ops) > INFO [FlushWriter:9] 2012-07-04 19:56:05,787 Memtable.java (line 266) Writing > Memtable-content@1828084851(2264018/28300225 serialized/live bytes, 38 ops) > INFO [FlushWriter:9] 2012-07-04 19:56:05,823 Memtable.java (line 307) > Completed flushing > /var/lib/cassandr
Composite key in thrift java api
I need to create a ByteBuffer instance containing the proper composite key, based on the values of the components of the key. I am going to use it for update operation. I tried to simply concatenate the buffers corresponding to the components, but I am not sure this is correct, because I am getting exception that comes from the server : InvalidRequestException(why:Not enough bytes to read value of component 0) In the server log I see this: org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is often indicative of an internal error on the server side. Please check your server logs.) (I believe here when it says "server side" it actually means client, because it is the server's log). Seems like the buffer that my client sends is too short. I suspect there is a way in thrift to do it properly, but I don't know how. Looks like Hector has a Composite class that maybe can help, but at this point I would really prefer to do it in Cassandra itself, without Hector. Thanks! Leonid This email, along with any attachments, is confidential and may be legally privileged or otherwise protected from disclosure. Any unauthorized dissemination, copying or use of the contents of this email is strictly prohibited and may be in violation of law. If you are not the intended recipient, any disclosure, copying, forwarding or distribution of this email is strictly prohibited and this email and any attachments should be deleted immediately. This email and any attachments do not constitute an offer to sell or a solicitation of an offer to purchase any interest in any investment vehicle sponsored by Moon Capital Management LP ("Moon Capital"). Moon Capital does not provide legal, accounting or tax advice. Any statement regarding legal, accounting or tax matters was not intended or written to be relied upon by any person as advice. 
Moon Capital does not waive confidentiality or privilege as a result of this email.
Re: steps to add node in cassandra 1.1.2
1.1 docs for the same:
http://www.datastax.com/docs/1.1/operations/cluster_management

Cheers
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/07/2012, at 9:17 PM, prasenjit mukherjee wrote:
> I am using Cassandra version 1.1.2. I found the document on adding a node for
> version 0.7: http://www.datastax.com/docs/0.7/getting_started/configuring
>
> Is it still valid? Is there documentation on this topic in the
> cassandra twiki/docs?
>
> -Prasenjit
Re: GC freeze just after repair session
> Ideally we would like to collect maximum garbage from ParNew itself, during > compactions. What are the steps to take towards to achieving this? I'm not sure what you are asking. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 5/07/2012, at 6:56 PM, Ravikumar Govindarajan wrote: > We have modified maxTenuringThreshold from 1 to 5. May be it is causing > problems. Will change it back to 1 and see how the system is. > > concurrent_compactors=8. We will reduce this, as anyway our system won't be > able to handle this number of compactions at the same time. Think it will > ease GC also to some extent. > > Ideally we would like to collect maximum garbage from ParNew itself, during > compactions. What are the steps to take towards to achieving this? > > On Wed, Jul 4, 2012 at 4:07 PM, aaron morton wrote: > It *may* have been compaction from the repair, but it's not a big CF. > > I would look at the logs to see how much data was transferred to the node. > Was their a compaction going on while the GC storm was happening ? Do you > have a lot of secondary indexes ? > > If you think it correlated to compaction you can try reducing the > concurrent_compactors > > Cheers > > - > Aaron Morton > Freelance Developer > @aaronmorton > http://www.thelastpickle.com > > On 3/07/2012, at 6:33 PM, Ravikumar Govindarajan wrote: > >> Recently, we faced a severe freeze [around 30-40 mins] on one of our >> servers. There were many mutations/reads dropped. The issue happened just >> after a routine nodetool repair for the below CF completed [1.0.7, NTS, >> DC1:3,DC2:2] >> >> Column Family: MsgIrtConv >> SSTable count: 12 >> Space used (live): 17426379140 >> Space used (total): 17426379140 >> Number of Keys (estimate): 122624 >> Memtable Columns Count: 31180 >> Memtable Data Size: 81950175 >> Memtable Switch Count: 31 >> Read Count: 8074156 >> Read Latency: 15.743 ms. >> Write Count: 2172404 >> Write Latency: 0.037 ms. 
>> Pending Tasks: 0 >> Bloom Filter False Postives: 1258 >> Bloom Filter False Ratio: 0.03598 >> Bloom Filter Space Used: 498672 >> Key cache capacity: 20 >> Key cache size: 20 >> Key cache hit rate: 0.9965579513062582 >> Row cache: disabled >> Compacted row minimum size: 51 >> Compacted row maximum size: 89970660 >> Compacted row mean size: 226626 >> >> >> Our heap config is as follows >> >> -Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC >> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 >> -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=75 >> -XX:+UseCMSInitiatingOccupancyOnly >> >> from yaml >> in_memory_compaction_limit=64 >> compaction_throughput_mb_sec=8 >> multi_threaded_compaction=false >> >> INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085 AntiEntropyService.java >> (line 762) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] MsgIrtConv is >> fully synced >> INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085 >> AntiEntropyService.java (line 698) [repair >> #2b6fcbf0-c1f9-11e1--2ea8811bfbff] session completed successfully >> INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219 CompactionTask.java >> (line 221) Compacted to >> [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,]. 47,907,012 to >> 40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s. Time: >> 6,186ms. >> >> After this, the logs were fully filled with GC [ParNew/CMS]. ParNew ran for >> every 3 seconds, while CMS ran for every 30 seconds approx continuous for 40 >> minutes. >> >> INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line 122) >> GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is 8506048512 >> INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line 122) >> GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is 8506048512 >> >> . 
>> >> INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line 122) >> GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is 8506048512 >> INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line 122) >> GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is 8506048512 >> INFO [ScheduledTasks:1] 2012-06-29 10:07:57,773 GCInspector.java (line 122) >> GC for ParNew: 1444 ms for 3 collections, 4234372296 used; max is 8506048512 >> INFO [ScheduledTasks:1] 2012-06-29 10:07:59,387 GCInspector.java (line 122) >> GC for ParNew: 1153 ms for 2 collections, 4910279080 used; max is 8506048512 >> INFO [ScheduledTasks:1] 2012-06-29 10:08:00,389 GCInspector
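As a side note on the heap settings quoted in this thread (-Xmn800M with -XX:SurvivorRatio=8): in HotSpot, SurvivorRatio is the ratio of eden to one survivor space, and the young generation holds eden plus two survivor spaces. The sketch below just works through that arithmetic. The class name YoungGenSketch is hypothetical, and a real JVM rounds these sizes to its own alignment, so treat the numbers as approximate.

```java
// Sketch only: hypothetical helper that reproduces the young-generation arithmetic.
class YoungGenSketch {

    // young = eden + 2 * survivor and eden = ratio * survivor,
    // so one survivor space is young / (ratio + 2).
    static long survivorMb(long youngMb, int survivorRatio) {
        return youngMb / (survivorRatio + 2);
    }

    // Eden is whatever remains after the two survivor spaces.
    static long edenMb(long youngMb, int survivorRatio) {
        return youngMb - 2 * survivorMb(youngMb, survivorRatio);
    }

    public static void main(String[] args) {
        long young = 800; // -Xmn800M
        int ratio = 8;    // -XX:SurvivorRatio=8
        // prints "eden=640MB, each survivor=80MB"
        System.out.println("eden=" + edenMb(young, ratio) + "MB, each survivor="
                + survivorMb(young, ratio) + "MB");
    }
}
```

With these settings each survivor space is roughly 80 MB, so objects that are still live after a ParNew collection and do not fit there would be promoted to the old generation regardless of MaxTenuringThreshold.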
Re: Thrift version and OOM errors
Agree. It's a good idea to remove as many variables as possible and get to a stable/known state. Use a clean install and a well known client and see if the problems persist.

Cheers
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/07/2012, at 4:58 PM, Tristan Seligmann wrote:
> On Jul 4, 2012 2:02 PM, "Vasileios Vlachos" wrote:
> >
> > Any ideas what could be causing strange message lengths?
>
> One cause of this that I've seen is a client using unframed Thrift transport
> while the server expects framed, or vice versa. I suppose a similar cause
> could be something that is not a Thrift client at all mistakenly connecting
> to Cassandra's Thrift port.
Re: Enable CQL3 from Astyanax
Can you provide an example? select * should return all the columns from the CF.

Cheers
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/07/2012, at 4:31 AM, Thierry Templier wrote:
> Thanks Aaron.
>
> I wonder if it's possible to obtain columns from a CQL 3 select query (with a
> select *) that aren't defined in the create table. These fields are present
> when all attributes are loaded but not when using CQL3. Is it the normal
> behavior? Thanks very much!
>
> Thierry
>
>> Thanks for contributing.
>>
>> I'm behind the curve on CQL 3, but here is a post about some of the changes
>> http://www.datastax.com/dev/blog/whats-new-in-cql-3-0
>>
>> Cheers
>>
>> -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 28/06/2012, at 2:30 AM, Thierry Templier wrote:
>>
>>> Hello Aaron,
>>>
>>> I created an issue on the Astyanax github for this problem. I added a fix
>>> to support CQL3 in the tool.
>>> See the link https://github.com/Netflix/astyanax/issues/75.
>>>
>>> Thierry
>>> Had a quick look, the current master does not appear to support it. Cheers
CQL 3 with a right API
Hi,

I am new to Cassandra. We started with 1.1, modeled everything with composite columns and wide rows, and chose CQL 3 even though it is beta. Since I could not find a way in Hector to set CQL 3, I started with Thrift and prototyped all my scenarios with Thrift, including retrieving all row keys (without CQL). Recently I saw a JDBC driver for 1.1.1 and it is so promising (slightly slower than Thrift in most of my scenarios).

Apparently JDBC "will be" the ultimate Java API for Cassandra, so the question is: since there is no distinct clause in CQL 3, is there a way to retrieve all row keys "with JDBC" without browsing all columns of the CF (and making them distinct yourself)?

Thanks
Shahryar Sedghi
--
"Life is what happens while you are making other plans." ~ John Lennon
Composite Slice Query returning non-sliced data
Hello,

I have 2 columns for a 'RowKey' as below:

#1: set CF['RowKey']['1000']='A=1,B=2';
#2: set CF['RowKey']['1000:C1']='A=2,B=3';

#2 has the composite column and #1 does not. Now when I execute the composite slice query by 1000 and C1, I get both of the columns above. I was hoping to get #2 only, since I am specifically providing "C1" as the start and finish of the composite range with Composite.ComponentEquality.EQUAL.

I am not sure if this is by design.

Thanks,
Sunit.
JNA on Windows
Hello.

I have a question regarding JNA and Windows. I read that taking snapshots might require twice the process space because of how hard links are created. Is JNA supported on Windows? Jira issue https://issues.apache.org/jira/browse/CASSANDRA-1371 suggests it is, but in the Cassandra code base, in org.apache.cassandra.utils.CLibrary, the only thing I see is Native.register("c"), which tries to load the C library. I think that library doesn't exist on Windows, so links would be created with cmd or fsutil, which might then trigger these extensive memory requirements.

I'd be happy if someone could shed some light on this issue.

Regards
/Fredrik
Re: Wide rows and reads
From what I understand, wide rows have quite a bit of overhead, especially if you are picking columns that are far apart from each other for a given row.

This post by Aaron Morton was quite good at explaining this issue:
http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/

-Phil

On Thu, Jul 5, 2012 at 12:17 PM, Oleg Dulin wrote:
> Here is my flow:
>
> One process writes a really wide row (250K+ supercolumns, each one with 5
> subcolumns, for a total of 1K or so per supercolumn).
>
> A second process comes in literally 2-3 seconds later and starts reading
> from it.
>
> My observation is that nothing good happens. It is ridiculously slow to
> read. It seems that if I wait long enough, the reads from that row will be
> much faster.
>
> Could someone enlighten me as to what exactly happens when I do this?
>
> Regards,
> Oleg
Wide rows and reads
Here is my flow:

One process writes a really wide row (250K+ supercolumns, each one with 5 subcolumns, for a total of 1K or so per supercolumn).

A second process comes in literally 2-3 seconds later and starts reading from it.

My observation is that nothing good happens. It is ridiculously slow to read. It seems that if I wait long enough, the reads from that row will be much faster.

Could someone enlighten me as to what exactly happens when I do this?

Regards,
Oleg
RE: batch_mutate
I actually found an answer to my first question at http://wiki.apache.org/cassandra/API. So I got it wrong: actually the outer key is the key in the table, and the inner key is the table name (this was somewhat counter-intuitive).

Does it mean that the popular use case is when we need to update multiple column families using the same key? Shouldn't we design our space in such a way that those columns live in the same column family?

From: Leonid Ilyevsky [mailto:lilyev...@mooncapital.com]
Sent: Thursday, July 05, 2012 10:39 AM
To: 'user@cassandra.apache.org'
Subject: batch_mutate

My current way of inserting rows one by one is too slow (I use cql3 prepared statements), so I want to try batch_mutate.

Could anybody give me more details about the interface? In the javadoc it says:

public void batch_mutate(java.util.Map<java.nio.ByteBuffer,java.util.Map<java.lang.String,java.util.List<Mutation>>> mutation_map, ConsistencyLevel consistency_level) throws InvalidRequestException, UnavailableException, TimedOutException, org.apache.thrift.TException

Description copied from interface: Cassandra.Iface
Mutate many columns or super columns for many row keys. See also: Mutation. mutation_map maps key to column family to a list of Mutation objects to take place at that scope.

I need to understand the meaning of the elements of the mutation_map parameter. My guess is, the key in the outer map is the columnfamily name, is this correct? The key in the inner map is, probably, a key to the columnfamily (it is somewhat confusing that it is String while the outer key is ByteBuffer, I wonder what the rationale is). If this is correct, how should I do it if my key is a composite one? Does anybody have an example?

Thanks,
Leonid

This email, along with any attachments, is confidential and may be legally privileged or otherwise protected from disclosure. Any unauthorized dissemination, copying or use of the contents of this email is strictly prohibited and may be in violation of law. If you are not the intended recipient, any disclosure, copying, forwarding or distribution of this email is strictly prohibited and this email and any attachments should be deleted immediately. This email and any attachments do not constitute an offer to sell or a solicitation of an offer to purchase any interest in any investment vehicle sponsored by Moon Capital Management LP ("Moon Capital"). Moon Capital does not provide legal, accounting or tax advice. Any statement regarding legal, accounting or tax matters was not intended or written to be relied upon by any person as advice. Moon Capital does not waive confidentiality or privilege as a result of this email.
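To make the nesting described in this message concrete (outer key = row key as a ByteBuffer, inner key = column family name as a String, value = list of mutations), here is a minimal sketch of building such a map. It is an illustration only: the type parameter M stands in for the Thrift Mutation class so that the example runs without the Cassandra jars, and MutationMapSketch, add and key are hypothetical names, not Thrift API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: hypothetical helper showing the shape of mutation_map.
class MutationMapSketch {

    // Files one mutation under rowKey -> columnFamily, creating the
    // intermediate map and list on first use.
    static <M> void add(Map<ByteBuffer, Map<String, List<M>>> mutationMap,
                        ByteBuffer rowKey, String columnFamily, M mutation) {
        mutationMap
                .computeIfAbsent(rowKey, k -> new HashMap<>())
                .computeIfAbsent(columnFamily, cf -> new ArrayList<>())
                .add(mutation);
    }

    // Row key from UTF-8 text; a composite key would be its packed bytes instead.
    static ByteBuffer key(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // String is a placeholder for the real Mutation type.
        Map<ByteBuffer, Map<String, List<String>>> map = new HashMap<>();
        add(map, key("row1"), "cf1", "insert colA");
        add(map, key("row1"), "cf1", "insert colB");
        add(map, key("row1"), "cf2", "insert colC");
        add(map, key("row2"), "cf1", "insert colA");
        System.out.println(map.size() + " row keys"); // prints "2 row keys"
    }
}
```

ByteBuffer compares by remaining content, so two buffers wrapping the same bytes land on the same outer entry as long as their positions are left untouched while they are used as keys.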
Re: Upgrade for Cassandra 0.8.4 to 1.+
http://cassandra.apache.org/ says 1.1.2

On Thu, Jul 5, 2012 at 7:46 PM, Raj N wrote:
> Hi experts,
> I am planning to upgrade from 0.8.4 to 1.+. What's the latest stable
> version?
>
> Thanks
> -Rajesh
batch_mutate
My current way of inserting rows one by one is too slow (I use cql3 prepared statements), so I want to try batch_mutate.

Could anybody give me more details about the interface? In the javadoc it says:

public void batch_mutate(java.util.Map<java.nio.ByteBuffer,java.util.Map<java.lang.String,java.util.List<Mutation>>> mutation_map, ConsistencyLevel consistency_level) throws InvalidRequestException, UnavailableException, TimedOutException, org.apache.thrift.TException

Description copied from interface: Cassandra.Iface
Mutate many columns or super columns for many row keys. See also: Mutation. mutation_map maps key to column family to a list of Mutation objects to take place at that scope.

I need to understand the meaning of the elements of the mutation_map parameter. My guess is, the key in the outer map is the columnfamily name, is this correct? The key in the inner map is, probably, a key to the columnfamily (it is somewhat confusing that it is String while the outer key is ByteBuffer, I wonder what the rationale is). If this is correct, how should I do it if my key is a composite one? Does anybody have an example?

Thanks,
Leonid

This email, along with any attachments, is confidential and may be legally privileged or otherwise protected from disclosure. Any unauthorized dissemination, copying or use of the contents of this email is strictly prohibited and may be in violation of law. If you are not the intended recipient, any disclosure, copying, forwarding or distribution of this email is strictly prohibited and this email and any attachments should be deleted immediately. This email and any attachments do not constitute an offer to sell or a solicitation of an offer to purchase any interest in any investment vehicle sponsored by Moon Capital Management LP ("Moon Capital"). Moon Capital does not provide legal, accounting or tax advice. Any statement regarding legal, accounting or tax matters was not intended or written to be relied upon by any person as advice. Moon Capital does not waive confidentiality or privilege as a result of this email.
Upgrade for Cassandra 0.8.4 to 1.+
Hi experts, I am planning to upgrade from 0.8.4 to 1.+. What's the latest stable version? Thanks -Rajesh
Re: cassandra on re-Start
-- Forwarded message --
From: Rob Coli
Date: Mon, Jul 2, 2012 at 11:19 PM
Subject: Re: cassandra on re-Start
To: user@cassandra.apache.org

On Mon, Jul 2, 2012 at 5:43 AM, puneet loya wrote:
> When I restarted the system, it is showing the keyspace does not exist.
>
> Not even letting me create the keyspace with the same name again.

Paste the error you get.

=Rob

--
=Robert Coli
AIM&GTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb

The name of the keyspace is DA.
On trying to create the keyspace it is giving an exception.
I am getting a TTransport exception when creating the keyspace.

The previous keyspace 'DA' that was created still exists: when I checked the folders, the folder named 'DA' is still there, but I cannot access it.

Cheers,
Puneet
Re: Finding bottleneck of a cluster
Also, looking at the gc log, I see messages like this across different servers before they start dropping messages:

"2012-07-04T10:48:20.336+: 96771.117: [GC 96771.118: [ParNew: 1367297K->57371K(1474560K), 0.0617350 secs] 6641571K->5340088K(12419072K), 0.0634460 secs] [Times: user=0.56 sys=0.01, real=0.06 secs]
Total time for which application threads were stopped: 0.0850010 seconds
Total time for which application threads were stopped: 16.7663710 seconds"

The 16 second pause doesn't seem to be caused by the minor/major gc, which are quite fast and are also logged. The "Total time for which ..." messages are caused by the PrintGCApplicationStoppedTime parameter, which is supposed to log whenever threads reach a safepoint.

Is there any way I can figure out what caused the java threads to pause?

Thanks
Rohit

On Thu, Jul 5, 2012 at 12:19 PM, rohit bhatia wrote:
> Our Cassandra cluster consists of 8 nodes (16 core, 32G ram, 12G Heap,
> 1600Mb Young gen, cassandra 1.0.5, JDK 1.7, 128 concurrent writer
> threads). The replication factor is 2 with 10 column families and we
> service counter-incrementing, write-intensive tasks (CL=ONE).
>
> I am trying to figure out the bottleneck.
>
> 1) Is using JDK 1.7 in any way detrimental to cassandra?
>
> 2) What is the max write operation qps that should be expected? Is the
> netflix benchmark also applicable for counter incrementing tasks?
>
> http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
>
> 3) At around 50,000 qps for the cluster (~12500 qps per node), the cpu
> idle time is around 30%, cassandra is not disk bound (insignificant
> read operations and cpu's iowait is around 0.05%) and is not swapping
> its memory (around 15 gb RAM is free or inactive). The average gc pause
> time for parnew is 100ms, occurring every second. So cassandra spends
> 10% of its time stuck in the "Stop the world" collector.
> The os load is around 16-20 and the average write latency is 3ms.
> tpstats do not show any significant pending tasks.
>
> At this point, suddenly, several nodes start dropping several
> "Mutation" messages. There are also lots of pending
> MutationStage, replicateOnWriteStage tasks in tpstats.
> The number of threads in the java process increases to around 25,000
> from the usual 300-400. Almost all the new threads seem to be named
> "pool-2-thread-*".
> The OS load jumps to around 30-40, the "write request latency" starts
> spiking to more than 500ms (even to several tens of seconds sometimes).
> Even the "Local write latency" increases fourfold to 200 microseconds
> from 50 microseconds. This happens across all the nodes and in around
> 2-3 minutes.
> My guess is that this might be due to the 128 writer threads not being
> able to perform more writes (though with an average local write latency
> of 100-150 microseconds, each thread should be able to serve 10,000
> qps and with 128 writer threads, should be able to serve 1,280,000 qps
> per node).
> Could there be any other reason for this? What else should I monitor,
> since system.log does not seem to say anything conclusive before
> dropping messages?
>
> Thanks
> Rohit
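The back-of-envelope numbers in this message can be checked with a few lines of arithmetic. The sketch below is only that: a worked calculation under the message's own figures, with hypothetical names (WriteCapacitySketch and its methods). It assumes that the ~12,500 qps per node comes from each write landing on RF=2 replicas, which is one reading of the figures rather than something the message states, and that a writer thread never blocks, which is an idealised ceiling.

```java
// Sketch only: hypothetical helper reproducing the thread's estimates.
class WriteCapacitySketch {

    // Writes each node handles if every client write lands on `replicationFactor` nodes.
    static long writesPerNode(long clusterQps, int replicationFactor, int nodes) {
        return clusterQps * replicationFactor / nodes;
    }

    // Upper bound on operations per second for one thread at a given local latency.
    static long ceilingPerThread(long localLatencyMicros) {
        return 1_000_000L / localLatencyMicros;
    }

    // Share of wall-clock time spent paused, given one pause per interval.
    static double pauseFraction(long pauseMs, long intervalMs) {
        return (double) pauseMs / intervalMs;
    }

    public static void main(String[] args) {
        System.out.println(writesPerNode(50_000, 2, 8));   // prints 12500
        System.out.println(ceilingPerThread(100) * 128);    // prints 1280000
        System.out.println(pauseFraction(100, 1_000));      // prints 0.1
    }
}
```

The gap between the observed 12,500 writes per node and the 1,280,000 ceiling suggests the writer thread count on its own is unlikely to be the limit, which fits the question of what else to monitor.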
CorruptedBlockException
Hello to all,

I have a Cassandra instance that I'm trying to use to store millions of files of ~3MB each. The data structure is simple: 1 row per file, with the row key being the id of the file.
I loaded 1GB of data, and the total available space is 10GB. After a few hours, all the available space was taken. The log says that a CorruptedBlockException occurred during compaction, but I don't understand how all the available space was used up.

Data structure

CREATE KEYSPACE largeobjects WITH placement_strategy = 'SimpleStrategy'
AND strategy_options={replication_factor:1};

create column family content
with column_type = 'Standard'
and comparator = 'UTF8Type'
and default_validation_class = 'BytesType'
and key_validation_class = 'TimeUUIDType'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and gc_grace = 864000
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'SizeTieredCompactionStrategy'
and caching = 'keys_only';

Log messages

INFO [FlushWriter:9] 2012-07-04 19:56:00,783 Memtable.java (line 266) Writing Memtable-content@240294142(3955135/49439187 serialized/live bytes, 91 ops)
INFO [FlushWriter:9] 2012-07-04 19:56:00,814 Memtable.java (line 307) Completed flushing /var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1608-Data.db (1991862 bytes) for commitlog position ReplayPosition(segmentId=24245436475633, position=78253718)
INFO [OptionalTasks:1] 2012-07-04 19:56:02,784 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='largeobjects', ColumnFamily='content') (estimated 46971537 bytes)
INFO [OptionalTasks:1] 2012-07-04 19:56:02,784 ColumnFamilyStore.java (line 633) Enqueuing flush of Memtable-content@1755783901(3757723/46971537 serialized/live bytes, 121 ops)
INFO [FlushWriter:9] 2012-07-04 19:56:02,785 Memtable.java (line 266) Writing Memtable-content@1755783901(3757723/46971537 serialized/live bytes, 121 ops)
INFO [FlushWriter:9]
2012-07-04 19:56:02,835 Memtable.java (line 307) Completed flushing /var/lib/cassandra/data/largeobjects/content/largeobjects-content-h d-1609-Data.db (1894897 bytes) for commitlog position ReplayPosition(segmentId=24245436475633, position=82028986) INFO [OptionalTasks:1] 2012-07-04 19:56:04,785 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='largeobjects', ColumnFamily=' content') (estimated 56971025 bytes) INFO [OptionalTasks:1] 2012-07-04 19:56:04,785 ColumnFamilyStore.java (line 633) Enqueuing flush of Memtable-content@1441175031(4557682/56971025 serialized/ live bytes, 124 ops) INFO [FlushWriter:9] 2012-07-04 19:56:04,786 Memtable.java (line 266) Writing Memtable-content@1441175031(4557682/56971025 serialized/live bytes, 124 ops) INFO [FlushWriter:9] 2012-07-04 19:56:04,814 Memtable.java (line 307) Completed flushing /var/lib/cassandra/data/largeobjects/content/largeobjects-content-h d-1610-Data.db (2287280 bytes) for commitlog position ReplayPosition(segmentId=24245436475633, position=86604648) INFO [CompactionExecutor:39] 2012-07-04 19:56:04,815 CompactionTask.java (line 109) Compacting [SSTableReader(path='/var/lib/cassandra/data/largeobjects/con tent/largeobjects-content-hd-1610-Data.db'), SSTableReader(path='/var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1608-Data.db'), SSTable Reader(path='/var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1609-Data.db'), SSTableReader(path='/var/lib/cassandra/data/largeobjects/co ntent/largeobjects-content-hd-1607-Data.db')] INFO [OptionalTasks:1] 2012-07-04 19:56:05,786 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='largeobjects', ColumnFamily=' content') (estimated 28300225 bytes) INFO [OptionalTasks:1] 2012-07-04 19:56:05,786 ColumnFamilyStore.java (line 633) Enqueuing flush of Memtable-content@1828084851(2264018/28300225 serialized/ live bytes, 38 ops) INFO [FlushWriter:9] 2012-07-04 19:56:05,787 Memtable.java 
(line 266) Writing Memtable-content@1828084851(2264018/28300225 serialized/live bytes, 38 ops) INFO [FlushWriter:9] 2012-07-04 19:56:05,823 Memtable.java (line 307) Completed flushing /var/lib/cassandra/data/largeobjects/content/largeobjects-content-h d-1612-Data.db (1134604 bytes) for commitlog position ReplayPosition(segmentId=24245436475633, position=88874176) ERROR [CompactionExecutor:39] 2012-07-04 19:56:06,667 AbstractCassandraDaemon.java (line 134) Exception in thread Thread[CompactionExecutor:39,1,main] java.io.IOError: org.apache.cassandra.io.compress.CorruptedBlockException: (/var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1610-Data.db ): corruption detected, chunk at 1573104 of length 65545. at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116) at org.apache.cassandra.db.compaction.PrecompactedRow.(PrecompactedRow.java:99) at org.apache.cass
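
One possible first step in accounting for the disk usage is to total the sizes that the "Completed flushing" lines report. The following is only a minimal sketch: the function name `flushed_bytes` and the `SAMPLE` string are hypothetical, the sample lines are copied from the log excerpt above (with the wrapped file names rejoined), and the pattern assumes the log format shown in this post. It does not diagnose the CorruptedBlockException itself.

```python
import re

# Matches "Completed flushing <path> (<N> bytes)" as it appears in the excerpt.
# The non-greedy path group tolerates paths that were broken by line wrapping.
FLUSH_RE = re.compile(r"Completed flushing (.+?) \((\d+) bytes\)")


def flushed_bytes(log_text):
    """Return (sstable_path, size_in_bytes) for every completed flush in log_text."""
    return [(path, int(size)) for path, size in FLUSH_RE.findall(log_text)]


# Lines taken from the post's log; only the "Completed flushing" ones match.
SAMPLE = """\
INFO [FlushWriter:9] 2012-07-04 19:56:00,783 Memtable.java (line 266) Writing Memtable-content@240294142(3955135/49439187 serialized/live bytes, 91 ops)
INFO [FlushWriter:9] 2012-07-04 19:56:00,814 Memtable.java (line 307) Completed flushing /var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1608-Data.db (1991862 bytes) for commitlog position ReplayPosition(segmentId=24245436475633, position=78253718)
INFO [FlushWriter:9] 2012-07-04 19:56:02,835 Memtable.java (line 307) Completed flushing /var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1609-Data.db (1894897 bytes) for commitlog position ReplayPosition(segmentId=24245436475633, position=82028986)
INFO [FlushWriter:9] 2012-07-04 19:56:04,814 Memtable.java (line 307) Completed flushing /var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1610-Data.db (2287280 bytes) for commitlog position ReplayPosition(segmentId=24245436475633, position=86604648)
INFO [FlushWriter:9] 2012-07-04 19:56:05,823 Memtable.java (line 307) Completed flushing /var/lib/cassandra/data/largeobjects/content/largeobjects-content-hd-1612-Data.db (1134604 bytes) for commitlog position ReplayPosition(segmentId=24245436475633, position=88874176)
"""

if __name__ == "__main__":
    rows = flushed_bytes(SAMPLE)
    total = sum(size for _, size in rows)
    print(f"{len(rows)} flushes, {total} bytes")  # 4 flushes, 7308643 bytes
```

On the excerpt this gives four flushes of roughly 1 to 2 MB each within about five seconds. Running the same function over the full system log and comparing the total with the size of the data directory may help show whether the space is going to many small SSTables or to something else.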
London meetup - 16th July
The next London meetup is coming up on 16th July. We've got two speakers - Richard Churchill talking about his experiences rolling out Cassandra at ServiceTick and Tom Wilkie talking about real time analytics on top of Cassandra. http://www.meetup.com/Cassandra-London/events/69791362/ Dave
steps to add node in cassandra 1.1.2
I am using Cassandra version 1.1.2. I found the documentation on adding a node for version 0.7: http://www.datastax.com/docs/0.7/getting_started/configuring Is it still valid? Is there documentation on this topic in the Cassandra wiki/docs? -Prasenjit
Re: datastax aws ami
i did with no luck. i got my fire put out. for some reason one of my nodes upgraded itself after rebooting to fix the leap second bug. i use apt-get to put on 1.0.8. seeing that my cluster was running 1.0.7 i had to upgrade the rest of the nodes. upgrading was very simple: stop, apt-get install, start, one node at a time.

thanks,
deno

On 7/4/2012 3:18 AM, aaron morton wrote:
Try the DataStax forums http://www.datastax.com/support-forums/

Cheers
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/07/2012, at 7:28 AM, Deno Vichas wrote:
is the 2.1 image still around?

On 7/2/2012 11:24 AM, Deno Vichas wrote:
all, i've got a datastax 2.1 ami instance that's screwed up. for some reason it won't read the config file. what's the recommended way to replace this node with a new one? it doesn't seem like you can use the ami to bring up single nodes as it wants to do whole clusters.

thanks,
deno
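
The "stop, apt-get install, start, one node at a time" sequence from the reply could be laid out roughly as below. This is a sketch under assumptions and does not reproduce the exact commands deno ran: the host names, the `cassandra` service and package names, and the use of `service` and a pinned `package=version` install are all hypothetical choices. The function only builds the list of commands; it does not execute anything.

```python
def rolling_upgrade_plan(hosts, version, package="cassandra", service="cassandra"):
    """Build an ordered list of (host, command) steps for a one-node-at-a-time upgrade.

    All three steps for a host come before any step for the next host,
    so only one node is down at any point in the plan.
    """
    plan = []
    for host in hosts:
        plan.append((host, f"sudo service {service} stop"))                # stop
        plan.append((host, f"sudo apt-get install {package}={version}"))   # apt-get install
        plan.append((host, f"sudo service {service} start"))               # start
    return plan


if __name__ == "__main__":
    # Hypothetical host names; 1.0.8 is the version mentioned in the thread.
    for host, command in rolling_upgrade_plan(["node1", "node2", "node3"], "1.0.8"):
        print(f"{host}: {command}")
```

In practice one would probably wait for each node to come back up and rejoin the ring before moving on to the next host.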