column with TTL of 10 seconds lives very long...
Hi! I have a Cassandra cluster with 3 nodes running version 1.0.11. I am using Hector's HLockManagerImpl, which creates a keyspace named HLockManagerImpl and a CF HLocks. For some reason I have a row with a single column that should have expired yesterday but is still there. I tried deleting it using the cli, but it is stuck... Any ideas how to delete it? Thanks, Tamar Fraenkel, Senior Software Engineer, TOK Media, ta...@tok-media.com, Tel: +972 2 6409736, Mob: +972 54 8356490, Fax: +972 2 5612956
Re: column with TTL of 10 seconds lives very long...
Did you synchronize the clocks between the servers?
Re: column with TTL of 10 seconds lives very long...
Thanks for the response. Running date simultaneously on all nodes (using parallel ssh) shows that they are synced. Tamar

On Thu, May 23, 2013 at 12:29 PM, Nikolay Mihaylov n...@nmmm.nu wrote: Did you synchronize the clocks between the servers?
Re: column with TTL of 10 seconds lives very long...
This is interesting, as it might affect me too :) I have been observing deadlocks with HLockManagerImpl which don't get resolved for a long time, even though the columns holding the locks should only live for about 5-10 seconds. Any ideas how to investigate this further from the Cassandra side?

From: Tamar Fraenkel [ta...@tok-media.com] Sent: Thursday, 23 May 2013 11:58 To: user@cassandra.apache.org Subject: Re: column with TTL of 10 seconds lives very long...
RE: column with TTL of 10 seconds lives very long...
Maybe you didn't set the TTL correctly. Check the TTL of the column using CQL, e.g.: SELECT TTL(colName) FROM colFamilyName WHERE condition;

From: Felipe Sere [mailto:felipe.s...@1und1.de] Sent: Thursday, May 23, 2013 1:28 PM To: user@cassandra.apache.org Subject: Re: column with TTL of 10 seconds lives very long...
Re: column with TTL of 10 seconds lives very long...
Hi! TTL was set:

[default@HLockingManager] get HLocks['/LockedTopic/31a30c12-652d-45b3-9ac2-0401cce85517'];
=> (column=69b057d4-3578-4326-a9d9-c975cb8316d2, value=36396230353764342d333537382d343332362d613964392d633937356362383331366432, timestamp=1369307815049000, ttl=10)

Also, all other lock columns expire as expected. Thanks, Tamar

On Thu, May 23, 2013 at 1:58 PM, moshe.kr...@barclays.com wrote: Maybe you didn't set the TTL correctly. Check the TTL of the column using CQL, e.g.: SELECT TTL(colName) FROM colFamilyName WHERE condition;
RE: column with TTL of 10 seconds lives very long...
(Probably will not solve your problem, but worth mentioning): it's not enough to check that the clocks of all the servers are synchronized - I believe it is the client node that sets the timestamp for a record being written. So you should also check the clocks on your Hector client nodes.

From: Tamar Fraenkel [mailto:ta...@tok-media.com] Sent: Thursday, May 23, 2013 2:17 PM To: user@cassandra.apache.org Subject: Re: column with TTL of 10 seconds lives very long...
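(Editor's note: a quick worked check along these lines, using only the cli output quoted earlier in the thread. Cassandra write timestamps are microseconds since the epoch and are assigned by the writing client, so decoding the value shows which clock produced it; with GNU date:

1369307815049000 us  ->  1369307815 s since the epoch
$ date -u -d @1369307815
Thu May 23 11:16:55 UTC 2013

Comparing that decoded time against the clock on each Hector client host is a simple way to spot the skew Moshe describes.)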
Re: column with TTL of 10 seconds lives very long...
Good point! Tamar

On Thu, May 23, 2013 at 2:25 PM, moshe.kr...@barclays.com wrote: It's not enough to check that the clocks of all the servers are synchronized - you should also check the clocks on your Hector client nodes.
Re: Commit Log Magic
Sstables must be sorted by token, or we can't compact efficiently. Since writes usually do not arrive in token order, we stage them first in a memtable. (cc user@)

On Thu, May 23, 2013 at 8:44 AM, Ansar Rafique ansa...@hotmail.com wrote: Hi Jonathan, I am Ansar Rafique and I asked you a few questions two weeks ago about Cassandra implementation. I was watching your presentation where you suggested the page below. http://nosql.mypopescu.com/post/27684111441/cassandra-and-solid-state-drives I have a question, and I have tried to find the answer but haven't really gotten a satisfactory response yet. My question is why Cassandra uses a commit log for durability instead of writing directly to SSTables. Cassandra achieves high write throughput because it stores data first in a memtable and then flushes it to disk. Sounds good, but remember that Cassandra also writes to the commit log for durability. I checked, and the write to the memtable and the commit log is synchronous, which means it writes first to the commit log and waits until that completes before writing to the memtable, or vice versa. Writing a transaction to the commit log requires an I/O operation, which means that for each insert we need an I/O :( to write the data to the commit log, and later more I/Os to flush the data to disk again. Isn't writing to the commit log an overhead? Isn't it better to write the data directly to disk instead of to the commit log? Remember that I/O operations are expensive, and a reduction in I/Os means an improvement in performance. If we look at an RDBMS, it stores data in a commit log as well as on disk. Fair enough, but if we don't insert data into a commit log, its performance should be the same as Cassandra's, because it performs an I/O to insert data on disk and Cassandra also performs an I/O to insert data into the commit log. Is the commit log less expensive? I didn't really understand the magic :) Would you like to elaborate on it more? Thank you in advance for your time. Looking to hear from you. Regards, Ansar Rafique

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced
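(Editor's note: a toy Java model of the write path being described, to make the cost argument concrete. This is not Cassandra code; class, file names and the flush threshold are invented for illustration. The point is that the commit log is a sequential append to one already-open file, so durability costs one cheap append per mutation, while sorted sstables are only ever written in large sequential batches at flush time.

import java.io.*;
import java.util.*;

class ToyWritePath {
    private final DataOutputStream commitLog =
            new DataOutputStream(new BufferedOutputStream(new FileOutputStream("commitlog.bin", true)));
    private final SortedMap<String, String> memtable = new TreeMap<>(); // kept sorted, like sorting by token
    private int flushCount = 0;

    ToyWritePath() throws IOException {}

    void write(String key, String value) throws IOException {
        commitLog.writeUTF(key);            // sequential append: cheap durability, no seek
        commitLog.writeUTF(value);
        commitLog.flush();                  // the real thing fsyncs per commitlog_sync settings
        memtable.put(key, value);           // in-memory update
        if (memtable.size() >= 10_000)      // arbitrary flush threshold for the toy model
            flush();
    }

    private void flush() throws IOException {
        // one large, already-sorted, sequential write, amortized over many mutations
        try (PrintWriter sstable = new PrintWriter(new FileWriter("sstable-" + (++flushCount) + ".txt"))) {
            for (Map.Entry<String, String> e : memtable.entrySet())
                sstable.println(e.getKey() + "=" + e.getValue());
        }
        memtable.clear();
    }
}

Writing a sorted, immutable file per individual insert would instead turn every write into its own file creation plus later compaction work, which is the overhead the commit log avoids.)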
Re: High performance disk io
Hello Christopher, BTW, are you talking about 99th percentiles on the client side, or about percentiles from the Cassandra histograms for the CF on the Cassandra side? Thanks!

On 05/22/2013 05:41 PM, Christopher Wirt wrote: Hi Igor, yea, same here, 15ms for the 99th percentile is our max. Currently getting one or two ms for most CFs. It goes up at peak times, which is what we want to avoid. We're using Cass 1.2.4 w/vnodes and our own barebones driver on top of thrift. It needed to be .NET, so Hector and Astyanax were not options. Do you use SSDs, or multiple SSDs in any kind of configuration or RAID? Thanks, Chris

From: Igor [mailto:i...@4friends.od.ua] Sent: 22 May 2013 15:07 To: user@cassandra.apache.org Subject: Re: High performance disk io

Hello, what level of read performance do you expect? We have a limit of 15 ms for the 99th percentile, with average read latency near 0.9ms. For some CFs the 99th percentile actually equals 2ms, for others 10ms; this depends on the data volume you read in each query. Tuning read performance involved cleaning up the data model, tuning cassandra.yaml, switching from Hector to Astyanax, and tuning OS parameters.

On 05/22/2013 04:40 PM, Christopher Wirt wrote: Hello, we're looking at deploying a new ring where we want the best possible read performance. We've set up a cluster with 6 nodes, replication level 3, 32GB of memory, 8GB heap, 800MB keycache, each holding 40/50GB of data on a 200GB SSD, plus a 500GB SATA disk for OS and commitlog. Three column families: ColFamily1, 50% of the load and data; ColFamily2, 35% of the load and data; ColFamily3, 15% of the load and data. At the moment we are still seeing around 20% disk utilisation, and occasionally as high as 40/50% on some nodes at peak time.. we are conducting some semi-live testing. CPU looks fine, memory is fine, keycache hit rate is about 80% (could be better, so maybe we should be increasing the keycache size?). Anyway, we're looking into what we can do to improve this. One conversation we are having at the moment is around the SSD disk setup. We are considering moving to 3 smaller SSD drives and spreading the data across those. The possibilities are:
- We build a RAID0 of the smaller SSDs and hope that improves performance. Will this actually yield better throughput?
- We mount the SSDs to different directories and define multiple data directories in cassandra.yaml. Will not having a layer of RAID controller improve the throughput?
- We mount the SSDs to different column family directories and have a single data directory declared in cassandra.yaml. I think this is quite an attractive idea. What are the drawbacks? System column families would be on the main SATA?
- We don't change anything and just keep upping our keycache.
- Anything you guys can think of.
Ideas and thoughts welcome. Thanks for your time and expertise. Chris
RE: High performance disk io
Hi Igor, I was talking about the 99th percentile from the Cassandra histograms when I said '1 or 2 ms for most CFs'. But we have measured client side too, and generally get a couple of ms added on top.. as one might expect.

For anyone interested in disk IO (my original question): we have tried out the multiple-SSD setup and found it to work well and to reduce the impact of a repair on node performance. We ended up going with a single data directory in cassandra.yaml and mounting one SSD against that, then having a dedicated SSD per large column family. We're now moving all of our nodes to the same setup. Chris

From: Igor [mailto:i...@4friends.od.ua] Sent: 23 May 2013 15:00 To: user@cassandra.apache.org Subject: Re: High performance disk io
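(Editor's note: for reference, a sketch of what the multiple-data-directories option discussed in this thread looks like in cassandra.yaml; the mount points are placeholders, only data_file_directories and commitlog_directory are real settings:

data_file_directories:
    - /mnt/ssd1/cassandra/data
    - /mnt/ssd2/cassandra/data
    - /mnt/ssd3/cassandra/data
commitlog_directory: /mnt/sata/cassandra/commitlog

The per-column-family-on-its-own-SSD setup Chris describes instead keeps a single data directory and arranges the mounts at the filesystem level.)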
Re: High performance disk io
I have used both rotational disks with lots of RAM as well as SSD devices. An important thing to consider is that SSD devices are not magic. You have big-O growth in several places:
1) more data -> larger bloom filters
2) more data (larger key caches) -> more JVM overhead
3) more requests -> more young-gen JVM overhead
4) more data -> longer compaction (even with SSD)
5) more writes -> more memtable flushing
Bottom line: more data -> more disk seeks.
We have used both mid-level SSDs as well as the costly Fusion-io. Fitting in RAM/VFS cache delivers better, more predictable low latency; even with very fast disks the average, 95th, and 99th percentiles can end up very far apart. I am currently trying to really study the effect of the width of a row (being in multiple sstables) vs its 95th percentile read time.
Re: Cassandra 1.2 TTL histogram problem
Are you sure that it is a good idea to estimate remainingKeys like that? Since we don't want to scan every row to check overlap and cause heavy IO automatically, the method can only do a best-effort calculation. In your case, try running user-defined compaction on that sstable file. It goes through every row and removes tombstones when they are droppable.

On Wed, May 22, 2013 at 11:48 AM, cem cayiro...@gmail.com wrote: Thanks for the answer. It means that if we use RandomPartitioner it will be very difficult to find an sstable without any overlap. Let me give you an example from my test. I have ~50 sstables in total and an sstable with droppable ratio 0.9. I use a GUID for the key and only insert (no update/delete), so I don't expect a key to be in different sstables. I put extra logging into AbstractCompactionStrategy to see overlaps.size(), keys and remainingKeys: overlaps.size() is around 30, the number of keys for that sstable is around 5M, and remainingKeys is always 0. Are you sure that it is a good idea to estimate remainingKeys like that? Best Regards, Cem

On Wed, May 22, 2013 at 5:58 PM, Yuki Morishita mor.y...@gmail.com wrote: Can the method calculate non-overlapping keys as overlapping? Yes. And randomized keys don't matter here, since sstables are sorted by the token calculated from the key by your partitioner, and the method uses the sstable's min/max tokens to estimate overlap.

On Tue, May 21, 2013 at 4:43 PM, cem cayiro...@gmail.com wrote: Thank you very much for the swift answer. I have one more question about the second part. Can the method calculate non-overlapping keys as overlapping? I mean, it uses the max and min tokens and the column count; they can be very close to each other if random keys are used. In my use case I generate a GUID for each key and send a single write request. Cem

On Tue, May 21, 2013 at 11:13 PM, Yuki Morishita mor.y...@gmail.com wrote: Why does Cassandra single-table compaction skip the keys that are in the other sstables? Because we don't want to resurrect deleted columns. Say sstable A has a column with timestamp 1, and sstable B has the same column deleted at timestamp 2. If we purge that column only from sstable B, we would see the column with timestamp 1 again. I also don't understand why we have this line in the worthDroppingTombstones method: What the method is trying to do is guess how many columns are in the rows that don't overlap, without actually going through every row in the sstable. We have statistics like the column count histogram and the min and max row token for every sstable, and we use those in the method to estimate how much the two sstables overlap. You may get a remainingColumnsRatio of 0 when the two sstables overlap almost entirely.

On Tue, May 21, 2013 at 3:43 PM, cem cayiro...@gmail.com wrote: Hi all, I have a question about ticket https://issues.apache.org/jira/browse/CASSANDRA-3442. Why does Cassandra single-table compaction skip the keys that are in the other sstables? Please correct me if I am wrong. I also don't understand why we have this line in the worthDroppingTombstones method: double remainingColumnsRatio = ((double) columns) / (sstable.getEstimatedColumnCount().count() * sstable.getEstimatedColumnCount().mean()); remainingColumnsRatio is always 0 in my case while the droppableRatio is 0.9, so Cassandra skips all sstables which are already expired. This line was introduced by https://issues.apache.org/jira/browse/CASSANDRA-4022.
Best Regards, Cem -- Yuki Morishita t:yukim (http://twitter.com/yukim)
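(Editor's note: a hedged sketch of the user-defined compaction Yuki suggests, invoked over JMX from Java. The CompactionManager MBean and the forceUserDefinedCompaction operation exist in this era of Cassandra, but the exact argument list varies between versions, and the keyspace and sstable file names below are placeholders.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ForceUserDefinedCompaction {
    public static void main(String[] args) throws Exception {
        // 7199 is Cassandra's default JMX port
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName cm = new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            // keyspace and Data.db file name are placeholders; pass the sstable
            // that holds the expired-but-undropped rows
            mbs.invoke(cm, "forceUserDefinedCompaction",
                       new Object[]{"MyKeyspace", "MyCF-hc-1234-Data.db"},
                       new String[]{"java.lang.String", "java.lang.String"});
        } finally {
            connector.close();
        }
    }
}

The same operation can also be invoked interactively from jconsole against the CompactionManager MBean.)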
Re: exception causes streaming to hang forever
What kind of error does the other end of the streaming (/10.10.42.36) report?

On Wed, May 22, 2013 at 5:19 PM, Hiller, Dean dean.hil...@nrel.gov wrote: We had 3 nodes roll in fine, and on the next 2 we see a remote node hit this exception every time we start over and bootstrap the node:

ERROR [Streaming to /10.10.42.36:2] 2013-05-22 14:47:59,404 CassandraDaemon.java (line 132) Exception in thread Thread[Streaming to /10.10.42.36:2,5,main]
java.lang.RuntimeException: java.io.IOException: Input/output error
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Input/output error
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:405)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:506)
at org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:90)
at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
... 3 more

Are there any ideas what this is? Google doesn't really show any useful advice on this, and our node has not joined the ring yet, so I don't think we can run a repair just yet to avoid it and try syncing via another means. It seems that on a streaming failure it never recovers. Any ideas? We are on Cassandra 1.2.2. Thanks, Dean

-- Yuki Morishita t:yukim (http://twitter.com/yukim)
Re: write time of CQL3 set items
Does anyone know a way I could expose the write time of set items? You cannot currently, unfortunately. The problem is really just an API one. Since currently you can only ever query a full collection, you cannot apply writeTime() to only an element, and applying it to the whole collection doesn't make sense, in the sense that each element has its own write time, as you said. We'll likely allow querying individual elements of collections in the future, at which point getting the write time of an individual element will work. But as of today we just don't have a syntax to make it work. -- Sylvain
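(Editor's note: to make the API limitation concrete, a hedged CQL 3 sketch; table, column names and the key value are hypothetical. writetime() is accepted on a plain column, but there is no syntax for addressing a single element of a collection, so the second query is not expressible:

SELECT writetime(title) FROM songs WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;  -- works on a regular column
-- SELECT writetime(tags['rock']) FROM songs WHERE id = ...;  -- no such syntax for an element of a set today
)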
Re: Creating namespace and column family from multiple nodes concurrently
Hi Arthur and Farraz, thank you for getting back to me. I am trying to avoid synchronization among the concurrent instances, which is why I prefer Option - 2. Further, in my application I have a reasonable window between the application initialization phase and the application runtime, so as long as Cassandra can safely handle concurrent creation I should be fine. Do you have any idea how Cassandra is going to handle concurrent namespace and column family creation (here all the instances are going to create the same namespace and column families concurrently)?
- Does Cassandra take much time to agree on a final schema (in case Cassandra is using some sort of exponential back-off algorithm to handle schema conflicts)?
- Or is it going to result in schema conflicts which need manual intervention?
- Or will this result in race conditions?
- Or some other issues, e.g. memory/CPU/network bottlenecks?
Thank you, Emalayan

From: Arthur Zubarev arthur.zuba...@aol.com To: user@cassandra.apache.org; svemala...@yahoo.com Sent: Wednesday, 22 May 2013 8:07 PM Subject: Re: Creating namespace and column family from multiple nodes concurrently

I am assuming here you want to sync all the 100s of nodes once the application is airborne. I suspect this would flood the network and even potentially affect the machine itself memory-wise. How are you going to maintain the nodes (compaction + repair)? Regards, Arthur

-----Original Message----- From: Emalayan Vairavanathan svemala...@yahoo.com To: user user@cassandra.apache.org Sent: Wed, May 22, 2013 8:31 pm Subject: Creating namespace and column family from multiple nodes concurrently

Hi all, I am implementing a distributed application which runs on 100s of machines concurrently. This application is going to use Cassandra as the underlying storage. The application creates the schema (namespace and column families) during an initialization phase. It seems I have two options to create the schema.
Option - 1: Use a single node for schema creation.
Option - 2: Have all the nodes (100+) run the same schema creation logic (first, nodes check whether the schema is already available, and then try to create the schema if it is not available already).
To keep the initialization phase simple, I prefer to go with Option - 2. However, I am not sure how Cassandra is going to behave if multiple nodes try to create the same schema (namespace and column families) concurrently. It would be nice if someone could tell me about the implications of Option - 2 with Cassandra version 1.2.2. Please let me know if you have questions. Thank you, VE
Re: Creating namespace and column family from multiple nodes concurrently
Would each device/machine have its own keyspace? Basically, your client needs to take care of verifying a successful creation of the schema, plus any other checks, and that is going to be time-consuming.

From: Emalayan Vairavanathan Sent: Thursday, May 23, 2013 3:07 PM To: user@cassandra.apache.org Subject: Re: Creating namespace and column family from multiple nodes concurrently
Re: Creating namespace and column family from multiple nodes concurrently
Would each device/machine have its own keyspace? No. All the machines are going to run exactly the same CQL commands and create the same namespace and column families. Thank you, Emalayan

From: Arthur Zubarev arthur.zuba...@aol.com To: Emalayan Vairavanathan svemala...@yahoo.com; user@cassandra.apache.org Sent: Thursday, 23 May 2013 12:20 PM Subject: Re: Creating namespace and column family from multiple nodes concurrently
Re: Creating namespace and column family from multiple nodes concurrently
On Thu, May 23, 2013 at 12:07 PM, Emalayan Vairavanathan svemala...@yahoo.com wrote: Do you have any idea how Cassandra is going to handle concurrent namespace and column family creation (here all the instances are going to create the same namespace and column families concurrently)? [...] However I am not sure how Cassandra is going to behave if multiple nodes try to create the same schema (namespace and column families) concurrently. It would be nice if someone could tell me about the implications of Option - 2 with Cassandra version 1.2.2.

Concurrent CREATE is allegedly working in 1.2.0, per NEWS.txt [1]. I say allegedly working because this feature was also allegedly working in 1.1.0. Given past experience, I continue to (perhaps pessimistically) believe that frequent dynamic updates of schema are likely to result in schema desynch. I would be interested to hear if you go down this route and do not encounter problems. See also CASSANDRA-3794 [2] for details. =Rob

[1] https://github.com/apache/cassandra/blob/cassandra-1.2/NEWS.txt
[2] https://issues.apache.org/jira/browse/CASSANDRA-3794
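(Editor's note: a minimal sketch of the "Option - 2" flow being discussed: check, create if absent, tolerate losing the race, then wait for schema agreement before continuing. SchemaClient is a hypothetical wrapper interface, not a real Hector/Thrift/CQL API; it stands in for whatever driver is actually used.

interface SchemaClient {
    boolean schemaExists(String keyspace, String columnFamily);
    void createSchema(String keyspace, String columnFamily) throws Exception;          // may race with other nodes
    boolean waitForSchemaAgreement(long timeoutMillis) throws InterruptedException;    // e.g. poll describe_schema_versions
}

class SchemaBootstrap {
    static boolean ensureSchema(SchemaClient client) throws InterruptedException {
        for (int attempt = 1; attempt <= 5; attempt++) {
            if (client.schemaExists("MyKeyspace", "MyCF"))   // another node may already have created it
                return true;
            try {
                client.createSchema("MyKeyspace", "MyCF");
            } catch (Exception raceOrDisagreement) {
                // another node may have won the race; fall through and re-check
            }
            client.waitForSchemaAgreement(30_000);           // wait until all nodes report one schema version
            Thread.sleep(1000L * attempt);                   // back off before re-checking
        }
        return false;                                        // give up; needs manual investigation
    }
}

Given Rob's caveat about schema desynch, funnelling creation through a single node (Option - 1) avoids the race entirely and is the simpler design.)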
Re: Creating namespace and column family from multiple nodes concurrently
So where are the multiple nodes? I am just puzzled.

From: Emalayan Vairavanathan Sent: Thursday, May 23, 2013 3:43 PM To: Arthur Zubarev; user@cassandra.apache.org Subject: Re: Creating namespace and column family from multiple nodes concurrently
Re: column with TTL of 10 seconds lives very long...
On Wed, May 22, 2013 at 11:32 PM, Tamar Fraenkel ta...@tok-media.com wrote: I am using Hector HLockManagerImpl, which creates a keyspace named HLockManagerImpl and CF HLocks. For some reason I have a row with a single column that should have expired yesterday but is still there. I tried deleting it using the cli, but it is stuck... Any ideas how to delete it?

"Is still there" is sorta ambiguous. Do you mean that clients see it, or that it is still in the (immutable) data file it was previously in? If the latter, what is gc_grace_seconds set to? Make sure it's set to a low value, and then make sure that your TTL-expired key is compacted. =Rob
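(Editor's note: a hedged sketch of that suggestion using the tools already shown in the thread. The keyspace and CF names are the ones mentioned above, the gc_grace value is only an example, and lowering gc_grace trades away protection against deleted data reappearing if a replica misses the deletion and is not repaired in time:

[default@HLockManagerImpl] update column family HLocks with gc_grace = 60;
$ nodetool -h localhost compact HLockManagerImpl HLocks
)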
Re: Cassandra read repair
If you are reading and writing at CL QUORUM and getting inconsistent results, that sounds like a bug. If you are mixing the CL levels such that R + W <= N, then it's expected behaviour. Can you reproduce the issue outside of your app? Cheers - Aaron Morton, Freelance Cassandra Consultant, New Zealand, @aaronmorton, http://www.thelastpickle.com

On 21/05/2013, at 8:55 PM, Kais Ahmed k...@neteck-fr.com wrote: Checking you do not mean the row key is corrupt and cannot be read. Yes, I can read it, but reads don't all return the same result, except at CL ALL. By default in 1.X and beyond the default read repair chance is 0.1, so it's only enabled on 10% of requests. You are right, read repair chance is set to 0.1, but I launched a read repair which did not solve the problem. Any idea? What CL are you writing at? All writes are at CL QUORUM. Thank you, Aaron, for your answer.

2013/5/21 aaron morton aa...@thelastpickle.com: Only some keys of one CF are corrupt. Checking you do not mean the row key is corrupt and cannot be read. I thought using CL ALL would correct the problem with READ REPAIR, but on returning to CL QUORUM, the problem persists. By default in 1.X and beyond the default read repair chance is 0.1, so it's only enabled on 10% of requests. In the absence of further writes, all reads (at any CL) should return the same value. What CL are you writing at? Cheers - Aaron Morton

On 19/05/2013, at 1:28 AM, Kais Ahmed k...@neteck-fr.com wrote: Hi all, I have encountered a consistency problem on some keys using phpcassa and Cassandra 1.2.3 since a server crash. Only some keys of one CF are corrupt. I launched a nodetool repair that completed successfully but did not correct the issue. When I try to get a corrupt key with CL ONE, the result contains 7, 8 or 9 columns; with CL QUORUM, the result contains 8 or 9 columns; with CL ALL, the data is consistent and always returns 9 columns. I thought using CL ALL would correct the problem with READ REPAIR, but on returning to CL QUORUM, the problem persists. Thank you for your help
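(Editor's note: a quick worked example of the R + W rule Aaron refers to, assuming replication factor N = 3. QUORUM writes and QUORUM reads give W = 2 and R = 2, so R + W = 4 > 3 and every read overlaps the latest write on at least one replica. A QUORUM write with a ONE read gives R + W = 3, which is not greater than N, so a read can legitimately miss the most recent write until read repair or an anti-entropy repair catches up.)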
Re: Cassandra hangs on large hinted handoffs
For some reason the 1.0.7 hints actually use a super column :)

On Thu, May 23, 2013 at 6:18 PM, aaron morton aa...@thelastpickle.com wrote: I know how this sounds, but upgrading to 1.1.11 is the best approach. 1.0.X is not getting any fixes, 1.1.X is the most stable and is still getting some patches, and 1.2 is stable and in use. Hint storage has been redesigned in 1.2.

Any suggestions on how to make the cluster more tolerant to downtimes? Hints are always seen as an optimisation; their success or otherwise does not impact the consistency guarantees. If you are dealing with a very high throughput, as a workaround you can reduce the time that hints are stored for a down node; see the yaml file for info. The behaviour changes depending on whether you have lots of small or large columns; this is the code from HintedHandoffManager that selects the page size:

int pageSize = PAGE_SIZE;
// read less columns (mutations) per page if they are very large
if (hintStore.getMeanColumns() > 0)
{
    int averageColumnSize = (int) (hintStore.getMeanRowSize() / hintStore.getMeanColumns());
    pageSize = Math.min(PAGE_SIZE, DatabaseDescriptor.getInMemoryCompactionLimit() / averageColumnSize);
    // page size of 1 does not allow actual paging b/c of >= behavior on startColumn
    pageSize = Math.max(2, pageSize);
    logger_.debug("average hinted-row column size is {}; using pageSize of {}", averageColumnSize, pageSize);
}

If you reduce the in_memory_compaction_limit yaml setting, that would reduce the page size. Cheers - Aaron Morton, Freelance Cassandra Consultant, New Zealand, @aaronmorton, http://www.thelastpickle.com

On 21/05/2013, at 9:26 PM, Vladimir Volkov vlad.vol...@gmail.com wrote: Hello. I'm stress-testing our Cassandra (version 1.0.9) cluster, and tried turning off two of the four nodes for half an hour under heavy load. As a result I got a large volume of hints on the alive nodes - HintsColumnFamily takes about 1.5 GB of disk space on each of the nodes. It seems these hints are never replayed successfully. After I bring the other nodes back online, tpstats shows active handoffs, but I can't see any writes on the target nodes. The log indicates memory pressure - the heap is 80% full (heap size is 8GB total, 1GB young). A fragment of the log:

INFO 18:34:05,513 Started hinted handoff for token: 1 with IP: /84.201.162.144
INFO 18:34:06,794 GC for ParNew: 300 ms for 1 collections, 5974181760 used; max is 8588951552
INFO 18:34:07,795 GC for ParNew: 263 ms for 1 collections, 6226018744 used; max is 8588951552
INFO 18:34:08,795 GC for ParNew: 256 ms for 1 collections, 6559918392 used; max is 8588951552
INFO 18:34:09,796 GC for ParNew: 231 ms for 1 collections, 6846133712 used; max is 8588951552
WARN 18:34:09,805 Heap is 0.7978131149667941 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory.
WARN 18:34:09,805 Flushing CFS(Keyspace='test', ColumnFamily='t2') to relieve memory pressure
INFO 18:34:09,806 Enqueuing flush of Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
INFO 18:34:09,807 Writing Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
INFO 18:34:11,018 GC for ParNew: 449 ms for 2 collections, 6573394480 used; max is 8588951552
INFO 18:34:12,019 GC for ParNew: 265 ms for 1 collections, 6820930056 used; max is 8588951552
INFO 18:34:13,112 GC for ParNew: 331 ms for 1 collections, 6900566728 used; max is 8588951552
INFO 18:34:14,181 GC for ParNew: 269 ms for 1 collections, 7101358936 used; max is 8588951552
INFO 18:34:14,691 Completed flushing /mnt/raid/cassandra/data/test/t2-hc-244-Data.db (56156246 bytes)
INFO 18:34:15,381 GC for ParNew: 280 ms for 1 collections, 7268441248 used; max is 8588951552
INFO 18:34:35,306 InetAddress /84.201.162.144 is now dead.
INFO 18:34:35,306 GC for ConcurrentMarkSweep: 19223 ms for 1 collections, 3774714808 used; max is 8588951552
INFO 18:34:35,309 InetAddress /84.201.162.144 is now UP

After taking off the load and restarting the service, I still see pending handoffs:

$ nodetool -h localhost tpstats
Pool Name                 Active   Pending   Completed   Blocked   All time blocked
ReadStage                      0         0     1004257         0                  0
RequestResponseStage           0         0       92555         0                  0
MutationStage                  0         0           6         0                  0
ReadRepairStage                0         0       57773         0                  0
ReplicateOnWriteStage          0         0           0         0                  0
GossipStage                    0         0      143332         0                  0
AntiEntropyStage               0         0           0         0                  0
MigrationStage
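To see how in_memory_compaction_limit feeds into the hint page size quoted above, here is a small standalone Java sketch that mirrors the same arithmetic with made-up numbers (PAGE_SIZE and the mean row/column figures are assumptions, not values from this cluster):

public class HintPageSizeExample {
    static final int PAGE_SIZE = 512; // assumed default, for illustration only

    static int pageSize(long meanRowSize, long meanColumns, long inMemoryCompactionLimitBytes) {
        int pageSize = PAGE_SIZE;
        // Read fewer hinted mutations per page when the average mutation is large.
        if (meanColumns > 0) {
            int averageColumnSize = (int) (meanRowSize / meanColumns);
            pageSize = Math.min(PAGE_SIZE, (int) (inMemoryCompactionLimitBytes / averageColumnSize));
            pageSize = Math.max(2, pageSize);
        }
        return pageSize;
    }

    public static void main(String[] args) {
        long meanRow = 10L * 1024 * 1024;  // hypothetical 10 MB mean hint row
        long meanCols = 100;               // hypothetical 100 hinted mutations per row, ~100 KB each
        System.out.println(pageSize(meanRow, meanCols, 64L * 1024 * 1024)); // 64 MB limit -> capped at 512
        System.out.println(pageSize(meanRow, meanCols, 8L * 1024 * 1024));  // 8 MB limit  -> about 80 per page
    }
}

The point is simply that a lower in_memory_compaction_limit divides down to a smaller page, so each hint replay batch holds less data in memory at once.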
Re: For those using Cassandra from .Net
Thanks, when and where is the talk? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 23/05/2013, at 6:42 AM, Peter Lin wool...@gmail.com wrote: NativeX is giving a talk about using Cassandra with .Net. Our firm created a port of Hector over to .Net late last year. Here is the abstract. The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop Speakers: Derek Bromenshenkel and Jeff Smoley, Infrastructure Architects at NativeX NativeX (formerly W3i) recently transitioned a large portion of their backend infrastructure from Microsoft SQL Server to Apache Cassandra. Today, its Cassandra cluster backs its mobile advertising network supporting over 10 million daily active users that produce over 10,000 transactions per second with an average database request latency of under 2 milliseconds. Come hear our story about how we were successful at getting our .NET web apps to reliably connect to Cassandra. Come learn about FluentCassandra, Snowflake, Hector, and IKVM. It's a story of struggle and perseverance, where everyone lives happily ever after.
Re: High performance disk io
I am currently trying to really study the effect of the width of a row (being in multiple sstables) vs its 95th percentile read time. I'd be interested to see your findings. I use 3+ SSTables per read (from cfhistograms) as a warning sign to dig deeper into the data model. Also, the type of query impacts the number of SSTables per read; queries by column name can short circuit and may be served from (say) 0 or 1 SSTables even if the row is spread out. - We don’t change anything and just keep upping our keycache. 800MB is a very high key cache and may result in poor GC performance, which is ultimately going to hurt your read latency. Pay attention to what GC is doing, both ParNew and CMS, and reduce the key cache if needed. When ParNew runs the server is stalled. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 24/05/2013, at 3:16 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I have used both rotating disks with lots of RAM as well as SSD devices. An important thing to consider is that SSD devices are not magic. You have big-O notation in several places: 1) more data = larger bloom filters 2) more data = larger key caches = JVM overhead 3) more requests = more young gen JVM overhead 4) more data = longer compaction (even with SSD) 5) more writes = more memtable flushing. Bottom line: more data = more disk seeks. We have used both the mid-level SSDs as well as the costly Fusion-io. Fitting in RAM/VFS cache delivers better, more predictable low latency; even with very fast disks the average, 95th, and 99th percentiles can end up very far apart. I am currently trying to really study the effect of the width of a row (being in multiple sstables) vs its 95th percentile read time. On Thu, May 23, 2013 at 10:43 AM, Christopher Wirt chris.w...@struq.com wrote: Hi Igor, I was talking about the 99th percentile from the Cassandra histograms when I said ‘1 or 2 ms for most cf’. But we have measured client side too and generally get a couple of ms added on top, as one might expect. For anyone interested - disk IO (my original question): we have tried out the multiple SSD setup and found it to work well and reduce the impact of a repair on node performance. We ended up going with a single data directory in cassandra.yaml and mounting one SSD against that, then having a dedicated SSD per large column family. We’re now moving all of our nodes to the same setup. Chris From: Igor [mailto:i...@4friends.od.ua] Sent: 23 May 2013 15:00 To: user@cassandra.apache.org Subject: Re: High performance disk io Hello Christopher, BTW, are you talking about 99th percentiles on the client side, or about percentiles from the Cassandra histograms for the CF on the Cassandra side? Thanks! On 05/22/2013 05:41 PM, Christopher Wirt wrote: Hi Igor, Yes, same here, 15ms for the 99th percentile is our max. Currently getting one or two ms for most CFs. It goes up at peak times, which is what we want to avoid. We’re using Cass 1.2.4 w/vnodes and our own barebones driver on top of thrift. Needed to be .NET, so Hector and Astyanax were not options. Do you use SSDs, or multiple SSDs in any kind of configuration or RAID? Thanks Chris From: Igor [mailto:i...@4friends.od.ua] Sent: 22 May 2013 15:07 To: user@cassandra.apache.org Subject: Re: High performance disk io Hello What level of read performance do you expect? We have a limit of 15 ms for the 99th percentile, with average read latency near 0.9 ms. For some CFs the 99th percentile is actually 2 ms, for others 10 ms; this depends on the data volume you read in each query.
Tuning read performance involved cleaning up the data model, tuning cassandra.yaml, switching from Hector to Astyanax, and tuning OS parameters. On 05/22/2013 04:40 PM, Christopher Wirt wrote: Hello, We’re looking at deploying a new ring where we want the best possible read performance. We’ve set up a cluster with 6 nodes, replication factor 3, 32GB of memory, 8GB heap, 800MB keycache, each node holding 40-50GB of data on a 200GB SSD, plus a 500GB SATA disk for OS and commitlog. Three column families: ColFamily1 - 50% of the load and data, ColFamily2 - 35% of the load and data, ColFamily3 - 15% of the load and data. At the moment we are still seeing around 20% disk utilisation, and occasionally as high as 40-50% on some nodes at peak time; we are conducting some semi-live testing. CPU looks fine, memory is fine, keycache hit rate is about 80% (could be better, so maybe we should be increasing the keycache size?). Anyway, we’re looking into what we can do to improve this. One conversation we are having at the moment is around the SSD disk setup. We are considering moving to 3 smaller SSD drives and spreading the data across those. The possibilities
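As a rough back-of-envelope illustration of why an 800MB key cache worries Aaron from a GC point of view, here is a small standalone Java sketch; the per-entry overhead is an assumption for illustration, not a measured Cassandra figure:

public class KeyCacheEstimate {
    public static void main(String[] args) {
        long cacheBytes = 800L * 1024 * 1024;  // key cache size from this thread
        long bytesPerEntry = 100;              // assumed average: key bytes + sstable offset + bookkeeping
        long entries = cacheBytes / bytesPerEntry;
        // Roughly 8.4 million long-lived entries live in the old generation of the 8GB heap,
        // so every CMS cycle has to trace them, and cache churn adds promotion pressure.
        System.out.printf("~%,d cached keys resident on the heap%n", entries);
    }
}

If shrinking the cache brings ParNew/CMS pause times down, some of the lost hit rate can often be absorbed by the OS page cache on the SSDs.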
Re: Creating namespace and column family from multiple nodes concurrently
I am sorry if I was not clear. I was using 'nodes' to refer to machines (or vice versa). Let me put it another way... The application is composed of multiple instances of an executable. The application runs on multiple machines concurrently. All the instances are going to issue the same CQL commands and try to create exactly the same namespace and column families. Thank you Emalayan From: Arthur Zubarev arthur.zuba...@aol.com To: Emalayan Vairavanathan svemala...@yahoo.com; user@cassandra.apache.org Sent: Thursday, 23 May 2013 1:15 PM Subject: Re: Creating namespace and column family from multiple nodes concurrently So where are the multiple nodes? I am just puzzled. From: Emalayan Vairavanathan Sent: Thursday, May 23, 2013 3:43 PM To: Arthur Zubarev ; user@cassandra.apache.org Subject: Re: Creating namespace and column family from multiple nodes concurrently Would each device/machine have its own keyspace? No. All the machines are going to run exactly the same CQL commands and create the same namespace and column families. Thank you Emalayan From: Arthur Zubarev arthur.zuba...@aol.com To: Emalayan Vairavanathan svemala...@yahoo.com; user@cassandra.apache.org Sent: Thursday, 23 May 2013 12:20 PM Subject: Re: Creating namespace and column family from multiple nodes concurrently Would each device/machine have its own keyspace? Basically, your client needs to take care of successful creation of the schema and any other verifications, and it is going to be time consuming. From: Emalayan Vairavanathan Sent: Thursday, May 23, 2013 3:07 PM To: user@cassandra.apache.org Subject: Re: Creating namespace and column family from multiple nodes concurrently Hi Arthur and Farraz, Thank you for getting back to me. I am trying to avoid synchronization among concurrent instances, and this is why I prefer Option - 2. Further, in my application I have a reasonable window between the application initialization phase and the application runtime. So as long as Cassandra can safely handle concurrent creation I should be fine. Do you have any idea how Cassandra is going to handle concurrent namespace and column family creation (here all the instances are going to create the same namespace and column families concurrently)? - Does Cassandra take much time to agree on a final schema (in case Cassandra uses some sort of exponential back-off algorithm to handle schema conflicts)? - Or is it going to result in schema conflicts which need manual intervention? - Or will this result in race conditions? - Or some other issues, e.g. memory/CPU/network bottlenecks? Thank you Emalayan From: Arthur Zubarev arthur.zuba...@aol.com To: user@cassandra.apache.org; svemala...@yahoo.com Sent: Wednesday, 22 May 2013 8:07 PM Subject: Re: Creating namespace and column family from multiple nodes concurrently I am assuming here you want to sync all the 100s of nodes once the application is airborne. I suspect this would flood the network and even potentially affect the machines themselves memory-wise. How are you going to maintain the nodes (compaction + repair)? Regards, Arthur -Original Message- From: Emalayan Vairavanathan svemala...@yahoo.com To: user user@cassandra.apache.org Sent: Wed, May 22, 2013 8:31 pm Subject: Creating namespace and column family from multiple nodes concurrently Hi all, I am implementing a distributed application which runs on 100s of machines concurrently. This application is going to use Cassandra as the underlying storage.
The application creates the schema (namespace and column families) during its initialization phase. It seems I have two options to create the schema. Option - 1: Use a single node for schema creation. Option - 2: Have all the nodes (> 100) run the same schema creation logic (first, each node will check whether the schema is already available and then try to create it if it is not there already). To keep the initialization phase simple, I prefer to go for Option - 2. However I am not sure how Cassandra is going to behave if multiple nodes try to create the same schema (namespace and column families) concurrently. It would be nice if someone could tell me about the implications of Option - 2 with Cassandra version 1.2.2. Please let me know if you have questions. Thank you VE
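For what Option - 2 looks like in code, here is a minimal Java sketch of the check-then-create pattern using Hector (Hector is used purely for illustration since it appears elsewhere on this list; the cluster, keyspace, and column family names are made up, and concurrent instances can still race between the check and the create, so the create must tolerate an "already exists" error):

import java.util.Arrays;

import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.ComparatorType;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.exceptions.HInvalidRequestException;
import me.prettyprint.hector.api.factory.HFactory;

public class SchemaBootstrap {
    public static void ensureSchema(Cluster cluster) {
        // Check first: if another instance already created the keyspace, do nothing.
        if (cluster.describeKeyspace("MyApp") != null)
            return;

        ColumnFamilyDefinition cf =
                HFactory.createColumnFamilyDefinition("MyApp", "MyData", ComparatorType.UTF8TYPE);
        KeyspaceDefinition ks = HFactory.createKeyspaceDefinition(
                "MyApp", ThriftKsDef.DEF_STRATEGY_CLASS, 3, Arrays.asList(cf));
        try {
            // true = block until the schema change has been applied before returning.
            cluster.addKeyspace(ks, true);
        } catch (HInvalidRequestException e) {
            // Another instance won the race and created the keyspace first; treat as success.
        }
    }
}

The same check-then-create-and-tolerate-exists pattern applies with a CQL client: having every instance blindly issue identical CREATE statements at startup is the scenario most likely to produce schema disagreement, so letting one instance win and the others back off keeps the initialization simple without a separate coordinator.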