Re: C* 2.1.2 invokes oom-killer
After a couple of days it's still behaving fine. Case closed.

On Thu, Feb 19, 2015 at 11:15 PM, Michał Łowicki mlowi...@gmail.com wrote: Upgrade to 2.1.3 seems to help so far. After ~12 hours total memory consumption grew from 10GB to 10.5GB.

On Thu, Feb 19, 2015 at 2:02 PM, Carlos Rolo r...@pythian.com wrote: Then you are probably hitting a bug... I'm trying to find it in Jira. The bad news is that the fix is only to be released in 2.1.4. Once I find it I will post it here. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com

On Thu, Feb 19, 2015 at 12:16 PM, Michał Łowicki mlowi...@gmail.com wrote: trickle_fsync has been enabled for a long time in our settings (just noticed):

trickle_fsync: true
trickle_fsync_interval_in_kb: 10240

On Thu, Feb 19, 2015 at 12:12 PM, Michał Łowicki mlowi...@gmail.com wrote: On Thu, Feb 19, 2015 at 11:02 AM, Carlos Rolo r...@pythian.com wrote: Do you have trickle_fsync enabled? Try enabling that and see if it solves your problem, since you are running out of non-heap memory. Another question: is it always the same nodes that die? Or is it 2 out of 4 that die? Always the same nodes. We upgraded to 2.1.3 two hours ago, so we'll monitor whether the issue has been fixed there. If not, we will try to enable trickle_fsync.

On Thu, Feb 19, 2015 at 10:49 AM, Michał Łowicki mlowi...@gmail.com wrote: On Thu, Feb 19, 2015 at 10:41 AM, Carlos Rolo r...@pythian.com wrote: So compaction doesn't seem to be your problem (you can check with nodetool compactionstats just to be sure). pending tasks: 0 How much is your write latency on your column families?
I had OOMs related to this before, and there was a tipping point around 70ms. Write request latency is below 0.05 ms/op (avg). Checked with OpsCenter. -- BR, Michał Łowicki
Commitlog activities
Hi! I have the following keyspaces:

cqlsh> SELECT * FROM system.schema_keyspaces;

 keyspace_name | durable_writes | strategy_class                              | strategy_options
---------------+----------------+---------------------------------------------+----------------------------
 system        |           True |  org.apache.cassandra.locator.LocalStrategy | {}
 system_traces |          False | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"2"}
 a1_ks         |          False | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}

I have two disks. The data directory is on sda; the commitlog is on sdb. I do 100% writes into a1_ks.user_table. Watching IO activity I noticed that C* writes something (mutations) into the commitlog. This is strange, because I disabled durable writes for a1_ks. Maybe it's system activity being flushed into the commitlog? -- Thanks, Serj
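For context on the durable_writes flag seen above: it is a per-keyspace option, so a1_ks would have been defined with something along these lines (a sketch reconstructed from the schema_keyspaces row, not the actual DDL used):

```sql
-- durable_writes = false bypasses the commitlog for writes to this
-- keyspace only; the system keyspaces keep durable_writes = true, so
-- their mutations (schema changes, hints, etc.) still hit the commitlog.
CREATE KEYSPACE a1_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
  AND durable_writes = false;
```

So some commitlog traffic from system-keyspace activity is expected even when every user keyspace has durable writes disabled.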
Re: Why no virtual nodes for Cassandra on EC2?
Hi mck, I'm not familiar with this ticket, but my understanding was that performance of Hadoop jobs on C* clusters with vnodes was poor because a given Hadoop input split has to run many individual scans (one for each vnode) rather than just a single scan. I've run C* and Hadoop in production with a custom input format that used vnodes (it just combined multiple vnodes into a single input split) and didn't have any issues (the jobs had many other performance bottlenecks besides starting multiple scans from C*). This is one of the videos where I recall an off-hand mention of the Spark connector working with vnodes: https://www.youtube.com/watch?v=1NtnrdIUlg0 Best regards, Clint On Sat, Feb 21, 2015 at 2:58 PM, mck m...@apache.org wrote: At least the problem of Hadoop and vnodes described in CASSANDRA-6091 doesn't apply to Spark. (Spark already allows multiple token ranges per split.) If this is the reason why DSE hasn't enabled vnodes, then fingers crossed that'll change soon. Some of the DataStax videos that I watched discussed how the Cassandra Spark connector has optimizations to deal with vnodes. Are these videos public? If so, got any link to them? ~mck
Re: Running Cassandra + Spark on AWS - architecture questions
These are both good suggestions, thanks! I thought I had remembered reading that different virtual datacenters should always have the same number of nodes. I think I was mistaken about that. In the past we had major issues running huge analytics jobs on data stored in HBase (it would bring down our real-time performance), so this capability of Cassandra is great! Best regards, Clint On Sun, Feb 22, 2015 at 8:02 AM, Eric Stevens migh...@gmail.com wrote: I'm not sure if this is a good use case for you, but you might also consider setting up several keyspaces: one for any data you want available for analytics (such as business object tables), and one for data you don't want to do analytics on (such as custom secondary indices). Maybe a third one for data which should only exist in the analytics space, such as temporary rollup data. This can reduce the amount of data you replicate into your analytics space, and allow you to run a smaller analytics cluster than your production cluster. On Fri, Feb 20, 2015 at 2:43 PM, DuyHai Doan doanduy...@gmail.com wrote: "Cassandra would take care of keeping the data synced between the two sets of five nodes. Is that correct?" Correct. "But doing so means that we need 2x as many nodes as we need for the real-time cluster alone" Not necessarily. With multi-DC you can configure the replication factor per DC, meaning that you can have RF = 3 for the real-time DC and RF = 1 or RF = 2 for the analytics DC. Thus the number of nodes can be different for each DC. In addition, you can also tune the hardware. If the real-time DC is mostly write-only for incoming data and read-only from aggregated tables, it is less IO-intensive than the analytics DC, which does a lot of reads and writes to compute aggregations. On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly clint.ke...@gmail.com wrote: Hi all, I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed-workload Cassandra + Spark installation would look like, especially on AWS.
What I gather is that you use OpsCenter to set up the following: - One virtual data center for real-time processing (e.g., ingestion of time-series data, replying to requests from an interactive application) - Another virtual data center for batch analytics (Spark, possibly for machine learning) If I understand this correctly, if I estimate that I need a five-node cluster to handle all of my data, then under the system described above I would have five nodes serving real-time traffic and all of the data replicated on another five nodes that I use for batch processing. Cassandra would take care of keeping the data synced between the two sets of five nodes. Is that correct? I assume the motivation for such a dual-virtual-data-center architecture is to prevent the Spark jobs (which are going to do lots of scans from Cassandra, and maybe run computation on the same machines hosting Cassandra) from disrupting the real-time performance. But doing so means that we need 2x as many nodes as we need for the real-time cluster alone. Could someone confirm that my interpretation above of what I read in the DSE documentation is correct? If my application needs to run analytics on Spark only a few hours a day, might we be better off spending our money on a bigger Cassandra cluster and then just spinning up Spark jobs on EMR for a few hours at night? (I know this is a hard question to answer, since it all depends on the application; just curious if anyone else here has had to make similar tradeoffs.) E.g., maybe instead of having a five-node real-time cluster, we would have an eight-node real-time cluster and use our remaining budget on EMR jobs. I am curious if anyone has any thoughts / experience about this. Best regards, Clint
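DuyHai's point about per-DC replication factors is expressed in the keyspace definition itself; a hedged sketch (keyspace and datacenter names here are made up for illustration):

```sql
-- RF 3 in the real-time DC, RF 2 in the analytics DC; the analytics
-- data center can therefore run fewer (or differently sized) nodes
-- than the real-time one.
CREATE KEYSPACE app_data
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'Realtime': 3,
    'Analytics': 2
  };
```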
Any notion of unions in C* user-defined types?
Hi all, I am building an application that keeps a time-series record of clickstream data (clicks, impressions, etc.). The data model looks something like:

CREATE TABLE clickstream (
  userid text,
  event_time timestamp,
  interaction frozen<interaction_type>,
  PRIMARY KEY (userid, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

I would like to create a user-defined type interaction_type such that it can be different depending on whether the interaction was a click, view, etc. Previously we encoded such data with Avro, using Avro's unions (http://avro.apache.org/docs/1.7.5/idl.html#unions), and stored the data as blobs. I was hoping to get away from blobs now that we have UDTs in Cassandra 2.1, but I don't see any support for unions. Does anyone have any suggestions? I think I may be better off just sticking with Avro serialization. :( Best regards, Clint
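One common workaround, since CQL has no union type, is a single UDT that carries a discriminator plus one nullable field per variant; the application only populates the field matching the discriminator. A sketch under that assumption (all type and field names here are illustrative, not from the original post):

```sql
CREATE TYPE click_details (target_url text, x int, y int);
CREATE TYPE view_details (duration_ms bigint);

-- Poor man's union: 'kind' says which variant field is set;
-- the unused variant fields are simply left null.
CREATE TYPE interaction_type (
  kind text,                        -- e.g. 'click' or 'view'
  click frozen<click_details>,
  view_info frozen<view_details>
);
```

This costs a little storage per null field and pushes the invariant (exactly one variant populated) into application code, which is roughly what the Avro blob approach enforced for free.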
Re: Why no virtual nodes for Cassandra on EC2?
… my understanding was that performance of Hadoop jobs on C* clusters with vnodes was poor because a given Hadoop input split has to run many individual scans (one for each vnode) rather than just a single scan. I've run C* and Hadoop in production with a custom input format that used vnodes (and just combined multiple vnodes in a single input split) and didn't have any issues (the jobs had many other performance bottlenecks besides starting multiple scans from C*). You've described the ticket, and how it has been solved :-) This is one of the videos where I recall an off-hand mention of the Spark connector working with vnodes: https://www.youtube.com/watch?v=1NtnrdIUlg0 Thanks. ~mck
Re: Why no virtual nodes for Cassandra on EC2?
Vnodes are officially discouraged for DSE Solr integration (though a small number isn't ruinous). That might be why they still don't enable them by default. On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote: At least the problem of Hadoop and vnodes described in CASSANDRA-6091 doesn't apply to Spark. (Spark already allows multiple token ranges per split.) If this is the reason why DSE hasn't enabled vnodes, then fingers crossed that'll change soon. Some of the DataStax videos that I watched discussed how the Cassandra Spark connector has optimizations to deal with vnodes. Are these videos public? If so, got any link to them? ~mck
Re: Why no virtual nodes for Cassandra on EC2?
DSE 4.6 improved Solr vnode performance dramatically, so vnodes for Search workloads are now no longer officially discouraged. As per the official doc on improvements: "Ability to use virtual nodes (vnodes) in Solr nodes. Recommended range: 64 to 256 (overhead increases by approximately 30%)". A vnode token count of 64 or 32 would reduce that overhead further. And... the new 4.6 feature of being able to direct a Solr query to a specific partition essentially eliminates that overhead entirely. -- Jack Krupansky On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com wrote: Vnodes are officially discouraged for DSE Solr integration (though a small number isn't ruinous). That might be why they still don't enable them by default.
Re: Why no virtual nodes for Cassandra on EC2?
Thanks for pointing out a mistake in the doc - that statement (for Search/Solr) was simply a leftover from before 4.6. Besides, it's in the Analytics section, which is not relevant for Search/Solr anyway. -- Jack Krupansky On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens migh...@gmail.com wrote: 30% overhead is pretty brutal. I think this is basic support for it, and not necessarily a recommendation to use it. From http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html?scroll=anaNdeOps__implicationsVnodes "DataStax does not recommend turning on vnodes for other Hadoop use cases or for Solr nodes, but you can use vnodes for any Cassandra-only cluster, or a Cassandra-only data center in a mixed Hadoop/Solr/Cassandra deployment. If you have enabled virtual nodes on Hadoop nodes, disable virtual nodes before using the cluster."
build failure with cassandra 2.0.12
Hi, I am experiencing a build failure with Cassandra 2.0.12. I downloaded the source from http://cassandra.apache.org/download/, ran "ant mvn-install" and got the following error:

[artifact:dependencies] ----------
[artifact:dependencies] 1 required artifact is missing.
[artifact:dependencies]
[artifact:dependencies] for artifact:
[artifact:dependencies]   org.apache.cassandra:cassandra-coverage-deps:jar:2.0.12-SNAPSHOT
[artifact:dependencies]
[artifact:dependencies] from the specified remote repositories:
[artifact:dependencies]   central (http://repo1.maven.org/maven2)
[artifact:dependencies]

BUILD FAILED
/Users/chengren/br/thirdparty/cassandra-2.0.12-br/build.xml:541: Unable to resolve artifact: Missing:
----------
1) com.sun:tools:jar:0

  Try downloading the file manually from the project website.

  Then, install it using the command:
      mvn install:install-file -DgroupId=com.sun -DartifactId=tools -Dversion=0 -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there:
      mvn deploy:deploy-file -DgroupId=com.sun -DartifactId=tools -Dversion=0 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency:
    1) org.apache.cassandra:cassandra-coverage-deps:jar:2.0.12-SNAPSHOT
    2) net.sourceforge.cobertura:cobertura:jar:2.0.3
    3) com.sun:tools:jar:0

----------
1 required artifact is missing.

for artifact:
  org.apache.cassandra:cassandra-coverage-deps:jar:2.0.12-SNAPSHOT

from the specified remote repositories:
  central (http://repo1.maven.org/maven2)

It seems really weird to me since I could build successfully yesterday. Did something change behind the scenes? Thanks!
Re: run cassandra on a small instance
You might also see some gains setting in_memory_compaction_limit_in_mb to something very low, to force Cassandra to use on-disk compaction rather than doing it in memory. Cool Ben.. thanks, I'll add that to my config as well. Glad that helped. Thanks for reporting back! No problem, Nate! That's the least I can do. All I can hope is that this thread adds to the overall fund of knowledge for the list. Cheers, Tim On Mon, Feb 23, 2015 at 11:46 AM, Nate McCall n...@thelastpickle.com wrote: Glad that helped. Thanks for reporting back! On Sun, Feb 22, 2015 at 9:12 PM, Tim Dunphy bluethu...@gmail.com wrote: Nate, Definitely thank you for this advice. After leaving the new Cassandra node running on the 2GB instance for the past couple of days, I think I've had ample reason to report complete success in getting it stabilized on that instance! Here are the changes I've been able to make. I think manipulating the key cache, concurrent writes, and some of the other settings I worked on based on that thread from the Cassandra list was definitely key in getting Cassandra to work on the new instance. Check out the before and after (before working / after working):

Before, in cassandra-env.sh:

MAX_HEAP_SIZE=800M
HEAP_NEWSIZE=200M

After:

MAX_HEAP_SIZE=512M
HEAP_NEWSIZE=100M

And before, in the cassandra.yaml file:

concurrent_writes: 32
compaction_throughput_mb_per_sec: 16
key_cache_size_in_mb:
key_cache_save_period: 14400
# native_transport_max_threads: 128

And after:

concurrent_writes: 2
compaction_throughput_mb_per_sec: 8
key_cache_size_in_mb: 4
key_cache_save_period: 0
native_transport_max_threads: 4

That really made the difference. I'm a puppet user, so these changes are in puppet. So any new 2GB instances I bring up on Digital Ocean should absolutely work the way the first 2GB node does there. But I was able to make enough sense of your chef recipe to adapt what you were showing me. Thanks again!
Tim On Fri, Feb 20, 2015 at 10:31 PM, Tim Dunphy bluethu...@gmail.com wrote: The most important things to note: - don't include JNA (it needs to lock pages larger than what will be available) - turn down threadpools for transports - turn compaction throughput way down - make concurrent reads and writes very small I have used the above to run healthy 5-node clusters locally in their own private network, with a 6th monitoring server, for light to moderate local testing in 16GB of laptop RAM. YMMV but it is possible. Thanks!! That was very helpful. I just tried applying your suggestions to my cassandra.yaml file. I used the info from your chef recipe. Well, like I've been saying, it typically takes about 5 hours or so for this situation to shake itself out. I'll provide an update to the list once I have a better idea of how this is working. Thanks again! Tim On Fri, Feb 20, 2015 at 9:37 PM, Nate McCall n...@thelastpickle.com wrote: I frequently test with multi-node vagrant-based clusters locally.
The following chef attributes should give you an idea of what to turn down in cassandra.yaml and cassandra-env.sh to build a decent testing cluster:

:cassandra => {
  'cluster_name' => 'VerifyCluster',
  'package_name' => 'dsc20',
  'version' => '2.0.11',
  'release' => '1',
  'setup_jna' => false,
  'max_heap_size' => '512M',
  'heap_new_size' => '100M',
  'initial_token' => server['initial_token'],
  'seeds' => '192.168.33.10',
  'listen_address' => server['ip'],
  'broadcast_address' => server['ip'],
  'rpc_address' => server['ip'],
  'concurrent_reads' => 2,
  'concurrent_writes' => 2,
  'memtable_flush_queue_size' => 2,
  'compaction_throughput_mb_per_sec' => 8,
  'key_cache_size_in_mb' => 4,
  'key_cache_save_period' => 0,
  'native_transport_min_threads' => 2,
  'native_transport_max_threads' => 4,
  'notify_restart' => true,
  'reporter' => {
    'riemann' => { 'enable' => true, 'host' => '192.168.33.51' },
    'graphite' => { 'enable' => true, 'host' => '192.168.33.51' }
  }
}

The most important things to note: - don't include JNA (it needs to lock pages larger than what will be available) - turn down threadpools for transports - turn compaction throughput way down - make concurrent reads and writes very small
Re: Why no virtual nodes for Cassandra on EC2?
That link is the one from the 4.6 New Features page: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html - Ability to use virtual nodes (vnodes) http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html#anaNdeOps__implicationsVnodes in Solr nodes. Recommended range: 64 to 256 (overhead increases by approximately 30%) Anyway, thanks for clearing this up, Jack. This overhead is on queries only, right? On Mon, Feb 23, 2015 at 10:03 AM, Jack Krupansky jack.krupan...@gmail.com wrote: Thanks for pointing out a mistake in the doc - that statement (for Search/Solr) was simply a leftover from before 4.6. Besides, it's in the Analytics section, which is not relevant for Search/Solr anyway. -- Jack Krupansky
Re: Why no virtual nodes for Cassandra on EC2?
Right, and subject to the techniques for reducing that overhead that I listed. In fact, I would recommend simply picking the largest number of tokens for which the overhead is acceptable for your app, even if it is only 8 or 16 tokens, though 16, 32, or 64 may be sufficient for most apps. -- Jack Krupansky On Mon, Feb 23, 2015 at 3:01 PM, Eric Stevens migh...@gmail.com wrote: That link is the one from the 4.6 New Features page: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html - Ability to use virtual nodes (vnodes) in Solr nodes. Recommended range: 64 to 256 (overhead increases by approximately 30%) Anyway, thanks for clearing this up, Jack. This overhead is on queries only, right?
memtable_offheap_space_in_mb and memtable_cleanup_threshold
Hi everyone! I have a write-only workload (into one column family) and am experimenting with offheap_objects memtable space. I set the parameters to:

memtable_offheap_space_in_mb: 51200  # 50GB
memtable_cleanup_threshold: 0.99

and expect that a flush will not be triggered until the used memtable offheap space reaches ~50GB. But flushes are triggered before that limit. The system monitor shows that only ~16GB is in use at that moment (linux + jvm + heap + ...). Why is this happening? -- Thanks, Serj
Problem with Cassandra 2.1 and Spark 1.2.1
Hi all, I'm trying to use Spark and Cassandra. I have two datacenters in different regions on AWS, and tried running a simple table-count program. However, I keep getting

WARN TaskSchedulerImpl: Initial job has not accepted any resources

and Spark can't finish the processing. The test table only has 571 rows and 2 small columns, so I assume it doesn't require a lot of memory. I also tried increasing cores and RAM in the Spark config files, but the result is still the same.

scala> import com.datastax.spark.connector._
import com.datastax.spark.connector._

scala> import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.{SparkContext, SparkConf}

scala> val conf = new SparkConf(true).set("spark.cassandra.connection.host", "172.17.10.44").set("spark.cassandra.auth.username", "masteruser").set("spark.cassandra.auth.password", "password")
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@1cfffdf3

scala> val sc = new SparkContext("spark://172.17.10.182:7077", "test", conf)
15/02/23 21:56:21 INFO SecurityManager: Changing view acls to: root
15/02/23 21:56:21 INFO SecurityManager: Changing modify acls to: root
15/02/23 21:56:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/02/23 21:56:21 INFO Slf4jLogger: Slf4jLogger started
15/02/23 21:56:21 INFO Remoting: Starting remoting
15/02/23 21:56:21 INFO Utils: Successfully started service 'sparkDriver' on port 41709.
15/02/23 21:56:21 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ip-172-17-10-182:41709] 15/02/23 21:56:21 INFO SparkEnv: Registering MapOutputTracker 15/02/23 21:56:21 INFO SparkEnv: Registering BlockManagerMaster 15/02/23 21:56:21 INFO DiskBlockManager: Created local directory at /srv/spark/tmp/spark-9f50ea1b-e8eb-4cb8-8f48-d04e3ec525a2/spark-61a2d7fa-697e-4a61-80af-c3d72149f244 15/02/23 21:56:21 INFO MemoryStore: MemoryStore started with capacity 534.5 MB 15/02/23 21:56:21 INFO HttpFileServer: HTTP File server directory is /srv/spark/tmp/spark-1c34ed81-1ea9-45b1-81dd-184f12b975f6/spark-7c001536-1b70-40ea-9013-14551ad05a29 15/02/23 21:56:21 INFO HttpServer: Starting HTTP Server 15/02/23 21:56:21 INFO Utils: Successfully started service 'HTTP file server' on port 51439. 15/02/23 21:56:21 INFO Utils: Successfully started service 'SparkUI' on port 4040. 15/02/23 21:56:21 INFO SparkUI: Started SparkUI at http://52.10.105.190:4040 15/02/23 21:56:21 INFO SparkContext: Added JAR file:/home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar at http://172.17.10.182:51439/jars/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar with timestamp 1424728581916 15/02/23 21:56:21 INFO AppClient$ClientActor: Connecting to master spark://172.17.10.182:7077... 
15/02/23 21:56:21 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20150223215621-0010 15/02/23 21:56:21 INFO NettyBlockTransferService: Server created on 45474 15/02/23 21:56:21 INFO BlockManagerMaster: Trying to register BlockManager 15/02/23 21:56:21 INFO BlockManagerMasterActor: Registering block manager ip-172-17-10-182:45474 with 534.5 MB RAM, BlockManagerId(driver, ip-172-17-10-182, 45474) 15/02/23 21:56:21 INFO BlockManagerMaster: Registered BlockManager 15/02/23 21:56:22 INFO AppClient$ClientActor: Executor added: app-20150223215621-0010/0 on worker-20150223191054-ip-172-17-10-45-9000 (ip-172-17-10-45:9000) with 2 cores 15/02/23 21:56:22 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150223215621-0010/0 on hostPort ip-172-17-10-45:9000 with 2 cores, 512.0 MB RAM 15/02/23 21:56:22 INFO AppClient$ClientActor: Executor added: app-20150223215621-0010/1 on worker-20150223191054-ip-172-17-10-47-9000 (ip-172-17-10-47:9000) with 2 cores 15/02/23 21:56:22 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150223215621-0010/1 on hostPort ip-172-17-10-47:9000 with 2 cores, 512.0 MB RAM 15/02/23 21:56:22 INFO AppClient$ClientActor: Executor added: app-20150223215621-0010/2 on worker-20150223191055-ip-172-17-10-46-9000 (ip-172-17-10-46:9000) with 2 cores 15/02/23 21:56:22 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150223215621-0010/2 on hostPort ip-172-17-10-46:9000 with 2 cores, 512.0 MB RAM 15/02/23 21:56:22 INFO AppClient$ClientActor: Executor added: app-20150223215621-0010/3 on worker-20150223191051-ip-172-17-10-44-9000 (ip-172-17-10-44:9000) with 2 cores 15/02/23 21:56:22 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150223215621-0010/3 on hostPort
Re: AMI to use to launch a cluster with OpsCenter on AWS
Regarding AWS, the only thing I normally do (besides the normal installation, etc.) is set up the firewall zones so the ports needed for Cassandra are open. You can follow this guide: https://razvantudorica.com/02/create-a-cassandra-cluster-with-opscenter-on-amazon-ec2/a Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Sat, Feb 21, 2015 at 4:48 AM, Clint Kelly clint.ke...@gmail.com wrote: BTW I was able to use this script: https://github.com/joaquincasares/cassandralauncher to get a cluster up and running pretty easily on AWS. Cheers to the author for this. Still curious for answers to my questions above, but it's not as urgent. Best regards, Clint On Fri, Feb 20, 2015 at 5:36 PM, Clint Kelly clint.ke...@gmail.com wrote: Hi all, I am trying to follow the instructions here for installing DSE 4.6 on AWS: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMIOpsc.html I was successful creating a single-node instance running OpsCenter, which I intended to use to bootstrap a larger cluster running Cassandra and Spark. During my first attempt, however, OpsCenter reported problems talking to agents in the new cluster I was creating. I ssh'ed into one of the new instances that I created with OpsCenter and saw that this was the problem:

DataStax AMI for DataStax Enterprise and DataStax Community
AMI version 2.4
DataStax AMI 2.5 released 02.25.2014 http://goo.gl/g1RRd7
This AMI (version 2.4) will be left available, but no longer updated.

These notices occurred during the startup of this instance:
[ERROR] 02/21/15-00:53:01 sudo chown -R cassandra:cassandra /mnt/cassandra:
[WARN] Permissions not set correctly.
Please run manually:
[WARN] sudo chown -hR cassandra:cassandra /mnt/cassandra
[WARN] sudo service dse restart

It looks like by default, the OpsCenter GUI selects an out-of-date AMI (ami-4c32ba7c) when you click on Create Cluster and attempt to create a brand-new cluster on EC2. What is the recommended image to use here? I found a version 2.5.1 of the autoclustering AMI (http://thecloudmarket.com/image/ami-ada2b6c4--datastax-auto-clustering-ami-2-5-1-hvm). Is that correct? Or should I be using one of the regular AMIs listed at http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMIOpsc.html ? Or just a standard Ubuntu image?

FWIW, I tried just using one of the AMIs listed on the DSE 4.6 page (ami-32f7c977), and I still see the "Waiting for the agent to start" message, although if I log in, things look like they have kind of worked:

Cluster started with these options: None
Raiding complete. Waiting for nodetool...
The cluster is now in it's finalization phase. This should only take a moment...
Note: You can also use CTRL+C to view the logs if desired:
AMI log: ~/datastax_ami/ami.log
Cassandra log: /var/log/cassandra/system.log

Datacenter: us-west-2
=====================
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.28.24.19  51.33 KB  256     100.0%            ce0d365a-d58b-4700-b861-9f30af400476  2a

Opscenter: http://ec2-54-71-102-180.us-west-2.compute.amazonaws.com:/
Please wait 60 seconds if this is the cluster's first start...
Tools: Run: datastax_tools
Demos: Run: datastax_demos
Support: Run: datastax_support

DataStax AMI for DataStax Enterprise and DataStax Community AMI version 2.5
DataStax Community version 2.1.3-1
These notices occurred during the startup of this instance:
[ERROR] 02/21/15-01:18:55 sudo chown opscenter-agent:opscenter-agent /var/lib/datastax-agent/conf:
[ERROR] 02/21/15-01:19:04 sudo chown -R opscenter-agent:opscenter-agent /var/log/datastax-agent:
[ERROR] 02/21/15-01:19:04 sudo chown -R opscenter-agent:opscenter-agent /mnt/datastax-agent:

I would appreciate any help... I assume what I'm trying to do here is pretty common.
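The [ERROR] chown lines above suggest the ownership fix never took effect on these instances. Before restarting DSE or the agent, it can help to verify what actually owns the data directories. A minimal sketch of such a check (the paths and user names are the ones mentioned in the AMI notices; adjust to your own layout):

```python
import os
import pwd

def owner_of(path):
    """Return the user name that owns `path`."""
    return pwd.getpwuid(os.stat(path).st_uid).pw_name

def paths_not_owned_by(root, user):
    """Walk `root` and collect every path whose owner differs from `user`."""
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for p in [dirpath] + [os.path.join(dirpath, f) for f in files]:
            if owner_of(p) != user:
                bad.append(p)
    return bad

if __name__ == "__main__":
    # Paths and expected owners taken from the startup notices above.
    for root, user in [("/mnt/cassandra", "cassandra"),
                       ("/var/lib/datastax-agent/conf", "opscenter-agent")]:
        if os.path.isdir(root):
            print(root, "->", paths_not_owned_by(root, user) or "OK")
```

If this reports stray paths after running the suggested `sudo chown -hR cassandra:cassandra /mnt/cassandra`, the chown itself is failing (e.g. the user does not exist yet), which would explain the agent never coming up.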
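Carlos's firewall advice at the top of the thread amounts to opening a known set of ports between the nodes. The numbers below are the usual defaults for Cassandra and OpsCenter of this era (verify them against your own cassandra.yaml and OpsCenter configuration); the helper merely shapes them into EC2-style ingress rules, and the CIDR is a placeholder:

```python
# Default ports typically opened for a Cassandra + OpsCenter cluster.
CASSANDRA_PORTS = {
    7000: "inter-node storage (gossip)",
    7001: "inter-node storage over SSL",
    7199: "JMX monitoring",
    9042: "CQL native transport",
    9160: "Thrift client API",
    8888: "OpsCenter web UI",
    61620: "OpsCenter daemon",
    61621: "DataStax agents",
}

def ingress_rules(cidr="10.0.0.0/16"):
    """Build EC2-style security-group ingress rules for the ports above."""
    return [
        {"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
         "IpRanges": [{"CidrIp": cidr, "Description": desc}]}
        for port, desc in sorted(CASSANDRA_PORTS.items())
    ]
```

Restricting the CIDR to the cluster's own subnet (rather than 0.0.0.0/0) keeps the storage and JMX ports off the public internet.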
Re: One node taking more resources than others in the ring
If you're not using prepared statements you won't get any token aware routing. That's an even better option than round robin since it reduces the number of nodes involved. On Mon, Feb 23, 2015 at 4:48 PM Robert Coli rc...@eventbrite.com wrote: On Mon, Feb 23, 2015 at 3:42 PM, Jaydeep Chovatia chovatia.jayd...@gmail.com wrote: I have created different tables and my test application reads/writes with CL=QUORUM. Under load I found that my one node is taking more resources (double CPU) than the other two. I have also verified that there is no other process causing this problem. My bold prediction is that you are sending all client connections to this node. Don't do that, round-robin them. =Rob
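The routing idea here can be sketched with a toy model (this is an illustration, not the actual driver API): round-robin rotates the coordinator role across every node regardless of who owns the data, while token-aware routing hashes the partition key and contacts an owning replica directly, so fewer nodes touch each request.

```python
import zlib
from itertools import cycle

NODES = ["node1", "node2", "node3"]  # toy 3-node ring

# Round-robin: each request goes to the next node in turn, which
# must then coordinate with whichever replica owns the key.
_rr = cycle(NODES)
def round_robin_target(_key):
    return next(_rr)

# Token-aware (toy): hash the partition key to pick the owning
# replica and send the request straight to it. Real drivers can
# only do this for prepared statements, because preparing is what
# tells the driver which bind variables form the partition key.
def token_aware_target(key):
    return NODES[zlib.crc32(key.encode()) % len(NODES)]
```

With token-aware routing the same key always lands on the same node, so requests skip the extra coordinator hop; without prepared statements the driver has no way to compute the token and falls back to the round-robin behaviour.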
Efficient .net client for cassandra
Hi All, We have been able to find our case-specific full text search, which we are analyzing using Stratio Cassandra. It has a modified secondary index API which uses Lucene indices. The performance also seems good to me. Still I wanted to ask you gurus: 1) Has anybody used Stratio, and are there any drawbacks to it? 2) We are using .NET as the client to extract data, which lacks performance. I am using traditional connection pooling and then executing the prepared statement. So anybody who is using any specific client for .NET would help me on this. Thanks in advance for the help. Thanks and Regards, Asit
Re: One node taking more resources than others in the ring
On Mon, Feb 23, 2015 at 3:42 PM, Jaydeep Chovatia chovatia.jayd...@gmail.com wrote: I have created different tables and my test application reads/writes with CL=QUORUM. Under load I found that my one node is taking more resources (double CPU) than the other two. I have also verified that there is no other process causing this problem. My bold prediction is that you are sending all client connections to this node. Don't do that, round-robin them. =Rob
Re: One node taking more resources than others in the ring
On Mon, Feb 23, 2015 at 5:18 PM, Jonathan Haddad j...@jonhaddad.com wrote: If you're not using prepared statements you won't get any token aware routing. That's an even better option than round robin since it reduces the number of nodes involved. Fair statement. Thrust of my comment is don't send all connections to that node. :D =Rob
One node taking more resources than others in the ring
Hi, I have a three node cluster with RF=1 (only one datacenter) with the following size:

Datacenter: DC1
===============
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
--  Address  Load     Tokens  Owns   Host ID  Rack
UN  IP1      4.02 GB  1       33.3%  ID1      RAC1
UN  IP2      4.05 GB  1       33.3%  ID2      RAC2
UN  IP3      4.05 GB  1       33.3%  ID3      RAC3

I have created different tables and my test application reads/writes with CL=QUORUM. Under load I found that one node is taking more resources (double the CPU) than the other two. I have also verified that there is no other process causing this problem. My hardware configuration on all nodes is the same: Linux, 64-bit, 24 cores, 64GB RAM, 1TB disk. My Cassandra version is 2.0 and JDK 1.7. Jaydeep
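One detail worth noting about the consistency level in this setup: QUORUM requires floor(RF/2) + 1 replicas to respond, so with RF=1 it degenerates to a single replica per operation. A quick sketch of the arithmetic:

```python
def quorum(rf):
    """Replicas that must respond for CL=QUORUM: floor(rf / 2) + 1."""
    return rf // 2 + 1

# With RF=1, as in the cluster above, QUORUM behaves like ONE:
# every read/write is served by the single replica owning the key.
assert quorum(1) == 1
assert quorum(2) == 2
assert quorum(3) == 2
```

Since each node here also owns just one token (Tokens=1), a skewed distribution of partition keys in the test workload would concentrate all replica work on one node's token range; that is another possible cause of the hot node besides the connection imbalance suggested in the replies.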