Re: Why no virtual nodes for Cassandra on EC2?
Hi mck,

I'm not familiar with this ticket, but my understanding was that performance of Hadoop jobs on C* clusters with vnodes was poor because a given Hadoop input split has to run many individual scans (one for each vnode) rather than just a single scan. I've run C* and Hadoop in production with a custom input format that used vnodes (and just combined multiple vnodes in a single input split) and didn't have any issues (the jobs had many other performance bottlenecks besides starting multiple scans from C*).

This is one of the videos where I recall an off-hand mention of the Spark connector working with vnodes: https://www.youtube.com/watch?v=1NtnrdIUlg0

Best regards,
Clint

On Sat, Feb 21, 2015 at 2:58 PM, mck m...@apache.org wrote:
> At least the problem of Hadoop and vnodes described in CASSANDRA-6091 doesn't apply to Spark. (Spark already allows multiple token ranges per split.) If this is the reason why DSE hasn't enabled vnodes then fingers crossed that'll change soon.
>
> > Some of the DataStax videos that I watched discussed how the Cassandra Spark connector has optimizations to deal with vnodes.
>
> Are these videos public? If so, got any link to them?
>
> ~mck
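A minimal sketch of the idea described in the message above, where several vnode token ranges are packed into one input split so that a task covers many small ranges instead of needing one split per range. This is plain Python, not a real Hadoop InputFormat; `TokenRange`, `combine_ranges`, and `ranges_per_split` are hypothetical names, and the host names and token values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class TokenRange(NamedTuple):
    start: int
    end: int
    host: str  # replica assumed to own this range


def combine_ranges(ranges: List[TokenRange], ranges_per_split: int) -> List[List[TokenRange]]:
    """Group token ranges by owning host, then pack them into splits of bounded size."""
    by_host: Dict[str, List[TokenRange]] = defaultdict(list)
    for r in ranges:
        by_host[r.host].append(r)

    splits: List[List[TokenRange]] = []
    for host in sorted(by_host):
        host_ranges = sorted(by_host[host])
        # Each split holds at most ranges_per_split ranges, all local to one host.
        for i in range(0, len(host_ranges), ranges_per_split):
            splits.append(host_ranges[i:i + ranges_per_split])
    return splits


# Eight illustrative ranges alternating between two hosts.
ranges = [TokenRange(i * 100, (i + 1) * 100, f"host{i % 2}") for i in range(8)]
splits = combine_ranges(ranges, ranges_per_split=4)
for split in splits:
    print(split[0].host, [(r.start, r.end) for r in split])
```

Keeping every range in a split on the same host is one way to preserve data locality; the number of scans per task still grows with `ranges_per_split`, which is the cost discussed in this thread.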
Re: Why no virtual nodes for Cassandra on EC2?
> … my understanding was that performance of Hadoop jobs on C* clusters with vnodes was poor because a given Hadoop input split has to run many individual scans (one for each vnode) rather than just a single scan. I've run C* and Hadoop in production with a custom input format that used vnodes (and just combined multiple vnodes in a single input split) and didn't have any issues (the jobs had many other performance bottlenecks besides starting multiple scans from C*).

You've described the ticket, and how it has been solved :-)

> This is one of the videos where I recall an off-hand mention of the Spark connector working with vnodes: https://www.youtube.com/watch?v=1NtnrdIUlg0

Thanks.

~mck
Re: Why no virtual nodes for Cassandra on EC2?
Vnodes are officially discouraged for DSE Solr integration (though a small number isn't ruinous). That might be why they still don't enable them by default.

On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:
> At least the problem of Hadoop and vnodes described in CASSANDRA-6091 doesn't apply to Spark. (Spark already allows multiple token ranges per split.) If this is the reason why DSE hasn't enabled vnodes then fingers crossed that'll change soon.
>
> > Some of the DataStax videos that I watched discussed how the Cassandra Spark connector has optimizations to deal with vnodes.
>
> Are these videos public? If so, got any link to them?
>
> ~mck
Re: Why no virtual nodes for Cassandra on EC2?
DSE 4.6 improved Solr vnode performance dramatically, so vnodes for Search workloads are no longer officially discouraged. As per the official doc for improvements:

*Ability to use virtual nodes (vnodes) in Solr nodes. Recommended range: 64 to 256 (overhead increases by approximately 30%)*.

A vnode token count of 64 or 32 would reduce that overhead further. And... the new 4.6 feature of being able to direct a Solr query to a specific partition essentially eliminates that overhead entirely.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com wrote:
> Vnodes are officially discouraged for DSE Solr integration (though a small number isn't ruinous). That might be why they still don't enable them by default.
Re: Why no virtual nodes for Cassandra on EC2?
Thanks for pointing out a mistake in the doc - that statement (for Search/Solr) was simply a leftover from before 4.6. Besides, it's in the Analytics section, which is not relevant for Search/Solr anyway.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens migh...@gmail.com wrote:
> 30% overhead is pretty brutal. I think this is basic support for it, and not necessarily a recommendation to use it.
>
> From http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html?scroll=anaNdeOps__implicationsVnodes
>
> *DataStax does not recommend turning on vnodes* for other Hadoop use cases *or for Solr nodes*, but you can use vnodes for any Cassandra-only cluster, or a Cassandra-only data center in a mixed Hadoop/Solr/Cassandra deployment. If you have enabled virtual nodes on Hadoop nodes, disable virtual nodes before using the cluster.
>
> On Mon, Feb 23, 2015 at 9:34 AM, Jack Krupansky jack.krupan...@gmail.com wrote:
> > DSE 4.6 improved Solr vnode performance dramatically, so vnodes for Search workloads are no longer officially discouraged. As per the official doc for improvements:
> >
> > *Ability to use virtual nodes (vnodes) in Solr nodes. Recommended range: 64 to 256 (overhead increases by approximately 30%)*.
> >
> > A vnode token count of 64 or 32 would reduce that overhead further. And... the new 4.6 feature of being able to direct a Solr query to a specific partition essentially eliminates that overhead entirely.
> >
> > -- Jack Krupansky
Re: Why no virtual nodes for Cassandra on EC2?
That link is the one from the 4.6 New Features page: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html

- Ability to use virtual nodes (vnodes) http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html#anaNdeOps__implicationsVnodes in Solr nodes. Recommended range: 64 to 256 (overhead increases by approximately 30%)

Anyway, thanks for clearing this up Jack. This overhead is on queries only, right?

On Mon, Feb 23, 2015 at 10:03 AM, Jack Krupansky jack.krupan...@gmail.com wrote:
> Thanks for pointing out a mistake in the doc - that statement (for Search/Solr) was simply a leftover from before 4.6. Besides, it's in the Analytics section, which is not relevant for Search/Solr anyway.
>
> -- Jack Krupansky
Re: Why no virtual nodes for Cassandra on EC2?
Right, and subject to the techniques for reducing that overhead that I listed. In fact, I would recommend simply picking the largest number of tokens for which the overhead is acceptable for your app, even if it is only 8 or 16 tokens, but 16, 32, or 64 may be sufficient for most apps.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 3:01 PM, Eric Stevens migh...@gmail.com wrote:
> That link is the one from the 4.6 New Features page: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html
>
> - Ability to use virtual nodes (vnodes) http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html#anaNdeOps__implicationsVnodes in Solr nodes. Recommended range: 64 to 256 (overhead increases by approximately 30%)
>
> Anyway, thanks for clearing this up Jack. This overhead is on queries only, right?
Re: Why no virtual nodes for Cassandra on EC2?
At least the problem of Hadoop and vnodes described in CASSANDRA-6091 doesn't apply to Spark. (Spark already allows multiple token ranges per split.) If this is the reason why DSE hasn't enabled vnodes then fingers crossed that'll change soon.

> Some of the DataStax videos that I watched discussed how the Cassandra Spark connector has optimizations to deal with vnodes.

Are these videos public? If so, got any link to them?

~mck
Re: Why no virtual nodes for Cassandra on EC2?
Hey Clint,

Someone from DataStax can correct me here, but I'm assuming that they have disabled vnodes because the AMI is built to make it easy to set up a pre-configured mixed workload cluster: a mixture of Real-Time/Transactional (Cassandra), Analytics (Hadoop), or Search (Solr). If you take a look at the getting started guides for both Hadoop and Solr you will see a paragraph instructing the user to disable vnodes for a mixed workload cluster.

http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchIntro.html
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaStrt.html

This is specific to the example AMI and that type of workload. This is by no means a warning for users to disable vnodes on their Real-Time/Transactional Cassandra-only clusters on EC2. I've used vnodes on EC2 without issue.

Regards,
Mark

On 20 February 2015 at 05:08, Clint Kelly clint.ke...@gmail.com wrote:
> Hi all,
>
> The guide for installing Cassandra on EC2 says that
>
> Note: The DataStax AMI does not install DataStax Enterprise nodes with virtual nodes enabled.
>
> http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html
>
> Just curious why this is the case. It was my understanding that virtual nodes make taking Cassandra nodes on and offline an easier process, and that seems like something that an EC2 user would want to do quite frequently.
>
> -Clint
Re: Why no virtual nodes for Cassandra on EC2?
BTW are the performance concerns with vnodes a big deal for Spark? Or were those more important for MapReduce? Some of the DataStax videos that I watched discussed how the Cassandra Spark connector has optimizations to deal with vnodes. I would imagine that Spark's ability to cache RDDs would mean that paying a small efficiency cost when reading data out of Cassandra initially might not be the end of the world (especially given the benefits of using vnodes).

On Fri, Feb 20, 2015 at 8:29 AM, Clint Kelly clint.ke...@gmail.com wrote:
> Hi Mark,
>
> Thanks for your reply. That makes sense. I recall looking at this back when we were going to run Hadoop against data in Cassandra tables at my previous company. Disabling virtual nodes seems unfortunate as it would make (as I understand it) scaling the cluster a lot trickier. I assume there is a tradeoff between the performance of analytics jobs and the ease with which you can change cluster size.
>
> -Clint
Re: Why no virtual nodes for Cassandra on EC2?
Hi Mark,

Thanks for your reply. That makes sense. I recall looking at this back when we were going to run Hadoop against data in Cassandra tables at my previous company. Disabling virtual nodes seems unfortunate as it would make (as I understand it) scaling the cluster a lot trickier. I assume there is a tradeoff between the performance of analytics jobs and the ease with which you can change cluster size.

-Clint

On Fri, Feb 20, 2015 at 1:01 AM, Mark Reddy mark.l.re...@gmail.com wrote:
> Hey Clint,
>
> Someone from DataStax can correct me here, but I'm assuming that they have disabled vnodes because the AMI is built to make it easy to set up a pre-configured mixed workload cluster: a mixture of Real-Time/Transactional (Cassandra), Analytics (Hadoop), or Search (Solr). If you take a look at the getting started guides for both Hadoop and Solr you will see a paragraph instructing the user to disable vnodes for a mixed workload cluster.
>
> http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchIntro.html
> http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaStrt.html
>
> This is specific to the example AMI and that type of workload. This is by no means a warning for users to disable vnodes on their Real-Time/Transactional Cassandra-only clusters on EC2. I've used vnodes on EC2 without issue.
>
> Regards,
> Mark
Why no virtual nodes for Cassandra on EC2?
Hi all,

The guide for installing Cassandra on EC2 says that

"Note: The DataStax AMI does not install DataStax Enterprise nodes with virtual nodes enabled."

http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

Just curious why this is the case. It was my understanding that virtual nodes make taking Cassandra nodes on and offline an easier process, and that seems like something that an EC2 user would want to do quite frequently.

-Clint
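To illustrate the point in the question above about vnodes making it easier to bring nodes on and offline, here is a toy simulation of a token ring. It is a sketch under loud assumptions: tokens are assigned at random in a simplified 0 to 2**64 space, replication is ignored, and `build_ring`, `donors_for_new_node`, and the node names are hypothetical; none of this is Cassandra's actual token allocation code. It only counts how many existing nodes hand over a range when a new node joins.

```python
import bisect
import random

RING_SIZE = 2**64  # simplified token space, chosen for illustration only


def build_ring(nodes, num_tokens, rng):
    """Give each node num_tokens random tokens; return a sorted list of (token, node)."""
    ring = [(rng.randrange(RING_SIZE), node) for node in nodes for _ in range(num_tokens)]
    return sorted(ring)


def donors_for_new_node(ring, num_tokens, rng):
    """Return the set of existing nodes that give up part of a range to a joining node."""
    tokens = [token for token, _ in ring]
    donors = set()
    for _ in range(num_tokens):
        new_token = rng.randrange(RING_SIZE)
        # A new token splits the range owned by the next token clockwise (wrapping around).
        idx = bisect.bisect_left(tokens, new_token) % len(ring)
        donors.add(ring[idx][1])
    return donors


rng = random.Random(42)
nodes = ["node1", "node2", "node3", "node4"]

# One token per node: the joining node splits exactly one existing range.
single_donors = donors_for_new_node(build_ring(nodes, 1, rng), 1, rng)

# 256 tokens per node: the joining node takes many small ranges spread over the ring.
vnode_donors = donors_for_new_node(build_ring(nodes, 256, rng), 256, rng)

print("single token donors:", sorted(single_donors))
print("vnode donors:", sorted(vnode_donors))
```

With a single token per node, the streaming load of a join falls on one neighbour; with many tokens per node it is spread across the cluster, which is the operational benefit the question refers to.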