Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Clint Kelly
Hi mck,

I'm not familiar with this ticket, but my understanding was that
performance of Hadoop jobs on C* clusters with vnodes was poor because a
given Hadoop input split has to run many individual scans (one for each
vnode) rather than just a single scan.  I've run C* and Hadoop in
production with a custom input format that used vnodes (and just combined
multiple vnodes in a single input split) and didn't have any issues (the
jobs had many other performance bottlenecks besides starting multiple scans
from C*).
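
For anyone following along, here is a rough sketch of the "combine multiple vnodes in a single input split" idea. It is not the custom input format mentioned above (that code isn't shown in this thread). The function name, the `(start, end)` tuple representation of a token range, and the node addresses are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of a vnode token range: (start_token, end_token).
TokenRange = Tuple[int, int]


def combine_ranges_into_splits(
    ranges_by_node: Dict[str, List[TokenRange]],
    ranges_per_split: int,
) -> List[Tuple[str, List[TokenRange]]]:
    """Group each node's token ranges into splits of up to `ranges_per_split`.

    Every split holds ranges owned by a single node, so a task reading the
    split can stay local to that node while still covering many vnodes.
    """
    if ranges_per_split < 1:
        raise ValueError("ranges_per_split must be at least 1")

    splits: List[Tuple[str, List[TokenRange]]] = []
    for node, ranges in ranges_by_node.items():
        ordered = sorted(ranges)
        # Chunk the node's ranges; the last chunk may be smaller.
        for i in range(0, len(ordered), ranges_per_split):
            splits.append((node, ordered[i:i + ranges_per_split]))
    return splits


if __name__ == "__main__":
    # Made-up ring layout: two nodes, five vnode ranges in total.
    layout = {
        "10.0.0.1": [(0, 10), (40, 50), (20, 30)],
        "10.0.0.2": [(10, 20), (30, 40)],
    }
    for node, ranges in combine_ranges_into_splits(layout, ranges_per_split=2):
        print(node, ranges)
```

With one range per split you get one scan per vnode, which is the slow case; raising `ranges_per_split` trades fewer, larger splits against parallelism.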

This is one of the videos where I recall an off-hand mention of the Spark
connector working with vnodes: https://www.youtube.com/watch?v=1NtnrdIUlg0

Best regards,
Clint




On Sat, Feb 21, 2015 at 2:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the Cassandra
  Spark connector has optimizations to deal with vnodes.


 Are these videos public? if so got any link to them?

 ~mck



Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread mck

 … my understanding was that
 performance of Hadoop jobs on C* clusters with vnodes was poor because a
 given Hadoop input split has to run many individual scans (one for each
 vnode) rather than just a single scan.  I've run C* and Hadoop in
 production with a custom input format that used vnodes (and just combined
 multiple vnodes in a single input split) and didn't have any issues (the
 jobs had many other performance bottlenecks besides starting multiple
 scans from C*).

You've described the ticket, and how it has been solved :-)

 This is one of the videos where I recall an off-hand mention of the Spark
 connector working with vnodes:
 https://www.youtube.com/watch?v=1NtnrdIUlg0

Thanks.

~mck


Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Eric Stevens
Vnodes are officially discouraged for DSE Solr integration (though a
small number isn't ruinous). That might be why they still don't enable them
by default.
On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the Cassandra
  Spark connector has optimizations to deal with vnodes.


 Are these videos public? if so got any link to them?

 ~mck



Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Jack Krupansky
DSE 4.6 improved Solr vnode performance dramatically, so vnodes for
Search workloads are no longer officially discouraged. As per the
official doc for improvements: *Ability to use virtual nodes (vnodes) in
Solr nodes. Recommended range: 64 to 256 (overhead increases by
approximately 30%)*. A vnode token count of 64 or 32 would reduce that
overhead further. And... the new 4.6 feature of being able to direct a Solr
query to a specific partition essentially eliminates that overhead entirely.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com wrote:

 Vnodes are officially discouraged for DSE Solr integration (though a
 small number isn't ruinous). That might be why they still don't enable them
 by default.




Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Jack Krupansky
Thanks for pointing out a mistake in the doc - that statement (for
Search/Solr) was simply a leftover from before 4.6. Besides, it's in the
Analytics section, which is not relevant for Search/Solr anyway.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens migh...@gmail.com wrote:

 30% overhead is pretty brutal.  I think this is basic support for it, and
 not necessarily a recommendation to use it.

 From

 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html?scroll=anaNdeOps__implicationsVnodes

 *DataStax does not recommend turning on vnodes* for other Hadoop use
 cases *or for Solr nodes*, but you can use vnodes for any Cassandra-only
 cluster, or a Cassandra-only data center in a mixed Hadoop/Solr/Cassandra
 deployment. If you have enabled virtual nodes on Hadoop nodes, disable
 virtual nodes before using the cluster.








Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Eric Stevens
That link is the one from the 4.6 New Features page:
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html

   - Ability to use virtual nodes (vnodes)
     http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html#anaNdeOps__implicationsVnodes
     in Solr nodes. Recommended range: 64 to 256 (overhead increases by
     approximately 30%)

Anyway, thanks for clearing this up Jack.  This overhead is on queries
only, right?



On Mon, Feb 23, 2015 at 10:03 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Thanks for pointing out a mistake in the doc - that statement (for
 Search/Solr) was simply a leftover from before 4.6. Besides, it's in the
 Analytics section, which is not relevant for Search/Solr anyway.

 -- Jack Krupansky








Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Jack Krupansky
Right, and subject to techniques for reducing that overhead that I listed.
In fact, I would recommend simply picking the largest number of tokens for
which the overhead is acceptable for your app, even if it is only 8 or 16
tokens, but 16, 32, or 64 may be sufficient for most apps.
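
To make that rule of thumb concrete, here is a minimal sketch that picks the largest token count whose overhead you measured as acceptable. The function name and the numbers in the example are placeholders, not benchmark results; you would substitute overheads measured for your own app.

```python
from typing import Dict, Optional


def pick_num_tokens(
    overhead_by_tokens: Dict[int, float],
    max_overhead: float,
) -> Optional[int]:
    """Return the largest token count whose overhead is <= max_overhead.

    `overhead_by_tokens` maps a candidate token count to the fractional
    query overhead you measured for it (0.10 means 10%). Returns None if
    no candidate is acceptable.
    """
    acceptable = [
        tokens
        for tokens, overhead in overhead_by_tokens.items()
        if overhead <= max_overhead
    ]
    return max(acceptable) if acceptable else None


if __name__ == "__main__":
    # Placeholder values for illustration only; measure your own workload.
    measured = {8: 0.03, 16: 0.06, 32: 0.12, 64: 0.20, 256: 0.30}
    print(pick_num_tokens(measured, max_overhead=0.15))
```

The chosen value is what you would then set as the per-node token count in your cluster configuration.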

-- Jack Krupansky

On Mon, Feb 23, 2015 at 3:01 PM, Eric Stevens migh...@gmail.com wrote:

 That link is the one from the 4.6 New Features page:
 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html

 - Ability to use virtual nodes (vnodes)
   http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html#anaNdeOps__implicationsVnodes
   in Solr nodes. Recommended range: 64 to 256 (overhead increases by
   approximately 30%)

 Anyway, thanks for clearing this up Jack.  This overhead is on queries
 only, right?











Re: Why no virtual nodes for Cassandra on EC2?

2015-02-21 Thread mck
At least the problem of hadoop and vnodes described in CASSANDRA-6091
doesn't apply to spark.
 (Spark already allows multiple token ranges per split).

If this is the reason why DSE hasn't enabled vnodes then fingers crossed
that'll change soon.


 Some of the DataStax videos that I watched discussed how the Cassandra Spark
 connector has optimizations to deal with vnodes.


Are these videos public? If so, got any links to them?

~mck 


Re: Why no virtual nodes for Cassandra on EC2?

2015-02-20 Thread Mark Reddy
Hey Clint,

Someone from DataStax can correct me here, but I'm assuming that they have
disabled vnodes because the AMI is built to make it easy to set up a
pre-configured mixed-workload cluster: a mixture of Real-Time/Transactional
(Cassandra), Analytics (Hadoop), or Search (Solr). If you take a look at the
getting started guides for both Hadoop and Solr, you will see a paragraph
instructing the user to disable vnodes for a mixed-workload cluster.

http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchIntro.html
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaStrt.html

This is specific to the example AMI and that type of workload. It is by
no means a warning for users to disable vnodes on their
Real-Time/Transactional Cassandra-only clusters on EC2.


I've used vnodes on EC2 without issue.

Regards,
Mark

On 20 February 2015 at 05:08, Clint Kelly clint.ke...@gmail.com wrote:

 Hi all,

 The guide for installing Cassandra on EC2 says that

 Note: The DataStax AMI does not install DataStax Enterprise nodes
 with virtual nodes enabled.


 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

 Just curious why this is the case.  It was my understanding that
 virtual nodes make taking Cassandra nodes on and offline an easier
 process, and that seems like something that an EC2 user would want to
 do quite frequently.

 -Clint



Re: Why no virtual nodes for Cassandra on EC2?

2015-02-20 Thread Clint Kelly
BTW, are the performance concerns with vnodes a big deal for Spark?  Or were
those more important for MapReduce?  Some of the DataStax videos that I
watched discussed how the Cassandra Spark connector has optimizations to
deal with vnodes.

I would imagine that Spark's ability to cache RDDs would mean that paying a
small efficiency cost when reading data out of Cassandra initially might
not be the end of the world (especially given the benefits of using vnodes).

On Fri, Feb 20, 2015 at 8:29 AM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi Mark,

 Thanks for your reply.  That makes sense.  I recall looking at this
 back when we were going to run Hadoop against data in Cassandra tables
 at my previous company.

 Disabling virtual nodes seems unfortunate as it would make (as I
 understand it) scaling the cluster a lot trickier.  I assume there is
 a tradeoff between the performance of analytics jobs and the ease with
 which you can change cluster size.

 -Clint






Re: Why no virtual nodes for Cassandra on EC2?

2015-02-20 Thread Clint Kelly
Hi Mark,

Thanks for your reply.  That makes sense.  I recall looking at this
back when we were going to run Hadoop against data in Cassandra tables
at my previous company.

Disabling virtual nodes seems unfortunate as it would make (as I
understand it) scaling the cluster a lot trickier.  I assume there is
a tradeoff between the performance of analytics jobs and the ease with
which you can change cluster size.

-Clint



On Fri, Feb 20, 2015 at 1:01 AM, Mark Reddy mark.l.re...@gmail.com wrote:
 Hey Clint,

 Someone from DataStax can correct me here, but I'm assuming that they have
 disabled vnodes because the AMI is built to make it easy to set up a
 pre-configured mixed-workload cluster: a mixture of Real-Time/Transactional
 (Cassandra), Analytics (Hadoop), or Search (Solr). If you take a look at the
 getting started guides for both Hadoop and Solr, you will see a paragraph
 instructing the user to disable vnodes for a mixed-workload cluster.

 http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchIntro.html
 http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaStrt.html

 This is specific to the example AMI and that type of workload. This is by no
 means a warning for users to disable vnodes on their Real-Time/Transactional
 Cassandra only clusters on EC2.


 I've used vnodes on EC2 without issue.

 Regards,
 Mark





Why no virtual nodes for Cassandra on EC2?

2015-02-19 Thread Clint Kelly
Hi all,

The guide for installing Cassandra on EC2 says that

Note: The DataStax AMI does not install DataStax Enterprise nodes
with virtual nodes enabled.

http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

Just curious why this is the case.  It was my understanding that
virtual nodes make taking Cassandra nodes on and offline an easier
process, and that seems like something that an EC2 user would want to
do quite frequently.

-Clint