Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Clint Kelly

Hi mck,

I'm not familiar with this ticket, but my understanding was that
performance of Hadoop jobs on C* clusters with vnodes was poor because a
given Hadoop input split has to run many individual scans (one for each
vnode) rather than just a single scan.  I've run C* and Hadoop in
production with a custom input format that used vnodes (and just combined
multiple vnodes in a single input split) and didn't have any issues (the
jobs had many other performance bottlenecks besides starting multiple scans
from C*).
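
As a rough illustration of the approach described above (combining multiple vnode token ranges into a single input split), here is a minimal Python sketch. It is not the custom input format mentioned in this message, and it does not use the Cassandra Hadoop API; the function name, the (start, end) token-range tuples, and the ranges_per_split parameter are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of a vnode token range: (start_token, end_token).
TokenRange = Tuple[int, int]


def combine_ranges_into_splits(
    ranges_by_node: Dict[str, List[TokenRange]],
    ranges_per_split: int,
) -> List[Tuple[str, List[TokenRange]]]:
    """Group each node's token ranges into splits of at most
    ranges_per_split ranges, so that one split covers several vnodes
    instead of exactly one."""
    if ranges_per_split < 1:
        raise ValueError("ranges_per_split must be at least 1")

    splits: List[Tuple[str, List[TokenRange]]] = []
    for node, ranges in ranges_by_node.items():
        # Ranges are only combined within a node, so every split keeps a
        # single preferred location for the task that processes it.
        for i in range(0, len(ranges), ranges_per_split):
            splits.append((node, ranges[i:i + ranges_per_split]))
    return splits
```

Each split still issues one scan per token range it contains; what shrinks is the number of splits (and therefore tasks), not the number of scans.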

This is one of the videos where I recall an off-hand mention of the Spark
connector working with vnodes: https://www.youtube.com/watch?v=1NtnrdIUlg0

Best regards,
Clint




On Sat, Feb 21, 2015 at 2:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the Cassandra
  Spark connector has optimizations to deal with vnodes.


 Are these videos public? If so, do you have a link to them?

 ~mck

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread mck


 … my understanding was that
 performance of Hadoop jobs on C* clusters with vnodes was poor because a
 given Hadoop input split has to run many individual scans (one for each
 vnode) rather than just a single scan.  I've run C* and Hadoop in
 production with a custom input format that used vnodes (and just combined
 multiple vnodes in a single input split) and didn't have any issues (the
 jobs had many other performance bottlenecks besides starting multiple
 scans from C*).

You've described the ticket, and how it has been solved :-)

 This is one of the videos where I recall an off-hand mention of the Spark
 connector working with vnodes:
 https://www.youtube.com/watch?v=1NtnrdIUlg0

Thanks.

~mck

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Eric Stevens

Vnodes are officially discouraged for DSE Solr integration (though a
small number isn't ruinous). That might be why they still don't enable them
by default.
On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the Cassandra
  Spark connector has optimizations to deal with vnodes.


 Are these videos public? If so, do you have a link to them?

 ~mck

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Jack Krupansky

DSE 4.6 improved Solr vnode performance dramatically, so vnodes for
Search workloads are no longer officially discouraged. Per the official
doc for improvements: *Ability to use virtual nodes (vnodes) in
Solr nodes. Recommended range: 64 to 256 (overhead increases by
approximately 30%)*. A vnode token count of 64 or 32 would reduce that
overhead further. And... the new 4.6 feature of being able to direct a Solr
query to a specific partition essentially eliminates that overhead entirely.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com wrote:

 Vnodes are officially discouraged for DSE Solr integration (though a
 small number isn't ruinous). That might be why they still don't enable them
 by default.
 On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the Cassandra
  Spark connector has optimizations to deal with vnodes.


 Are these videos public? If so, do you have a link to them?

 ~mck

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Jack Krupansky

Thanks for pointing out a mistake in the doc - that statement (for
Search/Solr) was simply a leftover from before 4.6. Besides, it's in the
Analytics section, which is not relevant for Search/Solr anyway.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens migh...@gmail.com wrote:

30% overhead is pretty brutal. I think this is basic support for it, and
not necessarily a recommendation to use it.

From

http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html?scroll=anaNdeOps__implicationsVnodes

*DataStax does not recommend turning on vnodes *for other Hadoop use
cases *or for Solr nodes*, but you can use vnodes for any Cassandra-only
cluster, or a Cassandra-only data center in a mixed Hadoop/Solr/Cassandra
deployment. If you have enabled virtual nodes on Hadoop nodes, disable
virtual nodes before using the cluster.

On Mon, Feb 23, 2015 at 9:34 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

DSE 4.6 improved Solr vnode performance dramatically, so vnodes for
Search workloads are no longer officially discouraged. Per the official
doc for improvements: *Ability to use virtual nodes (vnodes)
in Solr nodes. Recommended range: 64 to 256 (overhead increases by
approximately 30%)*. A vnode token count of 64 or 32 would reduce that
overhead further. And... the new 4.6 feature of being able to direct a Solr
query to a specific partition essentially eliminates that overhead entirely.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com wrote:

Vnodes are officially discouraged for DSE Solr integration (though a
small number isn't ruinous). That might be why they still don't enable them
by default.
On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

At least the problem of hadoop and vnodes described in CASSANDRA-6091
doesn't apply to spark.
(Spark already allows multiple token ranges per split).

If this is the reason why DSE hasn't enabled vnodes then fingers crossed
that'll change soon.

Some of the DataStax videos that I watched discussed how the
Cassandra Spark connector has optimizations to deal with vnodes.

Are these videos public? If so, do you have a link to them?

~mck

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Eric Stevens

That link is the one from the 4.6 New Features page:
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html

- Ability to use virtual nodes (vnodes)
  http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html#anaNdeOps__implicationsVnodes
  in Solr nodes. Recommended range: 64 to 256 (overhead increases by
  approximately 30%)

Anyway, thanks for clearing this up Jack. This overhead is on queries
only, right?

On Mon, Feb 23, 2015 at 10:03 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens migh...@gmail.com wrote:

30% overhead is pretty brutal. I think this is basic support for it, and
not necessarily a recommendation to use it.

From

http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html?scroll=anaNdeOps__implicationsVnodes

On Mon, Feb 23, 2015 at 9:34 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

DSE 4.6 improved Solr vnode performance dramatically, so vnodes for
Search workloads are no longer officially discouraged. Per the official
doc for improvements: *Ability to use virtual nodes
(vnodes) in Solr nodes. Recommended range: 64 to 256 (overhead increases by
approximately 30%)*. A vnode token count of 64 or 32 would reduce that
overhead further. And... the new 4.6 feature of being able to direct a Solr
query to a specific partition essentially eliminates that overhead entirely.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com
wrote:

At least the problem of hadoop and vnodes described in CASSANDRA-6091
doesn't apply to spark.
(Spark already allows multiple token ranges per split).

If this is the reason why DSE hasn't enabled vnodes then fingers crossed
that'll change soon.

Some of the DataStax videos that I watched discussed how the
Cassandra Spark connector has optimizations to deal with vnodes.

Are these videos public? If so, do you have a link to them?

~mck

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Jack Krupansky

Right, and subject to the techniques for reducing that overhead that I
listed. In fact, I would recommend simply picking the largest number of
tokens for which the overhead is acceptable for your app, even if it is
only 8 or 16 tokens, though 16, 32, or 64 may be sufficient for most apps.
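
The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and its inputs are hypothetical, and the overhead figures would have to come from your own benchmarks of your own queries at each candidate token count (nothing here measures them or reads Cassandra configuration).

```python
from typing import Dict, Optional


def pick_num_tokens(
    measured_overhead: Dict[int, float],
    max_overhead: float,
) -> Optional[int]:
    """Return the largest token count whose measured overhead is acceptable.

    measured_overhead maps a candidate token count to the query overhead
    you observed for it, expressed as a fraction (0.30 means 30%).
    Returns None when no candidate meets the threshold.
    """
    acceptable = [
        tokens
        for tokens, overhead in measured_overhead.items()
        if overhead <= max_overhead
    ]
    return max(acceptable) if acceptable else None
```

The result is the value you would then consider for the node's token count setting; returning None signals that none of the benchmarked candidates was acceptable.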

-- Jack Krupansky

On Mon, Feb 23, 2015 at 3:01 PM, Eric Stevens migh...@gmail.com wrote:

That link is the one from the 4.6 New Features page:
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html

- Ability to use virtual nodes (vnodes)

Anyway, thanks for clearing this up Jack. This overhead is on queries
only, right?

On Mon, Feb 23, 2015 at 10:03 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens migh...@gmail.com wrote:

30% overhead is pretty brutal. I think this is basic support for it,
and not necessarily a recommendation to use it.

From

http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html?scroll=anaNdeOps__implicationsVnodes

*DataStax does not recommend turning on vnodes *for other Hadoop use
cases *or for Solr nodes*, but you can use vnodes for any
Cassandra-only cluster, or a Cassandra-only data center in a mixed
Hadoop/Solr/Cassandra deployment. If you have enabled virtual nodes on
Hadoop nodes, disable virtual nodes before using the cluster.

On Mon, Feb 23, 2015 at 9:34 AM, Jack Krupansky
jack.krupan...@gmail.com wrote:

DSE 4.6 improved Solr vnode performance dramatically, so vnodes
for Search workloads are no longer officially discouraged. Per the
official doc for improvements: *Ability to use virtual nodes
(vnodes) in Solr nodes. Recommended range: 64 to 256 (overhead increases by
approximately 30%)*. A vnode token count of 64 or 32 would reduce
that overhead further. And... the new 4.6 feature of being able to direct a
Solr query to a specific partition essentially eliminates that overhead
entirely.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com
wrote:

Vnodes are officially discouraged for DSE Solr integration (though a
small number isn't ruinous). That might be why they still don't enable them
by default.
On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

At least the problem of hadoop and vnodes described in CASSANDRA-6091
doesn't apply to spark.
(Spark already allows multiple token ranges per split).

If this is the reason why DSE hasn't enabled vnodes then fingers crossed
that'll change soon.

Some of the DataStax videos that I watched discussed how the
Cassandra Spark connector has optimizations to deal with vnodes.

Are these videos public? If so, do you have a link to them?

~mck

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-21 Thread mck

At least the problem of hadoop and vnodes described in CASSANDRA-6091
doesn't apply to spark.
 (Spark already allows multiple token ranges per split).

If this is the reason why DSE hasn't enabled vnodes then fingers crossed
that'll change soon.


 Some of the DataStax videos that I watched discussed how the Cassandra
 Spark connector has optimizations to deal with vnodes.


Are these videos public? If so, do you have a link to them?

~mck

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-20 Thread Mark Reddy

Hey Clint,

Someone from DataStax can correct me here, but I'm assuming that they have
disabled vnodes because the AMI is built to make it easy to set up a
pre-configured mixed-workload cluster: a mixture of Real-Time/Transactional
(Cassandra), Analytics (Hadoop), and Search (Solr). If you take a look at the
getting started guide for both Hadoop and Solr you will see a paragraph
instructing the user to disable vnodes for a mixed-workload cluster.

http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchIntro.html
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaStrt.html

This is specific to the example AMI and that type of workload. This is by
no means a warning for users to disable vnodes on their
Real-Time/Transactional Cassandra only clusters on EC2.


I've used vnodes on EC2 without issue.

Regards,
Mark

On 20 February 2015 at 05:08, Clint Kelly clint.ke...@gmail.com wrote:

 Hi all,

 The guide for installing Cassandra on EC2 says that

 Note: The DataStax AMI does not install DataStax Enterprise nodes
 with virtual nodes enabled.


 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

 Just curious why this is the case.  It was my understanding that
 virtual nodes make taking Cassandra nodes on and offline an easier
 process, and that seems like something that an EC2 user would want to
 do quite frequently.

 -Clint

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-20 Thread Clint Kelly

BTW are the performance concerns with vnodes a big deal for Spark? Or were
those more important for MapReduce? Some of the DataStax videos that I
watched discussed how the Cassandra Spark connector has optimizations to
deal with vnodes.

I would imagine that Spark's ability to cache RDDs would mean that paying a
small efficiency cost when reading data out of Cassandra initially might
not be the end of the world (especially given the benefits of using vnodes).

On Fri, Feb 20, 2015 at 8:29 AM, Clint Kelly clint.ke...@gmail.com wrote:

Hi Mark,

Thanks for your reply. That makes sense. I recall looking at this
back when we were going to run Hadoop against data in Cassandra tables
at my previous company.

Disabling virtual nodes seems unfortunate as it would make (as I
understand it) scaling the cluster a lot trickier. I assume there is
a tradeoff between the performance of analytics jobs and the ease with
which you can change cluster size.

-Clint

On Fri, Feb 20, 2015 at 1:01 AM, Mark Reddy mark.l.re...@gmail.com
wrote:
Hey Clint,

Someone from DataStax can correct me here, but I'm assuming that they have
disabled vnodes because the AMI is built to make it easy to set up a
pre-configured mixed-workload cluster: a mixture of Real-Time/Transactional
(Cassandra), Analytics (Hadoop), and Search (Solr). If you take a look at the
getting started guide for both Hadoop and Solr you will see a paragraph
instructing the user to disable vnodes for a mixed-workload cluster.

http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchIntro.html

http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaStrt.html

This is specific to the example AMI and that type of workload. This is by no
means a warning for users to disable vnodes on their Real-Time/Transactional
Cassandra only clusters on EC2.

I've used vnodes on EC2 without issue.

Regards,
Mark

On 20 February 2015 at 05:08, Clint Kelly clint.ke...@gmail.com wrote:

Hi all,

The guide for installing Cassandra on EC2 says that

Note: The DataStax AMI does not install DataStax Enterprise nodes
with virtual nodes enabled.

http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

Just curious why this is the case. It was my understanding that
virtual nodes make taking Cassandra nodes on and offline an easier
process, and that seems like something that an EC2 user would want to
do quite frequently.

-Clint

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-20 Thread Clint Kelly

Hi Mark,

Thanks for your reply. That makes sense. I recall looking at this
back when we were going to run Hadoop against data in Cassandra tables
at my previous company.

-Clint

On Fri, Feb 20, 2015 at 1:01 AM, Mark Reddy mark.l.re...@gmail.com wrote:
Hey Clint,

Someone from DataStax can correct me here, but I'm assuming that they have
disabled vnodes because the AMI is built to make it easy to set up a
pre-configured mixed-workload cluster: a mixture of Real-Time/Transactional
(Cassandra), Analytics (Hadoop), and Search (Solr). If you take a look at the
getting started guide for both Hadoop and Solr you will see a paragraph
instructing the user to disable vnodes for a mixed-workload cluster.

http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/srch/srchIntro.html
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaStrt.html

This is specific to the example AMI and that type of workload. This is by no
means a warning for users to disable vnodes on their Real-Time/Transactional
Cassandra only clusters on EC2.

I've used vnodes on EC2 without issue.

Regards,
Mark

On 20 February 2015 at 05:08, Clint Kelly clint.ke...@gmail.com wrote:

Hi all,

The guide for installing Cassandra on EC2 says that

Note: The DataStax AMI does not install DataStax Enterprise nodes
with virtual nodes enabled.

http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

-Clint

Why no virtual nodes for Cassandra on EC2?

2015-02-19 Thread Clint Kelly

Hi all,

The guide for installing Cassandra on EC2 says that

Note: The DataStax AMI does not install DataStax Enterprise nodes
with virtual nodes enabled.

http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

Just curious why this is the case.  It was my understanding that
virtual nodes make taking Cassandra nodes on and offline an easier
process, and that seems like something that an EC2 user would want to
do quite frequently.

-Clint
