Re: Experiences with repairs using vnodes

2014-10-24 Thread Yuki Morishita
If anyone has used the incremental repair feature in a 2.1 environment with
vnodes, I'd like to hear how it is doing.
Validation is the most time-consuming part of repair, and it should be
much better after you switch to incremental.

I did some experiments for CASSANDRA-5220, such as repairing several
ranges together, but that just hammers the node hard for little gain.
So if incremental repair is working as expected, I'll be happy to close
#5220 as won't fix.
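
As a rough back-of-the-envelope sketch of why validation adds up with vnodes: the snippet below assumes the num_tokens default of 256, one validation pass per token range per table, and a repair that covers only the node's primary ranges. The function name and the table count of 10 are made up for illustration.

```python
def validations_per_node(num_tokens: int, tables: int) -> int:
    """Validation passes one node triggers when repairing its primary ranges.

    Assumption: each token range is validated separately for each table.
    """
    return num_tokens * tables


# Hypothetical schema with 10 tables.
single_token = validations_per_node(num_tokens=1, tables=10)
with_vnodes = validations_per_node(num_tokens=256, tables=10)
print(single_token, with_vnodes, with_vnodes // single_token)
```

Under those assumptions the number of validation passes grows linearly with num_tokens, which is the overhead incremental repair is meant to reduce by validating less data per pass.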

On Fri, Oct 24, 2014 at 6:30 PM, Robert Coli  wrote:
> On Fri, Oct 24, 2014 at 3:54 PM, Jack Krupansky 
> wrote:
>>
>> I was wondering if anybody had any specific experiences with repair and
>> bootstrapping new nodes after switching to vnodes that they could share here
>> (or email me privately.) I mean, how was the performance of repair and
>> bootstrap impacted, cluster reliability, cluster load, ease of maintaining
>> the cluster, more confidence in maintaining the cluster, or... whatever else
>> may have been impacted. IOW, what actual benefit/change did you experience
>> firsthand. Thanks!
>
>
> While I don't personally yet use vnodes, my understanding is...
>
> Repair gets much slower, bootstrapping gets faster and better distributed.
>
> https://issues.apache.org/jira/browse/CASSANDRA-5220
>
> Is a good starting point for the web of related JIRA tickets.
>
> =Rob
> http://twitter.com/rcolidba



-- 
Yuki Morishita
 t:yukim (http://twitter.com/yukim)


Re: Experiences with repairs using vnodes

2014-10-24 Thread Robert Coli
On Fri, Oct 24, 2014 at 3:54 PM, Jack Krupansky 
wrote:

>   I was wondering if anybody had any specific experiences with repair and
> bootstrapping new nodes after switching to vnodes that they could share
> here (or email me privately.) I mean, how was the performance of repair and
> bootstrap impacted, cluster reliability, cluster load, ease of maintaining
> the cluster, more confidence in maintaining the cluster, or... whatever
> else may have been impacted. IOW, what actual benefit/change did you
> experience firsthand. Thanks!
>

While I don't personally yet use vnodes, my understanding is...

Repair gets much slower, bootstrapping gets faster and better distributed.

https://issues.apache.org/jira/browse/CASSANDRA-5220

Is a good starting point for the web of related JIRA tickets.

=Rob
http://twitter.com/rcolidba


Experiences with repairs using vnodes

2014-10-24 Thread Jack Krupansky
I was wondering if anybody had any specific experiences with repair and 
bootstrapping new nodes after switching to vnodes that they could share here 
(or email me privately.) I mean, how was the performance of repair and 
bootstrap impacted, cluster reliability, cluster load, ease of maintaining the 
cluster, more confidence in maintaining the cluster, or... whatever else may 
have been impacted. IOW, what actual benefit/change did you experience 
firsthand. Thanks!

-- Jack Krupansky

Re: Intermittent long application pauses on nodes

2014-10-24 Thread graham sanderson
And -XX:SafepointTimeoutDelay=xxx

to set how long before it dumps output (defaults to 10000 ms, I believe)…

Note it doesn’t actually timeout by default, it just prints the problematic 
threads after that time and keeps on waiting
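
A minimal sketch of how the two flags fit together, written as a small helper that prints lines one could append to cassandra-env.sh. The helper name is hypothetical, and the 10000 ms default used here is an assumption worth confirming against your own JVM (for example with -XX:+PrintFlagsFinal).

```python
from typing import List


def safepoint_timeout_opts(delay_ms: int = 10000) -> List[str]:
    """Build the two HotSpot options discussed above.

    delay_ms is how long the VM waits for threads to reach the safepoint
    before it prints the threads that have not arrived yet.
    """
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    return ["-XX:+SafepointTimeout", f"-XX:SafepointTimeoutDelay={delay_ms}"]


# Example: report stragglers after 2 seconds instead of the assumed default.
for opt in safepoint_timeout_opts(2000):
    print(f'JVM_OPTS="$JVM_OPTS {opt}"')
```

Lowering the delay below the pause length you are chasing should make the report appear while the stall is still in progress.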

> On Oct 24, 2014, at 2:44 PM, graham sanderson  wrote:
> 
> Actually - there is 
> 
> -XX:+SafepointTimeout
> 
> which will print out offending threads (assuming you reach a 10 second pause)…
> 
> That is probably your best bet.
> 
>> On Oct 24, 2014, at 2:38 PM, graham sanderson wrote:
>> 
>> This certainly sounds like a JVM bug.
>> 
>> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
>> and don’t seem to have seen this (note we are on 7u67, so that might be an 
>> interesting data point, though since the old thread predated that probably 
>> not)
>> 
>> 1) From the app/java side, I’d obviously see if you can identify anything 
>> which always coincides with this - repair, compaction etc
>> 2) From the VM side (given that this as Benedict mentioned) some threads are 
>> taking a long time to rendezvous at the safe point, and it is probably not 
>> application threads, I’d look what GC threads, compiler threads etc might be 
>> doing. As mentioned it shouldn’t be anything to do with operations which run 
>> at a safe point anyway (e.g. scavenge)
>>  a) So look at what CMS is doing at the time and see if you can correlate
>>  b) Check Oracle for related bugs - didn’t obviously see any, but there 
>> have been some complaints related to compilation and safe points
>>  c) Add any compilation tracing you can
>>  d) Kind of important here - see if you can figure out via dtrace, 
>> system tap, gdb or whatever, what the threads are doing when this happens. 
>> Sadly it doesn’t look like you can figure out when this is happening (until 
>> afterwards) unless you have access to a debug JVM build (and can turn on 
>> -XX:+TraceSafepoint and look for a safe point start without a corresponding 
>> update within a time period) - if you don’t have access to that, I guess you 
>> could try and get a dump every 2-3 seconds (you should catch a 9 second 
>> pause eventually!)
>> 
>>> On Oct 24, 2014, at 12:35 PM, Dan van Kley wrote:
>>> 
>>> I'm also curious to know if this was ever resolved or if there's any other 
>>> recommended steps to take to continue to track it down. I'm seeing the same 
>>> issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
>>> 1.7u71, using the CMS collector. Just as described above, the issue is long 
>>> "Total time for which application threads were stopped" pauses that are not 
>>> a direct result of GC pauses (ParNew, initial mark or remark). When I 
>>> enabled the safepoint logging I saw the same result, long "sync" pause 
>>> times with short spin and block times, usually with the "RevokeBias" 
>>> description. We're seeing pause times sometimes in excess of 10 seconds, so 
>>> it's a pretty debilitating issue. Our machines are not swapping (or even 
>>> close to it) or having other load issues when these pauses occur. Any ideas 
>>> would be very appreciated. Thanks!
>> 
> 





Re: Intermittent long application pauses on nodes

2014-10-24 Thread graham sanderson
Actually - there is 

-XX:+SafepointTimeout

which will print out offending threads (assuming you reach a 10 second pause)…

That is probably your best bet.

> On Oct 24, 2014, at 2:38 PM, graham sanderson  wrote:
> 
> This certainly sounds like a JVM bug.
> 
> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
> and don’t seem to have seen this (note we are on 7u67, so that might be an 
> interesting data point, though since the old thread predated that probably 
> not)
> 
> 1) From the app/java side, I’d obviously see if you can identify anything 
> which always coincides with this - repair, compaction etc
> 2) From the VM side (given that this as Benedict mentioned) some threads are 
> taking a long time to rendezvous at the safe point, and it is probably not 
> application threads, I’d look what GC threads, compiler threads etc might be 
> doing. As mentioned it shouldn’t be anything to do with operations which run 
> at a safe point anyway (e.g. scavenge)
>   a) So look at what CMS is doing at the time and see if you can correlate
>   b) Check Oracle for related bugs - didn’t obviously see any, but there 
> have been some complaints related to compilation and safe points
>   c) Add any compilation tracing you can
>   d) Kind of important here - see if you can figure out via dtrace, 
> system tap, gdb or whatever, what the threads are doing when this happens. 
> Sadly it doesn’t look like you can figure out when this is happening (until 
> afterwards) unless you have access to a debug JVM build (and can turn on 
> -XX:+TraceSafepoint and look for a safe point start without a corresponding 
> update within a time period) - if you don’t have access to that, I guess you 
> could try and get a dump every 2-3 seconds (you should catch a 9 second pause 
> eventually!)
> 
>> On Oct 24, 2014, at 12:35 PM, Dan van Kley wrote:
>> 
>> I'm also curious to know if this was ever resolved or if there's any other 
>> recommended steps to take to continue to track it down. I'm seeing the same 
>> issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
>> 1.7u71, using the CMS collector. Just as described above, the issue is long 
>> "Total time for which application threads were stopped" pauses that are not 
>> a direct result of GC pauses (ParNew, initial mark or remark). When I 
>> enabled the safepoint logging I saw the same result, long "sync" pause times 
>> with short spin and block times, usually with the "RevokeBias" description. 
>> We're seeing pause times sometimes in excess of 10 seconds, so it's a pretty 
>> debilitating issue. Our machines are not swapping (or even close to it) or 
>> having other load issues when these pauses occur. Any ideas would be very 
>> appreciated. Thanks!
> 





Re: are repairs in 2.0 more expensive than in 1.2

2014-10-24 Thread Janne Jalkanen

Commented and added a munin graph, if it helps. For the record, I’m happy with 
-par performance for now.

/Janne

On 24 Oct 2014, at 18:59, Sean Bridges  wrote:

> Janne,
> 
> I filed CASSANDRA-8177 [1] for this.  Maybe comment on the jira that you are 
> having the same problem.
> 
> Sean
> 
> [1]  https://issues.apache.org/jira/browse/CASSANDRA-8177
> 
> On Thu, Oct 23, 2014 at 2:04 PM, Janne Jalkanen  
> wrote:
> 
> On 23 Oct 2014, at 21:29 , Robert Coli  wrote:
> 
>> On Thu, Oct 23, 2014 at 9:33 AM, Sean Bridges  wrote:
>> The change from parallel to sequential is very dramatic.  For a small 
>> cluster with 3 nodes, using cassandra 2.0.10,  a parallel repair takes 2 
>> hours, and io throughput peaks at 6 mb/s.  Sequential repair takes 40 hours, 
>> with average io around 27 mb/s.  Should I file a jira?
>> 
>> As you are an actual user actually encountering the problem I had only 
>> conjectured about, you are the person best suited to file such a ticket on 
>> the reasonableness of the -par default. :D
> 
> Hm?  I’ve been banging my head against the exact same problem (cluster size 
> five nodes, RF=3, ~40GB/node) - parallel repair takes about 6 hrs whereas 
> serial takes some 48 hours or so. In addition, the compaction impact is 
> roughly the same - that is, there’s the same number of compactions triggered 
> per minute, but serial runs eight times more of them. There does not seem to 
> be a difference between the node response latency during parallel or serial 
> repair.
> 
> NB: We do increase our compaction throughput during calmer times, and lower 
> it through busy times, and the serial compaction takes enough time to hit the 
> busy period - that might also have an impact to the overall performance.
> 
> If I had known that this had so far been a theoretical problem, I would’ve 
> spoken up earlier. Perhaps serial repair is not the best default.
> 
> /Janne
> 
> 
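
For reference, the slowdown factors implied by the two reports quoted above can be worked out directly. The hour figures are the approximate ones given in the thread; the dictionary labels and function name are just for illustration.

```python
def slowdown(parallel_hours: float, sequential_hours: float) -> float:
    """How many times longer the sequential repair took."""
    return sequential_hours / parallel_hours


# (parallel hours, sequential hours) as reported in the thread.
reports = {
    "Sean (3 nodes, 2.0.10)": (2, 40),
    "Janne (5 nodes, RF=3, ~40GB/node)": (6, 48),
}

for who, (par, seq) in reports.items():
    print(f"{who}: sequential is {slowdown(par, seq):.0f}x slower")
```

The factor of 8 for the second cluster lines up with the observation that serial repair ran about eight times as many compactions at the same per-minute rate.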



Re: Intermittent long application pauses on nodes

2014-10-24 Thread graham sanderson
This certainly sounds like a JVM bug.

We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
and don’t seem to have seen this (note we are on 7u67, so that might be an 
interesting data point, though since the old thread predated that probably not)

1) From the app/java side, I’d obviously see if you can identify anything which 
always coincides with this - repair, compaction etc
2) From the VM side (given that, as Benedict mentioned, some threads are 
taking a long time to rendezvous at the safe point, and it is probably not 
application threads), I’d look at what GC threads, compiler threads etc might 
be doing. As mentioned, it shouldn’t be anything to do with operations which 
run at a safe point anyway (e.g. scavenge)
  a) So look at what CMS is doing at the time and see if you can correlate
  b) Check Oracle for related bugs - didn’t obviously see any, but there 
     have been some complaints related to compilation and safe points
  c) Add any compilation tracing you can
  d) Kind of important here - see if you can figure out via dtrace, 
     system tap, gdb or whatever, what the threads are doing when this 
     happens. Sadly it doesn’t look like you can figure out when this is 
     happening (until afterwards) unless you have access to a debug JVM build 
     (and can turn on -XX:+TraceSafepoint and look for a safe point start 
     without a corresponding update within a time period) - if you don’t have 
     access to that, I guess you could try and get a dump every 2-3 seconds 
     (you should catch a 9 second pause eventually!)
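
The "dump every 2-3 seconds" idea could be sketched roughly as below. This assumes the JDK's jstack tool is on the PATH; the function name, parameters and defaults are hypothetical. One caveat to keep in mind: a regular jstack dump itself needs the VM to reach a safepoint, so during a stall the call may well block, which is why each attempt has a timeout and a timed-out attempt is recorded rather than ignored.

```python
import subprocess
import time
from typing import Callable, List, Optional


def collect_dumps(
    pid: int,
    count: int,
    interval_s: float = 2.0,
    timeout_s: float = 5.0,
    run: Callable = subprocess.run,
    sleep: Callable = time.sleep,
) -> List[Optional[str]]:
    """Take `count` thread dumps of process `pid`, `interval_s` apart.

    Returns one entry per attempt: the dump text, or None if the dump
    did not finish within `timeout_s` (itself a hint that the VM is stuck).
    """
    dumps: List[Optional[str]] = []
    for attempt in range(count):
        try:
            result = run(
                ["jstack", "-l", str(pid)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            dumps.append(result.stdout)
        except subprocess.TimeoutExpired:
            dumps.append(None)
        if attempt < count - 1:
            sleep(interval_s)
    return dumps
```

Calling collect_dumps(cassandra_pid, count=30) would cover roughly a minute. If every dump taken during a pause times out, a native-level tool such as gdb, as suggested above, is probably the better route.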

> On Oct 24, 2014, at 12:35 PM, Dan van Kley  wrote:
> 
> I'm also curious to know if this was ever resolved or if there's any other 
> recommended steps to take to continue to track it down. I'm seeing the same 
> issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
> 1.7u71, using the CMS collector. Just as described above, the issue is long 
> "Total time for which application threads were stopped" pauses that are not a 
> direct result of GC pauses (ParNew, initial mark or remark). When I enabled 
> the safepoint logging I saw the same result, long "sync" pause times with 
> short spin and block times, usually with the "RevokeBias" description. We're 
> seeing pause times sometimes in excess of 10 seconds, so it's a pretty 
> debilitating issue. Our machines are not swapping (or even close to it) or 
> having other load issues when these pauses occur. Any ideas would be very 
> appreciated. Thanks!





Dependency Hell: STORM 0.9.2 and Cassandra 2.0

2014-10-24 Thread Gary Zhao
Hello

Has anyone encountered the following issue, and is there any workaround? Our
Storm topology was written in Clojure.


Our team is upgrading one of our Storm topologies from Cassandra 1.2 to
Cassandra 2.0, and we have found one problem that is difficult to tackle.
The Cassandra 2.0 Java driver requires Google Guava 1.6. Unfortunately, Storm
0.9.2 provides a lower version. Because of that, the topology is not
able to contact Cassandra databases.

Thanks
Gary


[RELEASE] Apache Cassandra 2.0.11 released

2014-10-24 Thread Sylvain Lebresne
The Cassandra team is pleased to announce the release of Apache Cassandra
version 2.0.11.

Apache Cassandra is a fully distributed database. It is the right choice
when you need scalability and high availability without compromising
performance.

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

 http://cassandra.apache.org/download/

This version is a bug fix release[1] on the 2.0 series. As always, please pay
attention to the release notes[2] and let us know[3] if you were to encounter
any problem.

Enjoy!

[1]: http://goo.gl/pMBdRa (CHANGES.txt)
[2]: http://goo.gl/ZYN0Ji (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA


Re: [RELEASE] Apache Cassandra 2.1.1 released

2014-10-24 Thread Ben Hood
On Fri, Oct 24, 2014 at 6:36 PM, Ben Hood <0x6e6...@gmail.com> wrote:
> Or does the release require time to propagate itself out?

The ccm team inform me that the binaries might take up to 48 hours to
propagate their way out.


Re: [RELEASE] Apache Cassandra 2.1.1 released

2014-10-24 Thread Ben Hood
Thanks very much for this maintenance release :-)

Are there any known issues with ccm on 2.1.1 (see trace below)?

Or does the release require time to propagate itself out?

Traceback (most recent call last):
  File "/usr/local/bin/ccm", line 4, in <module>
    __import__('pkg_resources').run_script('ccm==1.2', 'ccm')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 517, in run_script
    """Add `dist` to working set, associated with `entry`
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1430, in run_script
    return real_path
  File "/usr/local/lib/python2.7/dist-packages/ccm-1.2-py2.7.egg/EGG-INFO/scripts/ccm", line 72, in <module>

  File "build/bdist.linux-x86_64/egg/ccmlib/cmds/cluster_cmds.py", line 99, in run
  File "build/bdist.linux-x86_64/egg/ccmlib/cluster.py", line 43, in __init__
  File "build/bdist.linux-x86_64/egg/ccmlib/repository.py", line 38, in setup
  File "build/bdist.linux-x86_64/egg/ccmlib/repository.py", line 151, in download_version

ccmlib.common.ArgumentError: Invalid version 2.1.1 (underlying error is: HTTP Error 404: Not Found)

On Fri, Oct 24, 2014 at 6:02 PM, Sylvain Lebresne  wrote:
> The Cassandra team is pleased to announce the release of Apache Cassandra
> version 2.1.1.
>
> Apache Cassandra is a fully distributed database. It is the right choice
> when you need scalability and high availability without compromising
> performance.
>
>  http://cassandra.apache.org/
>
> Downloads of source and binary distributions are listed in our download
> section:
>
>  http://cassandra.apache.org/download/
>
> This version is a bug fix release[1] on the 2.1 series. As always, please pay
> attention to the release notes[2] and let us know[3] if you were to encounter
> any problem.
>
> Enjoy!
>
> [1]: http://goo.gl/ytYBFb (CHANGES.txt)
> [2]: http://goo.gl/cQW3RF (NEWS.txt)
> [3]: https://issues.apache.org/jira/browse/CASSANDRA
>


Re: Intermittent long application pauses on nodes

2014-10-24 Thread Dan van Kley
I'm also curious to know if this was ever resolved or if there's any other
recommended steps to take to continue to track it down. I'm seeing the same
issue in our production cluster, which is running Cassandra 2.0.10 and JVM
1.7u71, using the CMS collector. Just as described above, the issue is long
"Total time for which application threads were stopped" pauses that are not
a direct result of GC pauses (ParNew, initial mark or remark). When I
enabled the safepoint logging I saw the same result, long "sync" pause
times with short spin and block times, usually with the "RevokeBias"
description. We're seeing pause times sometimes in excess of 10 seconds, so
it's a pretty debilitating issue. Our machines are not swapping (or even
close to it) or having other load issues when these pauses occur. Any ideas
would be very appreciated. Thanks!
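
One way to quantify these pauses is to scan the GC log for the "Total time for which application threads were stopped" lines (the ones written when -XX:+PrintGCApplicationStoppedTime is enabled) and keep only the long ones. This is a minimal sketch: the function name and threshold are arbitrary, and the two sample lines are made up for illustration, not taken from a real log.

```python
import re
from typing import Iterable, List

# Matches the stopped-time line; some locales print a decimal comma.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9]+[.,][0-9]+) seconds"
)


def long_pauses(lines: Iterable[str], threshold_s: float = 1.0) -> List[float]:
    """Return every stopped time (in seconds) at or above threshold_s."""
    pauses = []
    for line in lines:
        match = STOPPED_RE.search(line)
        if match:
            seconds = float(match.group(1).replace(",", "."))
            if seconds >= threshold_s:
                pauses.append(seconds)
    return pauses


# Made-up sample lines in the usual log shape.
sample = [
    "100.001: Total time for which application threads were stopped: 0.0123450 seconds",
    "110.500: Total time for which application threads were stopped: 10.2000000 seconds",
]
print(long_pauses(sample))
```

Comparing the timestamps of the long entries against the ParNew and CMS entries in the same log shows which pauses are not explained by GC work.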


[RELEASE] Apache Cassandra 2.1.1 released

2014-10-24 Thread Sylvain Lebresne
The Cassandra team is pleased to announce the release of Apache Cassandra
version 2.1.1.

Apache Cassandra is a fully distributed database. It is the right choice
when you need scalability and high availability without compromising
performance.

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

 http://cassandra.apache.org/download/

This version is a bug fix release[1] on the 2.1 series. As always, please pay
attention to the release notes[2] and let us know[3] if you were to encounter
any problem.

Enjoy!

[1]: http://goo.gl/ytYBFb (CHANGES.txt)
[2]: http://goo.gl/cQW3RF (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA


Re: Empty cqlsh cells vs. null

2014-10-24 Thread Tyler Hobbs
On Fri, Oct 24, 2014 at 6:38 AM, Jens Rantil  wrote:

>
> Just to clarify, I am seeing three types of output for an int field. It’s
> either:
>  * Empty output. Nothing. Nil. Also ‘’.
>  * An integer written in green. Regexp: [0-9]+
>  * Explicitly ‘null’ written in red letters.
>

Some types (including ints) accept an empty string/ByteBuffer as a valid
value.  This is distinct from null, or no cell being present.  This
behavior is primarily a legacy from the Thrift days.

-- 
Tyler Hobbs
DataStax 
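
To illustrate the three cases, here is a small sketch of how a client might decode the raw bytes of an int cell. It assumes the value arrives as either no value at all (None), a zero-length byte string, or a 4-byte big-endian signed integer; the function and the EMPTY sentinel are hypothetical and not the API of cqlsh or of any driver.

```python
import struct
from typing import Optional, Union

EMPTY = "empty"  # hypothetical sentinel for a zero-length value


def decode_int(cell: Optional[bytes]) -> Union[int, str, None]:
    """Decode an int cell, keeping null and empty apart."""
    if cell is None:
        return None   # no cell present: shown as null
    if len(cell) == 0:
        return EMPTY  # cell present but zero bytes long: shown as blank
    if len(cell) != 4:
        raise ValueError("an int cell must be 0 or 4 bytes long")
    return struct.unpack(">i", cell)[0]


print(decode_int(None), decode_int(b""), decode_int(b"\x00\x00\x00\x2a"))
```

A writer that serializes a missing value as a zero-length buffer instead of omitting the cell would produce the blank output rather than null.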


Re: are repairs in 2.0 more expensive than in 1.2

2014-10-24 Thread Sean Bridges
Janne,

I filed CASSANDRA-8177 [1] for this.  Maybe comment on the jira that you
are having the same problem.

Sean

[1]  https://issues.apache.org/jira/browse/CASSANDRA-8177

On Thu, Oct 23, 2014 at 2:04 PM, Janne Jalkanen 
wrote:

>
> On 23 Oct 2014, at 21:29 , Robert Coli  wrote:
>
> On Thu, Oct 23, 2014 at 9:33 AM, Sean Bridges 
> wrote:
>
>> The change from parallel to sequential is very dramatic.  For a small
>> cluster with 3 nodes, using cassandra 2.0.10,  a parallel repair takes 2
>> hours, and io throughput peaks at 6 mb/s.  Sequential repair takes 40
>> hours, with average io around 27 mb/s.  Should I file a jira?
>>
>
> As you are an actual user actually encountering the problem I had only
> conjectured about, you are the person best suited to file such a ticket on
> the reasonableness of the -par default. :D
>
>
> Hm?  I’ve been banging my head against the exact same problem (cluster
> size five nodes, RF=3, ~40GB/node) - parallel repair takes about 6 hrs
> whereas serial takes some 48 hours or so. In addition, the compaction
> impact is roughly the same - that is, there’s the same number of
> compactions triggered per minute, but serial runs eight times more of them.
> There does not seem to be a difference between the node response latency
> during parallel or serial repair.
>
> NB: We do increase our compaction throughput during calmer times, and
> lower it through busy times, and the serial compaction takes enough time to
> hit the busy period - that might also have an impact to the overall
> performance.
>
> If I had known that this had so far been a theoretical problem, I would’ve
> spoken up earlier. Perhaps serial repair is not the best default.
>
> /Janne
>
>


Re: Empty cqlsh cells vs. null

2014-10-24 Thread Jens Rantil
> What do you mean by “cqlsh explicitly writes ‘null’ in those cells”? Are 
> you seeing the textual value “null” written in the cells?

Just to clarify, I am seeing three types of output for an int field. It’s 
either:

 * Empty output. Nothing. Nil. Also ‘’.
 * An integer written in green. Regexp: [0-9]+
 * Explicitly ‘null’ written in red letters.

My question concerns what the difference between empty output and ‘null’ is. 
I’m also curious how my Datastax Java driver will handle this, but that’ll be 
my next quest, I guess.




Thanks,

Jens


———
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se


On Thu, Oct 23, 2014 at 4:36 PM, DuyHai Doan  wrote:

> Hello Jens
> What do you mean by "cqlsh explicitly writes 'null' in those cells"? Are
> you seeing the textual value "null" written in the cells?
>  Null in CQL can have 2 meanings:
> 1. the column did not exist (or more precisely, has never been created)
> 2. the column did exist sometimes in the past (has been created) but then
> has been deleted (tombstones)
> On Thu, Oct 23, 2014 at 8:37 AM, Jens Rantil  wrote:
>>  Hi,
>>
>> Not sure this is a Datastax specific question to be asked elsewhere. In
>> that case, let me know.
>>
>> Anyway, I have populated a Cassandra table from DSE Hive. When I fire up
>> cqlsh and execute a SELECT against the table I have columns of INT type
>> that are empty. At first I thought these were null, but it turns out that
>> cqlsh explicitly writes "null" in those cells. What can I make of this? A
>> bug in Hive serialization to Cassandra?
>>
>> Cheers,
>> Jens
>>
>> —
>> Sent from Mailbox 
>>

Re: Empty cqlsh cells vs. null

2014-10-24 Thread Jens Rantil
This is interesting, because I am definitely seeing three different types of 
values. See attached screenshot and link.




Link: https://gist.github.com/JensRantil/d162801812ca48ad3f75

Image/screenshot: https://www.dropbox.com/s/vczzgrf0vk9adzk/cqlsh-int.png?dl=0


Cheers,

Jens


———
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se


On Thu, Oct 23, 2014 at 4:40 PM, Adam Holmberg 
wrote:

> 'null' is how cqlsh displays empty cells:
> https://github.com/apache/cassandra/blob/trunk/pylib/cqlshlib/formatting.py#L47-L58
> On Thu, Oct 23, 2014 at 9:36 AM, DuyHai Doan  wrote:
>> Hello Jens
>>
>> What do you mean by "cqlsh explicitly writes 'null' in those cells"?
>> Are you seeing the textual value "null" written in the cells?
>>
>>
>>  Null in CQL can have 2 meanings:
>>
>> 1. the column did not exist (or more precisely, has never been created)
>> 2. the column did exist sometimes in the past (has been created) but then
>> has been deleted (tombstones)
>>
>>
>>
>> On Thu, Oct 23, 2014 at 8:37 AM, Jens Rantil  wrote:
>>
>>>  Hi,
>>>
>>> Not sure this is a Datastax specific question to be asked elsewhere. In
>>> that case, let me know.
>>>
>>> Anyway, I have populated a Cassandra table from DSE Hive. When I fire up
>>> cqlsh and execute a SELECT against the table I have columns of INT type
>>> that are empty. At first I thought these were null, but it turns out that
>>> cqlsh explicitly writes "null" in those cells. What can I make of this? A
>>> bug in Hive serialization to Cassandra?
>>>
>>> Cheers,
>>> Jens
>>>
>>> —
>>> Sent from Mailbox 
>>>
>>
>>