Reaper 1.2 released

2018-07-24 Thread Jonathan Haddad
Hey folks,

Just wanted to share with the list that after a bit of a long wait, we've
released Reaper 1.2.  We have a short blog post here outlining the new
features: https://twitter.com/TheLastPickle/status/1021830663605870592

With each release, performance improvements and stability have been our
primary focus. We're helping quite a few teams manage repair on hundreds
or thousands of nodes with Reaper, so first and foremost we need it to
work well at that sort of scale :)

We also recognize the need for features other than repair, which is why
we've added support for taking & listing cluster-wide snapshots.

Looking forward, we're planning on adding support for more operations and
reporting - we've already got some code to pull thread pool stats, and we'd
like to expose a lot of table-level information as well.

http://cassandra-reaper.io/

Looking forward to hearing your feedback!
-- 
Jon Haddad
Principal Consultant, The Last Pickle


Re: apache cassandra development process and future

2018-07-24 Thread Jeremy Hanna
For full disclosure, I've been in the Apache Cassandra community since 2010 and 
at DataStax since 2012.

So DataStax moved on to focus on things for their customers, effectively 
putting most development effort into DataStax Enterprise.  However, there have 
been a lot of fixes and improvements contributed to the open-source project.  
As far as I can tell from running gitinspector over the project over the last 
year, not only have individuals working at Apple and DataStax contributed a 
large amount of code, but so have people from a variety of consultancies 
(e.g. The Last Pickle) and companies such as Netflix, Uber, Instagram, 
Instaclustr and many others.  That's from the perspective of code 
contribution.  There are also dev list and JIRA ticket discussions, JIRA 
ticket creation (bugs, features, etc.), contribution of documentation (though 
that's rolled up in the codebase), and certainly the invaluable help people 
give on the mailing list, IRC, Stack Overflow, blog posts, etc.  Having tried 
to help promote Cassandra for many years, I'm really happy to see the project 
get its footing and a good cadence, like others have said on this thread.

> Is [DataStax's] new software incompatible with Cassandra?

I can't speak for DataStax, but I believe it will always be compatible from a 
driver/protocol/API perspective.  It will be additive, with the features 
around search indexes, analytics, graph, and security, along with things like 
NodeSync.

For popularity of distributions, I would guess that it's Apache Cassandra first 
and DataStax Enterprise second.  I think Cosmos with an Apache Cassandra API is 
way down the list.  I don't know of anyone using it and I can't find any public 
use cases or blogs about it - happy to be corrected.

> On Jul 19, 2018, at 9:04 AM, Jeff Jirsa  wrote:
> 
> It will (did) slow, but it didn’t (won’t) stop. There’s some really 
> interesting work in the queue, like 
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-14404, 
> that should make a lot of users very happy. 
> 
> -- 
> Jeff Jirsa
> 
> 
> On Jul 19, 2018, at 6:59 AM, Vitaliy Semochkin wrote:
> 
>> Jeff and Rahul thank you very much for clarification.
>> My main concern was the fact that since DataStax left Cassandra
>> project it is unclear if the development speed will significantly slow
>> down,
>> even now it seems documentation site seems abandoned. Though players
>> like Netflix, Apple and Microsoft look promising.
>> On Wed, Jul 18, 2018 at 6:49 PM Rahul Singh wrote:
>>> 
>>> YugaByte!!! <— another Cassandra “compliant" DB - not sure if they 
>>> forked C* or rewrote Cassandra in Go. ;)
>>> https://github.com/YugaByte/yugabyte-db 
>>> 
>>> 
>>> Datastax is Cassandra compliant — and can use the same sstables, at least 
>>> until 6.0 (which uses a patched version of “4.0” that is 2-5x faster) — 
>>> and has the same actual tools that are in the OSS version.
>>> 
>>> Here are some signals from the big players that understand its power and 
>>> the need for it.
>>> 
>>> 1. Azure CosmosDB has a C* compliant API - seems like Managed C* under the 
>>> hood. They used ElasticSearch to run their Azure Search …
>>> 2. Oracle now has a Datastax offering
>>> 3. Mesosphere offers supported versions of Cassandra and Datastax
>>> 4. Kubernetes and related purveyors use Cassandra as a prime example of a 
>>> part of a Kubernetes-backed, cloud-agnostic orchestration framework
>>> 5. What Alain mentioned earlier.
>>> 
>>> 
>>> --
>>> Rahul Singh
>>> rahul.si...@anant.us 
>>> 
>>> Anant Corporation
>>> On Jul 18, 2018, 9:35 AM -0400, Alain RODRIGUEZ wrote:
>>> 
>>> Hello,
>>> 
>>> It's a complex topic that has already been extensively discussed (at least 
>>> the part about Datastax). I am sharing my personal understanding, mostly 
>>> from what I read on the mailing list:
>>> 
>>>> Recently Cassandra eco system became very fragmented
>>> 
>>> 
>>> I would not put ScyllaDB in the same ecosystem as Apache Cassandra. I 
>>> believe it is inspired by Cassandra and claims to be compatible with it up 
>>> to a certain point, but it's not the same software, thus not the same 
>>> users and community.
>>> 
>>> About Datastax, I think they will give you a better idea of their position 
>>> themselves, here or through their support. I believe they have also 
>>> communicated about it already. In any case, I see Datastax as much more 
>>> part of the same ecosystem than ScyllaDB is. Datastax uses a 
>>> patched/forked version of Cassandra (+ some other tools integrated with 
>>> Cassandra, and support). Plus it goes both ways: Datastax greatly 
>>> contributed to making Cassandra what it is now and relies on it (or used 
>>> to, at least). I don't think 

Cassandra crashes after loading data with sstableloader

2018-07-24 Thread Arpan Khandelwal
I need to clone data from one keyspace to another.
We do it by taking a snapshot of keyspace1 and restoring it into keyspace2
using sstableloader.
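For reference, a hedged sketch of that snapshot-and-sstableloader clone
procedure (all paths, host addresses, and the snapshot tag are illustrative
assumptions; the data directory layout depends on your installation, and the
target keyspace's tables must already exist):

```shell
# 1. Snapshot the source keyspace on each node.
nodetool snapshot -t clone_src keyspace1

# 2. Stage the snapshot files under a keyspace2/<table> directory:
#    sstableloader infers the target keyspace and table from the last
#    two components of the path it is given.
mkdir -p /tmp/load/keyspace2/message
cp /var/lib/cassandra/data/keyspace1/message-*/snapshots/clone_src/* \
   /tmp/load/keyspace2/message/

# 3. Stream the staged sstables into the cluster (target schema must exist).
sstableloader -d 10.0.0.1,10.0.0.2 /tmp/load/keyspace2/message
```

This is a sketch against a live cluster, not a runnable script; adapt the
data directory and contact points to your environment.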

Suppose we have the following table, with an index on the hash column. The
table has around 10M rows.
-
CREATE TABLE message (
 id uuid,
 messageid uuid,
 parentid uuid,
 label text,
 properties map<text, text>,
 text1 text,
 text2 text,
 text3 text,
 category text,
 hash text,
 info map<text, text>,
 creationtimestamp bigint,
 lastupdatedtimestamp bigint,
 PRIMARY KEY ( (id) )
 );

CREATE  INDEX  ON message ( hash );
-
Cassandra crashes when I load data using sstableloader. The load happens
correctly, but it seems Cassandra crashes when it tries to build the index
on a table with this much data.

I have two questions.
1. Is there a better way to clone a keyspace?
2. How can I optimize sstableloader to load the data without crashing
Cassandra while building the index?

Thanks
Arpan


Re: concurrent_compactors via JMX

2018-07-24 Thread Alain RODRIGUEZ
Hello Ricardo,

My understanding is that GP2 is better. I think we did some testing in the
past, but I must say I do not remember the exact results. I remember we
also considered IO1 at some point, but we were not convinced by that kind
of EBS (I am not sure whether it was less performant than the docs
suggested or just much more expensive). Maybe test it and form your own
opinion, or wait for someone else's input.

Be aware that the size of a GP2 EBS volume determines its IOPS; the maximum
IOPS is reached at ~3.334 TB, which is also a good dataset size for Cassandra
(1.5-2 TB, with some space spared for compactions).
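As a rough illustration of that scaling, here is the 2018-era gp2 baseline
formula (3 IOPS per provisioned GB, floored at 100 and capped at 10,000
IOPS; check the current AWS EBS documentation, as these limits have changed
since):

```shell
# Baseline gp2 IOPS as a function of volume size (2018-era limits).
gp2_iops() {
  size_gb=$1
  iops=$(( size_gb * 3 ))
  [ "$iops" -lt 100 ] && iops=100
  [ "$iops" -gt 10000 ] && iops=10000
  echo "$iops"
}

gp2_iops 500    # prints 1500
gp2_iops 3334   # ~3.334 TB: prints 10000, the cap Alain mentions
```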

I'd like to deploy on i3.xlarge
>

Still, if you go for i3, of course use the ephemeral drives (NVMe). They are
incredibly fast ;-). Compared with m1.xlarge you should see a substantial
difference. The problem is that with a low number of nodes, it will always
cost more to run i3 than m1. This is often not the case with more machines,
as each node works far more efficiently and you can effectively reduce the
number of nodes. Here, 3 will probably be the minimum number of nodes, and
3 x i3 might cost more than 5/6 x m1 instances. When scaling up, though, you
should come back to an acceptable cost/efficiency ratio. It's your call
whether to continue with m1, m5 or r4 instances in the meantime.

I decided to get safe and scale horizontally with the hardware we have
> tested
>

Yes, this is fair enough and a safe approach. To add new hardware, the best
approach is a data center switch (I will write a post about how to do this
sometime soon).

I'm preparing to migrate inside vpc
>

This too probably goes through a DC switch. I remembered that I asked for
help with this in 2014, and I found the reference for you: it is where I
published the steps I went through to go from EC2 public --> public VPC -->
private VPC. It's old and I did not re-read it, but it worked for us at the
time. I hope you find it useful as well, as the process is detailed step by
step. It should be easy to adapt, and this way you will not forget any step:
http://grokbase.com/t/cassandra/user/1465m9txtw/vpc-aws#20140612k7xq0t280cvyk6waeytxbkx40c


possibly in Multi-AZ.
>

Yes, I recommend you do this. It's incredibly powerful when you know that
with 3 racks and RF=3 (and proper topology/configuration), each rack owns
100% of the data. Thus when operating you can work on one rack at a time
with limited risk; even using quorum, service should stay up no matter what
happens, as long as 2 AZs are completely available. As the cluster grows you
will really appreciate this to prevent some failures and operate safely.
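As a sketch, the layout Alain describes boils down to a keyspace definition
like the one below (names are hypothetical; this assumes a rack-aware snitch
such as Ec2Snitch or GossipingPropertyFileSnitch mapping your 3 AZs to 3
racks in one DC):

```sql
-- RF=3 spread across 3 racks: with NetworkTopologyStrategy, replicas are
-- placed on distinct racks where possible, so each rack ends up owning a
-- full copy of the data.
CREATE KEYSPACE my_ks
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': '3'
  };
```

With this in place, taking one rack (AZ) down for maintenance still leaves
two full replicas, so QUORUM reads and writes keep succeeding.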

PS: I definitely owe you a coffee, actually much more than that!


If we meet we can definitely share a beer (no coffee for me, but I never
say no to a beer ;-)).
But you don't owe me anything; it was, and still is, free. Here we all
share, period. I like to think that knowledge is the only wealth you can
give away while keeping it for yourself. Some even say that knowledge grows
when shared. I used this mailing list myself to ramp up on Cassandra; I have
probably been paying the community back somehow for years now :-). Now it's
even part of my job, it is a part of what we do :). And I like it.
What I invite you to do is help people around you once you are comfortable
with some topics. This way someone else might enjoy this mailing list,
making it a nicer place and helping the community grow ;-).

Still, rest assured that I appreciate the feedback; the fact that you are
grateful shows it was somehow useful to you. That is enough for me.

C*heers
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 19:21 GMT+02:00 Riccardo Ferrari :

> Alain,
>
> I really appreciate your answers! A little typo is not changing the
> valuable content! For sure I will give a shot to your GC settings and come
> back with my findings.
> Right now I have 6 nodes up and running and everything looks good so far
> (at least much better).
>
> I agree, the hardware I am using is quite old, but rather than experimenting
> with new hardware combinations (in prod) I decided to play it safe and scale
> horizontally with the hardware we have tested. I'm preparing to migrate
> inside a VPC and I'd like to deploy on i3.xlarge instances, possibly in
> Multi-AZ.
>
> Speaking of EBS: I gave m3.xlarge + SSD + EBS (400 PIOPS) a quick I/O test.
> SSD looks great for commitlogs; with EBS I might need more guidance. I
> certainly gain in terms of random I/O, but I'd like to hear where you stand
> on IO1 (PIOPS) vs regular GP2. Or better: what are your guidelines
> when using EBS?
>
> Thanks!
>
> PS: I definitely owe you a coffee, actually much more than that!
>
> On Thu, Jul 19, 2018 at 6:24 PM, Alain RODRIGUEZ 
> wrote:
>
>> Ah excuse my confusion. I now understand I guide you through changing the
>>> throughput when you wanted to change the compaction 

Re: Why and How is Cassandra using all my ram ?

2018-07-24 Thread Léo FERLIN SUTTON
On Tue, Jul 24, 2018 at 4:04 AM, Dennis Lovely  wrote:
> you define the max size of your heap (-Xmx), but you do not define the max
> size of your off-heap (MaxMetaspaceSize on JDK 8, MaxPermSize on JDK 7), so
> you could occupy all of the memory on the instance.

Yes, I think we should set MaxMetaspaceSize.
I am still going to try to find out why the RAM is being used.

> you should also take into account that the memory per thread stack (-Xss)
> is on top of what you define for the heap and off-heap, so the number of
> spawned threads could be a culprit as well if you tune your off-heap size
> and keep seeing the same trouble. I'd figure out approximately how many
> thread stacks are being created, multiply that by 256k, add that to your
> heap size, and subtract that total from the memory available to the host
> to arrive at a proper off-heap size.

I have checked the number of spawned threads (visible with `nodetool
tpstats` or `ps -T -p <pid>`) and it's too low to be the cause of all
the memory consumption. I have about 650 threads, so that's less than 1 GB
of memory.
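The arithmetic behind that estimate can be checked quickly (using the 256 KB
per-stack figure Dennis suggested; the actual -Xss default varies by platform
and JVM):

```shell
# Rough per-thread stack footprint: 650 threads * 256 KB each.
threads=650
stack_kb=256
total_mb=$(( threads * stack_kb / 1024 ))
echo "${total_mb} MB"   # prints "162 MB", far below 1 GB
```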

Thank you for the suggestions Dennis.

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Why and How is Cassandra using all my ram ?

2018-07-24 Thread Léo FERLIN SUTTON
On Mon, Jul 23, 2018 at 11:44 PM, Mark Rose  wrote:
> Hi Léo,
>
> It's possible that glibc is creating too many memory arenas. Are you
> setting/exporting MALLOC_ARENA_MAX to something sane before calling
> the JVM? You can check that in /proc/<pid>/environ.
>

I have checked, and MALLOC_ARENA_MAX is set to `4`.
I will read the documentation about it before trying to change it, to make
sure I understand the consequences.

> I would also turn on -XX:NativeMemoryTracking=summary and use jcmd to
> check out native memory usage from the JVM's perspective.
>

I will try it on a third of the cluster to see if I notice anything!
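For reference, a minimal sketch of enabling and reading native memory
tracking (the flag and the jcmd subcommands are standard HotSpot features;
`<pid>` is a placeholder for the Cassandra process id):

```shell
# 1. Start the JVM with NMT enabled (adds a small overhead):
#      -XX:NativeMemoryTracking=summary
# 2. Query a running process:
jcmd <pid> VM.native_memory summary
# 3. Or take a baseline and diff against it later:
jcmd <pid> VM.native_memory baseline
jcmd <pid> VM.native_memory summary.diff
```

These commands need a live JVM, so treat this as a procedure sketch rather
than a runnable script.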

Thank you Mark for your suggestions.
