certificate pinning feature

2017-06-09 Thread Victor Ashik
Hello,



Is it possible to keep CA certificates in the truststores but still do some kind 
of certificate pinning, i.e. add extra requirements (a matching hostname or 
thumbprint) for certificates to be trusted by Cassandra for internode and/or 
client communication?



The only way I have found to achieve this so far is to keep only the trusted 
certificates in the truststores, with no CA certificates there at all, but this 
requires changing the truststores and restarting nodes whenever a new 
certificate is added.
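For what it's worth, the thumbprint check itself is simple to express. Below is a minimal client-side sketch in Python (the fingerprint value and function name are made up), intended as a layer on top of, not a replacement for, normal CA validation:

```python
import hashlib

# SHA-256 thumbprints of the server certificates we additionally accept.
# (Hypothetical value -- compute yours with: openssl x509 -fingerprint -sha256)
PINNED_SHA256 = {
    "3f1c0d7e5a92b4c6d8e0f1a2b3c4d5e6f708192a3b4c5d6e7f8091a2b3c4d5e6",
}

def is_pinned(der_cert: bytes, pinned=PINNED_SHA256) -> bool:
    """True if the DER-encoded certificate matches one of the pinned thumbprints."""
    return hashlib.sha256(der_cert).hexdigest() in pinned
```

The same check could be done in a custom Java TrustManager wrapping the default one; as far as I know, Cassandra itself does not expose such a hook out of the box.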





--

Regards,

Victor Ashik



Re: Partition range incremental repairs

2017-06-09 Thread Chris Stokesmore
> 
> I can't recommend *anyone* use incremental repair as there's some pretty 
> horrible bugs in it that can cause Merkle trees to wildly mismatch & result 
> in massive overstreaming.  Check out 
> https://issues.apache.org/jira/browse/CASSANDRA-9143
> 
> TL;DR: Do not use incremental repair before 4.0.

Hi Jonathan,

Thanks for your reply, this is a slightly scary message for us! 2.2 has been 
out for nearly 2 years, incremental repairs are the default - and it has 
horrible bugs!?
I guess massive overstreaming, while a performance issue, does not affect data 
integrity.

Are there any plans to backport this to 3.x or, ideally, 2.2?

Chris



> On Tue, Jun 6, 2017 at 9:54 AM Anuj Wadehra  
> wrote:
> Hi Chris,
> 
> Can you share the following info:
> 
> 1. The exact repair commands you use for inc repair and pr repair
> 
> 2. Repair time should be measured at cluster level for inc repair. So, what's 
> the total time it takes to run repair on all nodes for incremental vs pr 
> repairs?
> 
> 3. You are repairing one DC, DC3. How many DCs are there in total, and what's 
> the RF for the keyspaces? Running pr on a specific DC would not repair the 
> entire data.
> 
> 4. 885 ranges? Where did you get this number? Logs? Can you share the number 
> of ranges printed in the logs for both the inc and pr cases?
> 
> 
> Thanks
> Anuj
> 
> 
> Sent from Yahoo Mail on Android 
> 
> On Tue, Jun 6, 2017 at 9:33 PM, Chris Stokesmore wrote:
> Thank you for the excellent and clear description of the different versions 
> of repair Anuj, that has cleared up what I expect to be happening.
> 
> The problem now is that in our cluster we are running repairs with options 
> (parallelism: parallel, primary range: false, incremental: true, job threads: 
> 1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges: 885), and 
> when we do, our repairs are taking over a day to complete, whereas previously, 
> running with the partition range option, they were taking more like 8-9 hours.
> 
> As I understand it, using incremental should have sped this process up, as all 
> three sets of data on each repair job should be marked as repaired; however, 
> this does not seem to be the case. Any ideas?
> 
> Chris
> 
>> On 6 Jun 2017, at 16:08, Anuj Wadehra wrote:
>> 
>> Hi Chris,
>> 
>> Using pr with incremental repairs does not make sense. Primary range repair 
>> is an optimization over full repair. If you run full repair on an n-node 
>> cluster with RF=3, you would be repairing each piece of data three times. 
>> E.g. in a 5 node cluster with RF=3, a range may exist on nodes A, B and C. 
>> When full repair is run on node A, the entire data in that range gets synced 
>> with the replicas on nodes B and C. Now, when you run full repair on nodes B 
>> and C, you are wasting resources repairing data which is already repaired. 
>> 
>> Primary range repair ensures that when you run repair on a node, it ONLY 
>> repairs the data which is owned by the node. Thus, no node repairs data 
>> which is not owned by it and must be repaired by another node. Redundant work 
>> is eliminated. 
>> 
>> Even with pr, each time you run pr on all nodes, you repair 100% of the data. 
>> Why repair the complete data in each cycle, even data which has not changed 
>> since the last repair cycle?
>> 
>> This is where incremental repair comes in as an improvement. Once repaired, 
>> data is marked repaired so that the next repair cycle can focus on just 
>> repairing the delta. Now, let's go back to the example of the 5 node cluster 
>> with RF=3. This time we run incremental repair on all nodes. When you repair 
>> the entire data on node A, all 3 replicas are marked as repaired. Even if you 
>> run inc repair on all ranges on the second node, you would not re-repair the 
>> already repaired data. Thus, there is no advantage to repairing only the 
>> data owned by the node (the primary range of the node). You can run inc 
>> repair on all the data present on a node, and Cassandra will make sure that 
>> when you repair data on the other nodes, you only repair unrepaired data.
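The redundancy described above is easy to see in a toy model (a sketch only, not Cassandra code; it uses the simplified 5-node/RF=3 ring from the example):

```python
NODES = 5  # toy 5-node cluster
RF = 3     # replication factor

def replicas(rng):
    """Range i lives on node i plus the next RF-1 nodes (simplified ring)."""
    return {(rng + k) % NODES for k in range(RF)}

def repairs_per_range(primary_range_only):
    """How many times each range gets repaired if repair runs on every node."""
    counts = {r: 0 for r in range(NODES)}
    for node in range(NODES):
        for r in range(NODES):
            if primary_range_only:
                counts[r] += (node == r)            # -pr: only the owned range
            else:
                counts[r] += (node in replicas(r))  # full: every range the node holds
    return counts
```

Full repair run on every node touches each range RF (=3) times; -pr run on every node touches each range exactly once, which is exactly the redundant work being described.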
>> 
>> Thanks
>> Anuj
>> 
>> 
>> 
>> Sent from Yahoo Mail on Android 
>> 
>> On Tue, Jun 6, 2017 at 4:27 PM, Chris Stokesmore wrote:
>> Hi all,
>> 
>> Wondering if anyone had any thoughts on this? At the moment the long running 
>> repairs cause us to be running them on two nodes at once for a bit of time, 
>> which obviously increases the cluster load.
>> 
>> On 2017-05-25 16:18 (+0100), Chris Stokesmore wrote: 
>> > Hi,
>> > 
>> > We are running a 7 node Cassandra 2.2.8 

Re: Partition range incremental repairs

2017-06-09 Thread Chris Stokesmore
Hi Anuj,

Thanks for the reply.

1) We are using Cassandra 2.2.8, and the repair commands we are comparing are 
"nodetool repair --in-local-dc --partitioner-range” and 
"nodetool repair --in-local-dc”.
Since 2.2, I believe inc repairs are the default - that seems to be confirmed in 
the logs that list the repair details when a repair starts.

2) From looking at a few runs, on average:
with -pr repairs, each node takes approx 6.5-8 hours, so a total over the 7 
nodes of ~53 hours.
With just inc repairs, each node takes ~26-29 hours, so a total of ~193 hours.

3) We currently have two DCs in total: the ‘production’ ring with 7 nodes and 
RF=3, and a testing ring with a single node and RF=1, for the single keyspace 
we currently use.

4) Yeah, that number came from the Cassandra repair logs from an inc repair; I 
can share the numbers reported when using a pr repair later this evening, once 
the currently running repair has completed.
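The cluster-level totals in (2) can be sanity-checked with the midpoints of the quoted per-node ranges:

```python
NODES = 7  # production ring size

pr_per_node = (6.5 + 8) / 2   # hours, midpoint of the quoted 6.5-8 h range
inc_per_node = (26 + 29) / 2  # hours, midpoint of the quoted 26-29 h range

pr_total = pr_per_node * NODES    # ~51 h, consistent with the ~53 h quoted
inc_total = inc_per_node * NODES  # 192.5 h, consistent with the ~193 h quoted
```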


Many thanks for the reply again,

Chris


> On 6 Jun 2017, at 17:50, Anuj Wadehra  wrote:
> 
> Hi Chris,
> 
> Can you share the following info:
> 
> 1. The exact repair commands you use for inc repair and pr repair
> 
> 2. Repair time should be measured at cluster level for inc repair. So, what's 
> the total time it takes to run repair on all nodes for incremental vs pr 
> repairs?
> 
> 3. You are repairing one DC, DC3. How many DCs are there in total, and what's 
> the RF for the keyspaces? Running pr on a specific DC would not repair the 
> entire data.
> 
> 4. 885 ranges? Where did you get this number? Logs? Can you share the number 
> of ranges printed in the logs for both the inc and pr cases?
> 
> 
> Thanks
> Anuj
> 
> 
> Sent from Yahoo Mail on Android 
> 

Re: Huge Batches

2017-06-09 Thread techpyaasa .
Hi Justin,

We have very few columns in the PK (max 2 partition columns, max 2 clustering
columns) and it won't have huge data / a huge number of primary keys.
I just wanted to print the names & values of these columns for huge batches.

PS: we are using C* 2.1

Thanks for reply @Justin and @Akhil

On Fri, Jun 9, 2017 at 5:31 AM, Justin Cameron 
wrote:

> I don't believe the keys within a large batch are logged by Cassandra. A
> large batch could potentially contain tens of thousands of primary keys, so
> this could quickly fill up the logs.
>
> Here are a couple of suggestions:
>
>- Large batches should also be slow, so you could try setting up slow
>query logging in the Java driver and see what gets caught:
>https://docs.datastax.com/en/developer/java-driver/3.2/manual/logging/
>
>- You could write your own custom QueryHandler to log those details on
>the server-side, as described here:
>https://www.slideshare.net/planetcassandra/cassandra-summit-2014-lesser-known-features-of-cassandra-21
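Along the lines of the first suggestion, if driver-level slow query logging isn't available, a client-side timing wrapper is another option (a sketch; the 500 ms threshold and function name are assumptions, and it works with any driver session exposing `execute`):

```python
import logging
import time

SLOW_MS = 500  # hypothetical threshold -- tune to taste

def timed_execute(session, statement, *args, **kwargs):
    """Run a statement through a driver session, warning when it is slow."""
    start = time.perf_counter()
    result = session.execute(statement, *args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if elapsed_ms > SLOW_MS:
        logging.warning("slow query (%.0f ms): %s", elapsed_ms, statement)
    return result
```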
>
> 
>
>
> Cheers,
> Justin
>
> On Thu, 8 Jun 2017 at 18:49 techpyaasa .  wrote:
>
>> Hi,
>>
>> Recently we have been seeing huge batches, with log prints like the one
>> below in the C* logs:
>>
>> Batch of prepared statements for [ks1.cf1] is of size 413350, exceeding
>> specified threshold of 5120 by 362150
>>
>> Along with the column family name (as found in the log print above), we
>> would like to know the partition key and clustering column values (along
>> with their names) too, so that it would be easy to trace the user who is
>> inserting such huge batches.
>>
>> I tried looking at the C* code base, as below, but could not figure out how
>> to get the values of the partition keys and clustering columns. :(
>> Can someone please help me out...
>>
>> public static void verifyBatchSize(Iterable<ColumnFamily> cfs)
>> {
>>     long size = 0;
>>     long warnThreshold = DatabaseDescriptor.getBatchSizeWarnThreshold();
>>
>>     for (ColumnFamily cf : cfs)
>>         size += cf.dataSize();
>>
>>     if (size > warnThreshold)
>>     {
>>         Set<String> ksCfPairs = new HashSet<>();
>>         for (ColumnFamily cf : cfs)
>>         {
>>             ksCfPairs.add(String.format("%s.%s size=%s",
>>                 cf.metadata().ksName, cf.metadata().cfName, cf.dataSize()));
>>             Iterator<CellName> cns = cf.getColumnNames().iterator();
>>             CellName cn = cns.next();
>>             cn.dataSize();
>>         }
>>
>>         String format = "Batch of prepared statements for {} is of size {}, exceeding specified threshold of {} by {}.";
>>         logger.warn(format, ksCfPairs, size, warnThreshold, size - warnThreshold);
>>     }
>> }
>>
>>
>> Thanks
>>
>> TechPyaasa
>>
> --
>
>
> Justin Cameron, Senior Software Engineer
>
>
> 
>
>
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
> and Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>


Re: Reg:- Data Modelling For Hierarchy Data

2017-06-09 Thread @Nandan@
An MV is a good option, but with RF=3 the data will be duplicated again (RF
copies of the base table plus RF copies of the view), which may be unwanted if
we insert a lot of users.
But I think we can go with an MV also.


On Fri, Jun 9, 2017 at 4:41 PM, Jacques-Henri Berthemet <
jacques-henri.berthe...@genesys.com> wrote:

> For query 2) you should have a second table; a secondary index is usually
> not recommended. If you’re planning to use Cassandra 3.x, you should take
> a look at materialized views (MVs):
>
> http://cassandra.apache.org/doc/latest/cql/mvs.html
>
> https://opencredo.com/everything-need-know-cassandra-materialized-views/
>
>
>
> I don’t have experience on MVs, I’m stuck on 2.2 for now.
>
>
>
> Regards,
>
> *--*
>
> *Jacques-Henri Berthemet*
>
>
>
> *From:* @Nandan@ [mailto:nandanpriyadarshi...@gmail.com]
> *Sent:* vendredi 9 juin 2017 10:27
> *To:* Jacques-Henri Berthemet 
> *Cc:* user@cassandra.apache.org
> *Subject:* Re: Reg:- Data Modelling For Hierarchy Data
>
>
>
> Hi,
>
> Yes, I am following with single Users table.
>
> Suppose my query patterns are:-
>
> 1) Select user by email.
>
> 2) Select user by user_type
>
> The 1st query pattern is satisfied by the Users table, but for the second
> query pattern I either have to go with another table like user_by_type, or I
> have to create a secondary index on user_type, by which the client will be
> able to access only Buyer or Seller records.
>
>
>
> Please suggest the best way.
>
> Best Regards.
>
> Nandan
>
>
>
> On Fri, Jun 9, 2017 at 3:59 PM, Jacques-Henri Berthemet <
> jacques-henri.berthe...@genesys.com> wrote:
>
> Hi,
>
>
>
> According to your model a user can only be of one type, so I’d go with a
> very simple model with a single table:
>
>
>
> string email (PK), string user_type, map attributes
>
>
>
> user_type can be Buyer, Master_Seller or Slave_Seller, and all other columns
> go into the attributes map, as long as all of them together don’t exceed 64k;
> but you could create dedicated columns for any attributes that you know will
> always be there.
>
>
>
> *--*
>
> *Jacques-Henri Berthemet*
>
>
>
> *From:* @Nandan@ [mailto:nandanpriyadarshi...@gmail.com]
> *Sent:* vendredi 9 juin 2017 03:14
> *To:* user@cassandra.apache.org
> *Subject:* Reg:- Data Modelling For Hierarchy Data
>
>
>
> Hi,
>
>
>
> I am working on a Music database where we have multiple orders of users of
> our portal. Different categories of users have some common attributes but
> some different attributes, based on their registration.
>
> This becomes a hierarchy pattern. I am attaching one sample hierarchy
> pattern of the User Module, which is part of my current data modeling.
>
>
>
> There are a few conditions:-
>
> 1) email id should be unique, i.e. if a user has registered with one email
> id, then that particular user can't register as another user.
>
> 2) Some types of users have 20-30 columns in their registration, such
> as company, address, email, first_name, join_date etc.
>
>
>
> Query pattern is like:-
>
> 1) select user by email
>
>
>
> Please suggest how to do data modeling for this type of
> hierarchical data.
>
> Should I create a separate table for each type of user, or should
> I go with a single user table?
>
> As we have the unique email id condition, should I go with email id as the
> primary key, or would a user_id UUID be the better choice?
>
>
>
>
>
>
>
> Best regards,
>
> Nandan Priyadarshi
>
>
>


RE: Reg:- Data Modelling For Hierarchy Data

2017-06-09 Thread Jacques-Henri Berthemet
For query 2) you should have a second table; a secondary index is usually not 
recommended. If you’re planning to use Cassandra 3.x, you should take a look at 
materialized views (MVs):
http://cassandra.apache.org/doc/latest/cql/mvs.html
https://opencredo.com/everything-need-know-cassandra-materialized-views/
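For reference, a view for a by-type lookup might be declared roughly like this (a sketch; the `users` table and column names are assumed from this thread, and MVs require the non-null restrictions shown):

```sql
CREATE MATERIALIZED VIEW users_by_type AS
    SELECT * FROM users
    WHERE user_type IS NOT NULL AND email IS NOT NULL
    PRIMARY KEY (user_type, email);
```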

I don’t have experience on MVs, I’m stuck on 2.2 for now.

Regards,
--
Jacques-Henri Berthemet




Re: Reg:- Data Modelling For Hierarchy Data

2017-06-09 Thread @Nandan@
Hi,
Yes, I am following with single Users table.
Suppose my query patterns are:-
1) Select user by email.
2) Select user by user_type
The 1st query pattern is satisfied by the Users table, but for the second
query pattern I either have to go with another table like user_by_type, or I
have to create a secondary index on user_type, by which the client will be
able to access only Buyer or Seller records.

Please suggest the best way.
Best Regards.
Nandan


RE: Reg:- Data Modelling For Hierarchy Data

2017-06-09 Thread Jacques-Henri Berthemet
Hi,

According to your model a user can only be of one type, so I’d go with a very 
simple model with a single table:

string email (PK), string user_type, map attributes

user_type can be Buyer, Master_Seller or Slave_Seller, and all other columns go 
into the attributes map, as long as all of them together don’t exceed 64k; but 
you could create dedicated columns for any attributes that you know will always 
be there.
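As a sketch, that single-table model in CQL (all names assumed) could look like:

```sql
CREATE TABLE users (
    email      text PRIMARY KEY,
    user_type  text,              -- e.g. 'Buyer', 'Master_Seller', 'Slave_Seller'
    attributes map<text, text>    -- per-type registration fields
);

-- query pattern 1: select user by email
SELECT * FROM users WHERE email = ?;
```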

--
Jacques-Henri Berthemet
