Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-19 Thread Andrés de la Peña
>
> > This type of feature is very useful, but it may be easier to analyze
> this proposal if it’s compared with other DDM implementations from other
> databases? Would it be reasonable to add a table to the proposal comparing
> syntax and output from eg Azure SQL vs Cassandra vs whatever ?


Good idea. I have added a section at the end of the document briefly
describing how some other databases deal with data masking, with links
to their documentation on the topic. I am not an expert in any of those
databases, so please take my comments there with a grain of salt.

On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:

> This type of feature is very useful, but it may be easier to analyze this
> proposal if it’s compared with other DDM implementations from other
> databases? Would it be reasonable to add a table to the proposal comparing
> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>
>
> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña 
> wrote:
>
> 
> Hi everyone,
>
> I'd like to start a discussion about this proposal for dynamic data
> masking:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>
> Dynamic data masking allows obscuring sensitive information without
> changing the stored data. It would be based on a set of native CQL
> functions providing different types of masking, such as replacing the
> column value by "". These functions could be used as regular functions
> or attached to table columns with CREATE/ALTER TABLE. There would be a new
> UNMASK permission, so only the users with this permission would be able to
> see the unmasked column values. It would be possible to customize masking
> by using UDFs as masking functions.
>
> Thanks,
>
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-19 Thread Jeff Jirsa
This type of feature is very useful, but it may be easier to analyze this 
proposal if it’s compared with other DDM implementations from other databases? 
Would it be reasonable to add a table to the proposal comparing syntax and 
output from eg Azure SQL vs Cassandra vs whatever ? 


> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña  wrote:
> 
> 
> Hi everyone,
> 
> I'd like to start a discussion about this proposal for dynamic data masking: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
> 
> Dynamic data masking allows obscuring sensitive information without changing
> the stored data. It would be based on a set of native CQL functions providing
> different types of masking, such as replacing the column value by "".
> These functions could be used as regular functions or attached to table
> columns with CREATE/ALTER TABLE. There would be a new UNMASK permission, so
> only the users with this permission would be able to see the unmasked column
> values. It would be possible to customize masking by using UDFs as masking
> functions.
> 
> Thanks,


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-19 Thread Dinesh Joshi
Sounds interesting. I would like to understand a couple of things here. If the
column names are the same for masked and unmasked data, it would impact
existing applications. I am curious what the transition plan looks like for
applications that expect unmasked data.

For example, let’s say you store SSNs and birth dates. Upon enabling this
feature, let’s say the app user is not given the UNMASK permission. Now the app
is receiving masked values for these columns. This is fine for most read-only
applications. However, these columns are often used as primary keys
or as part of primary keys in other tables. This would break existing applications.
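
To make the primary-key concern concrete, here is a minimal Python sketch. Everything in it is hypothetical: the two in-memory tables, the MASK placeholder and the read_ssn/find_account helpers only stand in for a masked column and are not part of the CEP or of any Cassandra API.

```python
# Toy model: a masked column value used as a key into another table.
# All names are hypothetical; this is not Cassandra or CEP-20 behaviour.

MASK = "*****"  # assumed placeholder returned instead of the real value

# "users" stores the sensitive column; "accounts" uses the SSN as its key.
users = {"alice": "123-45-6789"}
accounts = {"123-45-6789": {"balance": 100}}


def read_ssn(user, permissions):
    """Return the stored SSN, or the mask if the caller lacks UNMASK."""
    ssn = users[user]
    return ssn if "UNMASK" in permissions else MASK


def find_account(user, permissions):
    """Look up the account through the SSN the caller is allowed to see."""
    return accounts.get(read_ssn(user, permissions))


with_unmask = find_account("alice", {"SELECT", "UNMASK"})
without_unmask = find_account("alice", {"SELECT"})
```

With UNMASK the lookup finds the account. Without it, the app queries the second table with the masked value and gets nothing back, while the stored data itself is unchanged.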

How would this work in mixed mode when new nodes in the cluster are masking
data and others aren’t? How would it impact the driver?

How would the application learn that the column values are masked? This is
important in case a user has the UNMASK permission and it is later taken away.
Again, this would break a lot of applications.

Dinesh

> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña  wrote:
> 
> 
> Hi everyone,
> 
> I'd like to start a discussion about this proposal for dynamic data masking: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
> 
> Dynamic data masking allows obscuring sensitive information without changing
> the stored data. It would be based on a set of native CQL functions providing
> different types of masking, such as replacing the column value by "".
> These functions could be used as regular functions or attached to table
> columns with CREATE/ALTER TABLE. There would be a new UNMASK permission, so
> only the users with this permission would be able to see the unmasked column
> values. It would be possible to customize masking by using UDFs as masking
> functions.
> 
> Thanks,


Re: Is this an MV bug?

2022-08-19 Thread Benedict
You mean entirely distinct CQL statements issued by the same client 
“concurrently”?

If they’re submitted to the same coordinator then M2 will have a higher 
timestamp than M1, so if M2 applies first then M1 will be a no-op and should 
not generate any view update.

If submitted to different coordinators with server-issued timestamps then 
unless timestamps clash, one of them will win, but it may not be M2.
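
A toy illustration of the timestamp argument, in Python. It assumes last-write-wins reconciliation per cell, where a write only takes effect if its timestamp is higher than what is already stored. The apply helper and the timestamps 1 and 2 are made up for the example, and timestamp clashes are deliberately not modelled.

```python
# Toy last-write-wins model; not Cassandra code.
# A row maps column name -> (value, timestamp).


def apply(row, mutation, timestamp):
    """Apply a mutation in place; return True if any cell changed."""
    changed = False
    for column, value in mutation.items():
        current = row.get(column)
        # Only a strictly newer timestamp replaces the stored cell.
        if current is None or timestamp > current[1]:
            row[column] = (value, timestamp)
            changed = True
    return changed


M1 = {"d": "d", "e": "e'"}   # issued first, timestamp 1
M2 = {"d": "d'", "e": "e"}   # issued second, timestamp 2

# M2 reaches the replica first, then M1 arrives late.
row = {"d": ("d", 0), "e": ("e", 0)}
m2_changed = apply(row, M2, timestamp=2)
m1_changed = apply(row, M1, timestamp=1)

# The same two mutations applied in issue order.
in_order = {"d": ("d", 0), "e": ("e", 0)}
apply(in_order, M1, timestamp=1)
apply(in_order, M2, timestamp=2)
```

In both orders the row ends up with M2's values, and the late M1 changes nothing, so under this model it has no view update to generate.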

> On 19 Aug 2022, at 11:14, Claude Warren, Jr via dev 
>  wrote:
> 
> Perhaps my diagram was not clear.  I am starting with mutations on the base
> table.  I assume they are not bundled together, so they come from separate
> CQL statements.
> 
> On Fri, Aug 19, 2022 at 11:11 AM Claude Warren, Jr  
> wrote:
>> If each mutation comes from a separate CQL they would be separate, no?
>> 
>> 
>> On Fri, Aug 19, 2022 at 10:17 AM Benedict  wrote:
>>> If M1 and M2 both operate over the same partition key they won’t be 
>>> separate mutations, they should be combined into a single mutation before 
>>> submission to SP.mutate
>>> 
>>> > On 19 Aug 2022, at 10:05, Claude Warren, Jr via dev 
>>> >  wrote:
>>> > 
>>> > 
>>> > 
>>> > # Table definitions
>>> > 
>>> > Table [ Primary key ] other data
>>> > base  [ A B C ]       D E
>>> > MV    [ D C ]         A B E
>>> > 
>>> > 
>>> > # Initial  data
>>> > base   -> MV 
>>> > [ a b c ] d e  -> [d c] a b e
>>> > [ a' b c ] d e -> [d c] a' b e
>>> > 
>>> > 
>>> > ## Mutations -> expected outcome
>>> > 
>>> > M1: base [ a b c ] d e'  -> MV [ d c ] a b e'
>>> > M2: base [ a b c ] d' e -> MV [ d' c ] a b e
>>> > 
>>> > ## processing bug
>>> > Assume lock can not be obtained during processing of M1.
>>> > 
>>> > The mutation M1 sleeps to wait for lock. (Trunk Keyspace.java : 601 )
>>> > 
>>> > Assume M2 obtains the lock and executes.
>>> > 
>>> > MV is now 
>>> > [ d' c ] a b e
>>> > 
>>> > M1 then obtains the lock and executes
>>> > 
>>> > MV is now 
>>> > [ d c ] a b e'
>>> > [ d' c] a b e
>>> > 
>>> > base is 
>>> > [ a b c ] d e'
>>> > 
>>> > MV entry "[ d' c ] a b e" is orphaned
>>> > 
>>> >


[DISCUSS] CEP-20: Dynamic Data Masking

2022-08-19 Thread Andrés de la Peña
Hi everyone,

I'd like to start a discussion about this proposal for dynamic data
masking:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking

Dynamic data masking allows obscuring sensitive information without
changing the stored data. It would be based on a set of native CQL
functions providing different types of masking, such as replacing the
column value by "". These functions could be used as regular functions
or attached to table columns with CREATE/ALTER TABLE. There would be a new
UNMASK permission, so only the users with this permission would be able to
see the unmasked column values. It would be possible to customize masking
by using UDFs as masking functions.
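
As a rough illustration of the idea (not the proposed CQL syntax), here is a small Python sketch. The names mask_replace, Column and select are hypothetical, as is the "REDACTED" replacement. The only things taken from the proposal are that masking happens on read, that the stored data is untouched, and that an UNMASK permission reveals the real value.

```python
# Conceptual sketch only; names are hypothetical and not part of CEP-20.
from dataclasses import dataclass
from typing import Callable, Optional


def mask_replace(value: str, replacement: str = "REDACTED") -> str:
    """A masking function that replaces the whole value."""
    return replacement


@dataclass
class Column:
    stored: str                                   # value as stored
    mask: Optional[Callable[[str], str]] = None   # attached masking function


def select(column: Column, permissions: set) -> str:
    """Return the value a user with the given permissions would see."""
    if column.mask is None or "UNMASK" in permissions:
        return column.stored
    return column.mask(column.stored)


ssn = Column(stored="123-45-6789", mask=mask_replace)
seen_by_reader = select(ssn, {"SELECT"})
seen_by_auditor = select(ssn, {"SELECT", "UNMASK"})
```

In this model a user-defined masking function would simply be another callable placed in the mask slot.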

Thanks,


Re: Is this an MV bug?

2022-08-19 Thread Claude Warren, Jr via dev
Perhaps my diagram was not clear.  I am starting with mutations on the base
table.  I assume they are not bundled together, so they come from separate
CQL statements.

On Fri, Aug 19, 2022 at 11:11 AM Claude Warren, Jr 
wrote:

> If each mutation comes from a separate CQL they would be separate, no?
>
>
> On Fri, Aug 19, 2022 at 10:17 AM Benedict  wrote:
>
>> If M1 and M2 both operate over the same partition key they won’t be
>> separate mutations, they should be combined into a single mutation before
>> submission to SP.mutate
>>
>> > On 19 Aug 2022, at 10:05, Claude Warren, Jr via dev <
>> dev@cassandra.apache.org> wrote:
>> >
>> > 
>> >
>> > # Table definitions
>> >
>> > Table [ Primary key ] other data
>> > base  [ A B C ]       D E
>> > MV    [ D C ]         A B E
>> >
>> >
>> > # Initial  data
>> > base   -> MV
>> > [ a b c ] d e  -> [d c] a b e
>> > [ a' b c ] d e -> [d c] a' b e
>> >
>> >
>> > ## Mutations -> expected outcome
>> >
>> > M1: base [ a b c ] d e'  -> MV [ d c ] a b e'
>> > M2: base [ a b c ] d' e -> MV [ d' c ] a b e
>> >
>> > ## processing bug
>> > Assume lock can not be obtained during processing of M1.
>> >
>> > The mutation M1 sleeps to wait for lock. (Trunk Keyspace.java : 601 )
>> >
>> > Assume M2 obtains the lock and executes.
>> >
>> > MV is now
>> > [ d' c ] a b e
>> >
>> > M1 then obtains the lock and executes
>> >
>> > MV is now
>> > [ d c ] a b e'
>> > [ d' c] a b e
>> >
>> > base is
>> > [ a b c ] d e'
>> >
>> > MV entry "[ d' c ] a b e" is orphaned
>> >
>> >
>>
>>


Re: Is this an MV bug?

2022-08-19 Thread Claude Warren, Jr via dev
If each mutation comes from a separate CQL they would be separate, no?


On Fri, Aug 19, 2022 at 10:17 AM Benedict  wrote:

> If M1 and M2 both operate over the same partition key they won’t be
> separate mutations, they should be combined into a single mutation before
> submission to SP.mutate
>
> > On 19 Aug 2022, at 10:05, Claude Warren, Jr via dev <
> dev@cassandra.apache.org> wrote:
> >
> > 
> >
> > # Table definitions
> >
> > Table [ Primary key ] other data
> > base  [ A B C ]       D E
> > MV    [ D C ]         A B E
> >
> >
> > # Initial  data
> > base   -> MV
> > [ a b c ] d e  -> [d c] a b e
> > [ a' b c ] d e -> [d c] a' b e
> >
> >
> > ## Mutations -> expected outcome
> >
> > M1: base [ a b c ] d e'  -> MV [ d c ] a b e'
> > M2: base [ a b c ] d' e -> MV [ d' c ] a b e
> >
> > ## processing bug
> > Assume lock can not be obtained during processing of M1.
> >
> > The mutation M1 sleeps to wait for lock. (Trunk Keyspace.java : 601 )
> >
> > Assume M2 obtains the lock and executes.
> >
> > MV is now
> > [ d' c ] a b e
> >
> > M1 then obtains the lock and executes
> >
> > MV is now
> > [ d c ] a b e'
> > [ d' c] a b e
> >
> > base is
> > [ a b c ] d e'
> >
> > MV entry "[ d' c ] a b e" is orphaned
> >
> >
>
>


Re: Is this an MV bug?

2022-08-19 Thread Benedict
If M1 and M2 both operate over the same partition key they won’t be separate 
mutations, they should be combined into a single mutation before submission to 
SP.mutate

> On 19 Aug 2022, at 10:05, Claude Warren, Jr via dev 
>  wrote:
> 
> 
> 
> # Table definitions
> 
> Table [ Primary key ] other data
> base  [ A B C ]       D E
> MV    [ D C ]         A B E
> 
> 
> # Initial  data
> base   -> MV 
> [ a b c ] d e  -> [d c] a b e
> [ a' b c ] d e -> [d c] a' b e
> 
> 
> ## Mutations -> expected outcome
> 
> M1: base [ a b c ] d e'  -> MV [ d c ] a b e'
> M2: base [ a b c ] d' e -> MV [ d' c ] a b e
> 
> ## processing bug
> Assume lock can not be obtained during processing of M1.
> 
> The mutation M1 sleeps to wait for lock. (Trunk Keyspace.java : 601 )
> 
> Assume M2 obtains the lock and executes.
> 
> MV is now 
> [ d' c ] a b e
> 
> M1 then obtains the lock and executes
> 
> MV is now 
> [ d c ] a b e'
> [ d' c] a b e
> 
> base is 
> [ a b c ] d e'
> 
> MV entry "[ d' c ] a b e" is orphaned
> 
> 



Is this an MV bug?

2022-08-19 Thread Claude Warren, Jr via dev
# Table definitions

Table [ Primary key ] other data
base  [ A B C ]       D E
MV    [ D C ]         A B E


# Initial  data
base   -> MV
[ a b c ] d e  -> [d c] a b e
[ a' b c ] d e -> [d c] a' b e


## Mutations -> expected outcome

M1: base [ a b c ] d e'  -> MV [ d c ] a b e'
M2: base [ a b c ] d' e -> MV [ d' c ] a b e

## processing bug
Assume lock can not be obtained during processing of M1.

The mutation M1 sleeps to wait for lock. (Trunk Keyspace.java : 601 )

Assume M2 obtains the lock and executes.

MV is now
[ d' c ] a b e

M1 then obtains the lock and executes

MV is now
[ d c ] a b e'
[ d' c] a b e

base is
[ a b c ] d e'

MV entry "[ d' c ] a b e" is orphaned
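
The end state above can be checked mechanically. The following Python sketch is only a consistency check over the rows listed in this message, using the layout from the table definitions (base key [ A B C ], view key [ D C ]). The names expected_view and orphans are made up here, and nothing in it models locking or how Cassandra actually builds view updates.

```python
# Consistency check over the example rows; not Cassandra code.


def expected_view(base_rows):
    """Derive the view rows (D, C, A, B, E) implied by base rows (A, B, C, D, E)."""
    return {(d, c, a, b, e) for (a, b, c, d, e) in base_rows}


def orphans(base_rows, view_rows):
    """View rows that no current base row accounts for."""
    return set(view_rows) - expected_view(base_rows)


# Final state as listed in the message.
base = {("a", "b", "c", "d", "e'")}
view = {
    ("d", "c", "a", "b", "e'"),
    ("d'", "c", "a", "b", "e"),
}

orphaned = orphans(base, view)
```

Run against the final state, the check reports the single view row keyed by [ d' c ], which has no matching base row.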


Re: [Proposal] add pull request template

2022-08-19 Thread Claude Warren, Jr via dev
Since there seems to be agreement, I opened a ticket (CASSANDRA-17837) and
a pull request (https://github.com/apache/cassandra/pull/1799) so that
the final text can be hashed out and accepted.

I also used the proposed template as the text of the pull request so that
it can be seen in all its glory.

On Thu, Aug 18, 2022 at 9:10 PM Josh McKenzie  wrote:

> I have never seen this
> kind of git merging strategy elsewhere, I am not sure if I am not
> experienced enough or we are truly unique the way we do things.
>
> I am very fond of this project and this community. THAT SAID ;) you could
> replace "kind of git merging strategy" with a lot of different things and
> have it equally apply on this project.
>
> Perils of being a mature long-lived project I suspect. I'm all for us
> doing the hard work of introspecting on how we do things and changing them
> to improve or match industry standards where applicable.
>
> On Thu, Aug 18, 2022, at 3:33 PM, Stefan Miklosovic wrote:
>
> Interesting, thanks for explicitly writing that down. I humbly think
> the CI and the convenience of the GitHub workflow is ultimately
> secondary when it comes to the code-base as such. Indeed, nice to
> have, but if it turns out to be uncomfortable in other ways, I guess
> we just have to live with what we have. TBH I have never seen this
> kind of git merging strategy elsewhere, I am not sure if I am not
> experienced enough or we are truly unique the way we do things.
> However, it does make sense.
>
> On Thu, 18 Aug 2022 at 21:28, Benedict 
> wrote:
> >
> > The benefits being extolled involve people setting up GitHub bots to
> integrate with PRs to run CI etc, which will require some non-trivial
> investment by somebody to put together
> >
> > The alternative merge strategy being discussed is not to merge, but to
> instead cherry-pick or rebase. This means we can produce separate PRs for
> each branch, that can be merged independently via the GitHub API. The
> downside of this is that there are no merge commits, while one upside of
> this is that there are no merge commits.
> >
> > On 18 Aug 2022, at 20:20, Stefan Miklosovic <
> stefan.mikloso...@instaclustr.com> wrote:
> >
> > No chicken-egg to me. All it takes is ctrl+c & ctrl+v on your merging
> > commits. What would a new merging strategy actually look like? I am all
> > ears. This seems to be quite nice as is if we stick to being more verbose
> > about what we did.
> >
> > On Thu, 18 Aug 2022 at 20:27, Benedict  wrote:
> >
> >
> > Was it?
> >
> >
> > I mean, we’ve all (or most) I think worked on projects with those
> things, so we all know what the benefits are?
> >
> >
> > It’s fair to point out that we don’t have it even running for any branch
> yet. However there’s perhaps a chicken-and-egg situation, where I’m unsure
> the investment to develop can be justified by those who are able, if
> there’s a chance it will be discarded? I can’t see us maintaining a
> bifurcated process, where some patches go through automation and others
> don’t, so if we don’t change the merge strategy that work would presumably
> end up wasted.
> >
> >
> > On 18 Aug 2022, at 18:53, Mick Semb Wever  wrote:
> >
> >
> > 
> >
> >
> > That debatable benefit aside, not doing merge commits would also open up
> options for us to use PR's for merges and integrate running CI, and
> blocking on clean CI, pre-merge. Which has some other pretty big benefits.
> :)
> >
> >
> >
> >
> > The past agreement IIRC was to start doing those things on trunk-only so
> we can evaluate them for real.
>
>
>