Re: [DISCUSS] CEP-20: Dynamic Data Masking

Andrés de la Peña Mon, 22 Aug 2022 03:21:03 -0700

>
> Maybe a small improvement is the redacted value could be of the form
> `XXX1...1000` meaning XXX followed by a rand number from 1 to 1000: XXX54,
> XXX998, XXX456,... Some randomness would prevent some apps flattening all
> rows to a single XXX'ed one, giving a more realistic redacted data
> distribution/structure.



I'm not sure I understand why that would be useful. Why would random
suffixes give us a more realistic redacted data distribution? If we want to
avoid always returning the same value, we could use a function that just
returns a random value, without the XXXX part, so we can use any data
type. Microsoft's SQL Server and Azure SQL have such a function among their
masking functions.
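
To illustrate the kind of function I mean, here is a minimal Python sketch
(not CQL, and not the CEP's implementation). The name `mask_random`, the set
of supported types and the value ranges are all hypothetical, chosen only for
illustration:

```python
import random
import string
from datetime import date, timedelta


def mask_random(value, rng=random):
    """Return a random value of the same type as `value` (hypothetical helper).

    The original value is ignored except for its type and, for strings, its
    length. The ranges below are arbitrary assumptions.
    """
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return rng.random() < 0.5
    if isinstance(value, int):
        return rng.randint(-2**31, 2**31 - 1)
    if isinstance(value, float):
        return rng.random()
    if isinstance(value, str):
        return "".join(rng.choice(string.ascii_letters) for _ in range(len(value)))
    if isinstance(value, date):
        return date(1970, 1, 1) + timedelta(days=rng.randint(0, 20000))
    raise TypeError(f"no random mask defined for {type(value).__name__}")


# Each call draws a fresh value, so rows are unlikely to collapse into one.
print(mask_random("123-45-6789"))
print(mask_random(42))
print(mask_random(date(1990, 5, 17)))
```

Since the result is unrelated to the stored value, this would still break any
key or cross-referencing, just like the fixed "XXXX" replacement.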

Nevertheless, it would be quite easy to keep adding new masking functions
when we need them.

On Mon, 22 Aug 2022 at 06:52, Berenguer Blasi <[email protected]>
wrote:

> Maybe a small improvement is the redacted value could be of the form
> `XXX1...1000` meaning XXX followed by a rand number from 1 to 1000: XXX54,
> XXX998, XXX456,... Some randomness would prevent some apps flattening all
> rows to a single XXX'ed one, giving a more realistic redacted data
> distribution/structure.
>
> I am not sure about its value either, as that would still break any key
> or other cross-referencing.
>
> My 2cts.
> On 22/8/22 1:30, Andrés de la Peña wrote:
>
> > If the column names are the same for masked and unmasked data, it would
>> impact existing applications. I am curious what the transition plan looks
>> like for applications that expect unmasked data?
>
> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>> feature, let’s say the app user is not given the UNMASK permission. Now the
>> app is receiving masked values for these columns. This is fine for most
>> read only applications. However, a lot of times these columns may be used
>> as primary keys or part of primary keys in other tables. This would break
>> existing applications.
>> How would this work in mixed mode when some nodes in the cluster are
>> masking data and others aren’t? How would it impact the driver?
>> How would the application learn that the column values are masked? This
>> is important in case a user has the UNMASK permission and it is later taken
>> away. Again this would break a lot of applications.
>
>
> Changing the masking of a column is a schema change, and as such it can be
> risky for existing applications. However, unlike deleting a column
> or revoking a SELECT permission, suddenly activating masking might go
> undetected by existing applications.
>
> Applications developed after the introduction of this feature can check
> the table schema to know if a column is masked or not. We can even add a
> specific system view to ease this, if we think it's worth it. However,
> administrators should not activate masking when there could be applications
> that are not aware of the feature. We should be clear about this in the
> documentation.
>
> This is the way data masking seems to work in the databases I've checked.
> I also thought that we could just change the name of the column when it's
> masked to something like "masked(column_name)", as is discussed in the CEP
> document. This would make it impossible to miss that a column is masked.
> However, applications should be prepared to use different column names when
> reading result sets, depending on whether the data is masked for them or
> not. None of the databases mentioned in the "other databases" section of
> the CEP does this kind of column renaming, so it might be a kind of exotic
> behaviour. wdyt?
>
> On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña <[email protected]>
> wrote:
>
>> > This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>
>>
>> Good idea. I have added a section at the end of the document briefly
>> describing how some other databases deal with data masking, and with links
>> to their documentation for the topic. I am not an expert in any of those
>> databases, so please take my comments there with a grain of salt.
>>
>> On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa <[email protected]> wrote:
>>
>>> This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>>
>>>
>>> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña <[email protected]>
>>> wrote:
>>>
>>> 
>>> Hi everyone,
>>>
>>> I'd like to start a discussion about this proposal for dynamic data
>>> masking:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>>>
>>> Dynamic data masking allows obscuring sensitive information without
>>> changing the stored data. It would be based on a set of native CQL
>>> functions providing different types of masking, such as replacing the
>>> column value with "XXXX". These functions could be used as regular functions
>>> or attached to table columns with CREATE/ALTER TABLE. There would be a new
>>> UNMASK permission, so only users with this permission would be able to
>>> see the unmasked column values. It would be possible to customize masking
>>> by using UDFs as masking functions.
>>>
>>> Thanks,
>>>
>>>
