Re: [DISCUSS] CEP-20: Dynamic Data Masking

Andrés de la Peña Mon, 22 Aug 2022 05:11:13 -0700

>
> Isn't there an assumption here that encryption can not be used?  Would we
> not be better served to build in an encryption strategy that keeps the data
> encrypted until the user shows permissions to decrypt, like the unmask
> property?  An encryption strategy that can work within the Cassandra
> internals?
> I think that issue is that there are some data fields that should not be
> discoverable by unauthorized users/systems, and I think this solution masks
> that issue.  I fear that this capability will be seized upon by pointy
> haired managers as a cheaper alternative to encryption, regardless of the
> warnings otherwise, and that as a whole will harm the Cassandra ecosystem.



Data encryption, access permissions and data masking are different
solutions to different problems. We don't have to choose between them, and
indeed we should aim to support the three of them at some point. None of
these features impedes the implementation of the others. Actually, it is
quite common for popular databases to provide all of them.

Data encryption should protect the data from anyone who has direct
access to the data files, such as sstables, commitlog, etc. It offers
protection outside the interfaces of the database. Of course there is also
encryption of communications.

Permissions should completely prevent the access of unauthorized users to
the data within the database interface. Currently we have permissions on
CQL at the keyspace and table level, but we are missing column-level
permissions.

Data masking obfuscates all or part of the data without totally forbidding
access to it. The key here is that the masked data can still contain parts
of the original information, or be representative enough. For example,
masking can obfuscate all the digits of a credit card number except the
last four, so the clear digits can be used for some degree of
identification. As another example, a masking function returning a hash of
the value would allow joining the masked data of different sources without
exposing it.
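
To make the two examples above a bit more concrete, here is a minimal Python
sketch. The names `mask_all_but_last` and `mask_hash` are hypothetical and
chosen only for this illustration; they are not the CQL functions proposed in
the CEP, and the actual masking would be applied by the database rather than
by client code.

```python
import hashlib


def mask_all_but_last(value: str, keep_last: int = 4, pad: str = "*") -> str:
    """Replace every character except the last `keep_last` with `pad`."""
    if keep_last <= 0:
        return pad * len(value)
    # Number of leading characters to hide (never negative for short values).
    hidden = max(len(value) - keep_last, 0)
    return pad * hidden + value[hidden:]


def mask_hash(value: str) -> str:
    """Return a SHA-256 hex digest; equal inputs always give equal outputs."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


card = "4111111111111111"  # well-known test number, not a real card
print(mask_all_but_last(card))  # ************1111

# The same value coming from two different sources yields the same masked
# value, so the masked columns can still be matched against each other.
print(mask_hash(card) == mask_hash("4111111111111111"))  # True
```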

An example of how data masking and permissions can be used together could
be a company storing the social security numbers (SSN) of its customers.
The accounting team might need full access to the stored SSNs. Employees
attending phone calls might need to ask for the last two digits of SSN for
identification purposes, so they would need masked access. The rest of the
organization would need no access at all.
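
The three levels of access in this example could be sketched roughly as
follows. This is only an illustration in Python: the role names, the
`ROLE_ACCESS` mapping and the `read_ssn` function are all hypothetical, and
in practice the decision would be made by the database based on the
permissions of the user, not by application code.

```python
# Hypothetical access levels, used only for this illustration.
FULL, MASKED, NONE = "full", "masked", "none"

# Hypothetical mapping of roles to access levels; any other role gets NONE.
ROLE_ACCESS = {
    "accounting": FULL,      # needs the full SSN
    "call_center": MASKED,   # only needs the last two digits
}


def read_ssn(ssn: str, role: str) -> str:
    """Return the SSN as the given role is allowed to see it."""
    access = ROLE_ACCESS.get(role, NONE)
    if access == FULL:
        return ssn
    if access == MASKED:
        # Hide everything except the last two characters.
        return "*" * (len(ssn) - 2) + ssn[-2:]
    raise PermissionError(f"role {role!r} has no access to SSNs")


ssn = "123-45-6789"  # made-up value
print(read_ssn(ssn, "accounting"))   # 123-45-6789
print(read_ssn(ssn, "call_center"))  # *********89
```

Any role that is not listed, such as the rest of the organization, gets a
`PermissionError` instead of a value.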

This CEP focuses exclusively on data masking, but there is no reason not to
start parallel work on other related-but-different features like
column-level permissions or on-disk data encryption.




On Mon, 22 Aug 2022 at 07:05, Claude Warren, Jr via dev <
dev@cassandra.apache.org> wrote:

> I am more interested in the motivation where it is stated:
>
> Many users have the need of masking sensitive data, such as contact info,
>> age, gender, credit card numbers, etc. Dynamic data masking (DDM) allows to
>> obscure sensitive information while still allowing access to the masked
>> columns, and without changing the stored data.
>
>
> There is an unspoken assumption that the stored data format can not be
> changed.  It feels like this solution is starting from a false premise.
> Throughout the document there are guard statements about how this does not
> replace encryption.  Isn't there an assumption here that encryption can not
> be used?  Would we not be better served to build in an encryption strategy
> that keeps the data encrypted until the user shows permissions to decrypt,
> like the unmask property?  An encryption strategy that can work within the
> Cassandra internals?
>
> I think that issue is that there are some data fields that should not be
> discoverable by unauthorized users/systems, and I think this solution masks
> that issue.  I fear that this capability will be seized upon by pointy
> haired managers as a cheaper alternative to encryption, regardless of the
> warnings otherwise, and that as a whole will harm the Cassandra ecosystem.
>
> Yes, encryption is more difficult to implement and will take longer, but
> this feels like a sticking plaster that distracts from that underlying
> issue.
>
> my 0.02
>
> On Mon, Aug 22, 2022 at 12:30 AM Andrés de la Peña <adelap...@apache.org>
> wrote:
>
>> > If the column names are the same for masked and unmasked data, it would
>>> impact existing applications. I am curious what the transition plan look
>>> like for applications that expect unmasked data?
>>
>> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>>> feature, let’s say the app user is not given the UNMASK permission. Now the
>>> app is receiving masked values for these columns. This is fine for most
>>> read only applications. However, a lot of times these columns may be used
>>> as primary keys or part of primary keys in other tables. This would break
>>> existing applications.
>>> How would this work in mixed mode when new nodes in the cluster are
>>> masking data and others aren’t? How would it impact the driver?
>>> How would the application learn that the column values are masked? This
>>> is important in case a user has UNMASK permission and then later taken
>>> away. Again this would break a lot of applications.
>>
>>
>> Changing the masking of a column is a schema change, and as such it can
>> be risky for existing applications. However, unlike deleting a column or
>> revoking a SELECT permission, suddenly activating masking might go
>> undetected by existing applications.
>>
>> Applications developed after the introduction of this feature can check
>> the table schema to know if a column is masked or not. We can even add a
>> specific system view to ease this, if we think it's worth it. However,
>> administrators should not activate masking when there could be applications
>> that are not aware of the feature. We should be clear about this in the
>> documentation.
>>
>> This is the way data masking seems to work in the databases I've checked.
>> I also thought that we could just change the name of the column when it's
>> masked to something like "masked(column_name)", as discussed in the CEP
>> document. This would make it impossible to miss that a column is masked.
>> However, applications would have to be prepared to use different column
>> names when reading result sets, depending on whether the data is masked
>> for them or not. None of the databases mentioned in the "other databases"
>> section of the CEP does this kind of column renaming, so it might be a
>> kind of exotic behaviour. wdyt?
>>
>> On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña <adelap...@apache.org>
>> wrote:
>>
>>> > This type of feature is very useful, but it may be easier to analyze
>>>> this proposal if it’s compared with other DDM implementations from other
>>>> databases? Would it be reasonable to add a table to the proposal comparing
>>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>>
>>>
>>> Good idea. I have added a section at the end of the document briefly
>>> describing how some other databases deal with data masking, and with links
>>> to their documentation for the topic. I am not an expert in any of those
>>> databases, so please take my comments there with a grain of salt.
>>>
>>> On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> This type of feature is very useful, but it may be easier to analyze
>>>> this proposal if it’s compared with other DDM implementations from other
>>>> databases? Would it be reasonable to add a table to the proposal comparing
>>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>>>
>>>>
>>>> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña <adelap...@apache.org>
>>>> wrote:
>>>>
>>>> 
>>>> Hi everyone,
>>>>
>>>> I'd like to start a discussion about this proposal for dynamic data
>>>> masking:
>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>>>>
>>>> Dynamic data masking allows obscuring sensitive information without
>>>> changing the stored data. It would be based on a set of native CQL
>>>> functions providing different types of masking, such as replacing the
>>>> column value by "XXXX". These functions could be used as regular functions
>>>> or attached to table columns with CREATE/ALTER table. There would be a new
>>>> UNMASK permission, so only the users with this permission would be able to
>>>> see the unmasked column values. It would be possible to customize masking
>>>> by using UDFs as masking functions.
>>>>
>>>> Thanks,
>>>>
>>>>
