Re: [DISCUSS] CEP-20: Dynamic Data Masking

Andrés de la Peña Wed, 24 Aug 2022 06:32:38 -0700

Where does MySQL suggest that? As far as I can tell, MySQL only offers a set of
functions for masking. I can't see a way to force users or tables to use
those functions, and it is up to the users to use those functions or not. I'm
reading this documentation
<https://dev.mysql.com/doc/refman/8.0/en/data-masking.html>.


As for broadening the scope of the proposal to prevent malicious users from
inferring the masked data, I guess that the additional rule would simply be
that a user with READ but not UNMASK permissions cannot use masked columns
in WHERE or IF clauses. That would include both SELECT and UPDATE
statements. That would differentiate us from many popular databases out
there, where data masking is usually a simpler thing.
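
To make that rule concrete, here is a minimal sketch in Python. It is not
Cassandra code: the `Statement` class, the `check_masked_column_access`
function and the permission strings are hypothetical names used only to
illustrate the check, assuming the parser already knows which columns a
statement restricts.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Hypothetical, simplified view of a parsed SELECT or UPDATE."""
    kind: str                                         # "SELECT" or "UPDATE"
    where_columns: set = field(default_factory=set)   # columns restricted in WHERE
    if_columns: set = field(default_factory=set)      # columns used in IF conditions


def check_masked_column_access(stmt, masked_columns, permissions):
    """Reject statements that filter on masked columns without UNMASK.

    `permissions` is the set of permission names held by the user;
    the names are illustrative, not an existing API.
    """
    if "UNMASK" in permissions:
        return  # users who can see the clear data may filter on it
    restricted = (stmt.where_columns | stmt.if_columns) & set(masked_columns)
    if restricted:
        raise PermissionError(
            f"{stmt.kind} restricts masked column(s) {sorted(restricted)} "
            "without the UNMASK permission"
        )
```

With `date_of_birth` masked, a user holding only READ could still run a
SELECT restricted on `year_of_birth`, while an UPDATE with an IF condition on
`date_of_birth` would be rejected unless the user also holds UNMASK.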

On Wed, 24 Aug 2022 at 14:08, Benedict <bened...@apache.org> wrote:

> I can’t tell for sure, but the documentation on Postgres’ feature suggests
> to me that it does apply the masking to all possible uses of the data,
> including joining and querying.
>
> Snowflake’s documentation explicitly says that it does.
>
> MySQL’s documentation suggests that it does this.
>
> Oracle, AWS and MS SQL do not.
>
> My inclination would be to - at least by default - forbid querying on
> columns that are masked, unless the mask permits it.
>
>
> On 24 Aug 2022, at 11:06, Andrés de la Peña <adelap...@apache.org> wrote:
>
> 
> Here are the names of the feature in some databases out there, errors and
> omissions excepted:
>
>    - Microsoft SQL Server / Azure SQL: Dynamic data masking
>    - MySQL: Enterprise data masking and de-identification
>    - PostgreSQL: Dynamic masking
>    - MongoDB: Data masking
>    - IBM Db2: Masks
>    - Oracle: Redaction
>    - MariaDB/MaxScale: Data masking
>    - Snowflake: Dynamic data masking
>
>
> On Wed, 24 Aug 2022 at 10:40, Benedict <bened...@apache.org> wrote:
>
>> Right, but we get to decide how we offer such features and what we call
>> them. I can’t imagine a good reason to call this a masking feature,
>> especially one that applies differentially to certain users, when it is
>> trivial to unmask.
>>
>> I’m ok offering a feature called “default formatter” or something that
>> applies some UDF to a field before returning to the client, and if users
>> wish to “mask” their data in this way that’s fine. But calling it a data
>> mask when it is trivial to circumvent is IMO dangerous, and I’d at least
>> want to see evidence that all other equivalent features in the industry are
>> similarly poorly named and offer similarly poor protection.
>>
>> On 24 Aug 2022, at 09:50, Benjamin Lerer <ble...@apache.org> wrote:
>>
>> 
>>
>>> The PCI DSS Standard v4_0
>>> <https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Standard/PCI-DSS-v4_0.pdf>
>>>  requires
>>> that credit card numbers stored on the system must be "rendered
>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>> numbers.
>>
>>
>> My point was simply that Dynamic Data Masking, like any other feature,
>> makes sense for some scenarios but not for others. I apologise
>> if my example was a bad one.
>>
>> On Wed, 24 Aug 2022 at 10:36, Claude Warren, Jr via dev <
>> dev@cassandra.apache.org> wrote:
>>
>>> This change appears to be looking at two aspects:
>>>
>>>    1. Add metadata to columns
>>>    2. Add functionality based on the metadata.
>>>
>>> If the system had a generic user defined metadata and the ability to
>>> define filter functions at the point where data are being returned to the
>>> client it would be possible for users to implement this filter, or any other
>>> filter on the data.
>>>
>>> The concept of user defined metadata and filters could be applied to
>>> other parts of the system as well.  For example, if the metadata were
>>> accessible from UDFs the metadata could be used in low level filters to
>>> remove rows from queries before they were returned.
>>>
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr <
>>> claude.war...@aiven.io> wrote:
>>>
>>>> The PCI DSS Standard v4_0
>>>> <https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Standard/PCI-DSS-v4_0.pdf>
>>>>  requires
>>>> that credit card numbers stored on the system must be "rendered
>>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>>> numbers.  In fact, for any critically sensitive data this is not an
>>>> appropriate solution.  However, there seems to be agreement that it is
>>>> appropriate for obfuscating some data in some queries by some users.
>>>>
>>>>
>>>>
>>>> On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer <b.le...@gmail.com>
>>>> wrote:
>>>>
>>>>>> Is it typical for a masking feature to make no effort to prevent
>>>>>> unmasking? I’m just struggling to see the value of this without such
>>>>>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>>>>>> renaming the feature IMO
>>>>>
>>>>>
>>>>> The security that Dynamic Data Masking brings depends on how
>>>>> you make use of the feature. It is somewhat the same as with passwords. If you
>>>>> use a weak password it does not bring much security.
>>>>> Masking a field like people's gender is useless because you will be
>>>>> able to determine its value in one query. On the other hand masking credit
>>>>> card numbers makes a lot of sense as it will complicate the life of the
>>>>> person trying to have access to it and the queries needed to reach the
>>>>> information will leave some clear traces in the audit log.
>>>>>
>>>>> Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good
>>>>> way to protect sensitive data like credit card numbers or passwords.
>>>>>
>>>>>
>>>>> On Wed, 24 Aug 2022 at 09:40, Benedict <bened...@apache.org> wrote:
>>>>>
>>>>>> Is it typical for a masking feature to make no effort to prevent
>>>>>> unmasking? I’m just struggling to see the value of this without such
>>>>>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>>>>>> renaming the feature IMO
>>>>>>
>>>>>> On 23 Aug 2022, at 21:27, Andrés de la Peña <adelap...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> 
>>>>>> As mentioned in the CEP document, dynamic data masking doesn't try to
>>>>>> prevent malicious users with SELECT permissions from indirectly guessing
>>>>>> the real value of a masked column. This can easily be done by just trying
>>>>>> values in the WHERE clause of SELECT queries. DDM would not be a
>>>>>> replacement for proper column-level permissions.
>>>>>>
>>>>>> The data served by the database is usually consumed by applications
>>>>>> that present this data to end users. These end users are not necessarily
>>>>>> the users directly connecting to the database. With DDM, it would be easy
>>>>>> for applications to mask sensitive data that is going to be consumed by the
>>>>>> end users. However, the users directly connecting to the database should be
>>>>>> trusted, provided that they have the right SELECT permissions.
>>>>>>
>>>>>> In other words, DDM doesn't directly protect the data, but it eases
>>>>>> the production of protected data.
>>>>>>
>>>>>> That said, we could later go one step further and add a way to prevent
>>>>>> untrusted users from inferring the masked data. That could be done by adding a
>>>>>> new permission required to use certain columns in WHERE clauses, different
>>>>>> from the current SELECT permission. That would play especially well with
>>>>>> column-level permissions, which is something that we still have pending.
>>>>>>
>>>>>> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz <aaronplo...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>> Applying this should prevent querying on a field, else you could
>>>>>>>> leak its contents, surely?
>>>>>>>>
>>>>>>>
>>>>>>> In theory, yes.  Although I could see folks doing something like
>>>>>>> this:
>>>>>>>
>>>>>>> SELECT COUNT(*) FROM patients
>>>>>>> WHERE year_of_birth = 2002
>>>>>>> AND date_of_birth >= '2002-04-01'
>>>>>>> AND date_of_birth < '2002-11-01';
>>>>>>>
>>>>>>> In this case, the rows containing the masked key column(s) could be
>>>>>>> filtered on without revealing the actual data.  But again, that's probably
>>>>>>> better for a "phase 2" of the implementation.
>>>>>>>
>>>>>>>> Agreed on not being a queryable field. That would also preclude
>>>>>>>> secondary indexing, right?
>>>>>>>
>>>>>>>
>>>>>>> Yes, that's my thought as well.
>>>>>>>
>>>>>>> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker <
>>>>>>> de...@chen-becker.org> wrote:
>>>>>>>
>>>>>>>> Agreed on not being a queryable field. That would also preclude
>>>>>>>> secondary indexing, right?
>>>>>>>>
>>>>>>>> On Tue, Aug 23, 2022 at 11:20 AM Benedict <bened...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Applying this should prevent querying on a field, else you could
>>>>>>>>> leak its contents, surely? This pretty much prohibits using it in a
>>>>>>>>> clustering key, and a partition key with the ordered partitioner - but
>>>>>>>>> probably also a hashed partitioner since we do not use a cryptographic hash
>>>>>>>>> and the hash function is well defined.
>>>>>>>>>
>>>>>>>>> We probably also need to ensure that any ALLOW FILTERING queries
>>>>>>>>> on such a field are disabled.
>>>>>>>>>
>>>>>>>>> Plausibly the data could be cryptographically jumbled before using
>>>>>>>>> it in a primary key component (or permitting filtering), but it is probably
>>>>>>>>> easier and safer to exclude for now…
>>>>>>>>>
>>>>>>>>> On 23 Aug 2022, at 18:13, Aaron Ploetz <aaronplo...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>> Some thoughts on this one:
>>>>>>>>>
>>>>>>>>> In a prior job, we'd give app teams access to a single keyspace,
>>>>>>>>> and two roles: a read-write role and a read-only role.  In some cases, a
>>>>>>>>> "privileged" application role was also requested.  Depending on the
>>>>>>>>> requirements, I could see the UNMASK permission being applied to the RW or
>>>>>>>>> privileged roles.  But if there's a problem on the table and the operators
>>>>>>>>> go in to investigate, they will likely use a SUPERUSER account, and they'll
>>>>>>>>> see that data.
>>>>>>>>>
>>>>>>>>> How hard would it be for SUPERUSERs to *not* automatically get the
>>>>>>>>> UNMASK permission?
>>>>>>>>>
>>>>>>>>> I'll also echo the concerns around masking primary key
>>>>>>>>> components.  It's highly likely that certain personal data properties would
>>>>>>>>> be used as a partition or clustering key (ex: range query for people born
>>>>>>>>> within a certain timeframe).  In addition to the "breaks existing" concern,
>>>>>>>>> I'm curious about the challenges around getting that to work with the
>>>>>>>>> current primary key implementation.
>>>>>>>>>
>>>>>>>>> Does this first implementation only apply to payload (non-key)
>>>>>>>>> columns?  The examples in the CEP currently do not show primary key
>>>>>>>>> components being masked.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Aaron
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Aug 23, 2022 at 6:44 AM Henrik Ingo <
>>>>>>>>> henrik.i...@datastax.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Tue, Aug 23, 2022 at 1:10 PM Andrés de la Peña <
>>>>>>>>>> adelap...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>>> One thought: The way the CEP is currently written, it is only
>>>>>>>>>>>> possible to mask a column one way. You can only define one masking function
>>>>>>>>>>>> for a column, and since you use the original column name, you could only
>>>>>>>>>>>> return one version of it in the result set, even if you had a way to define
>>>>>>>>>>>> several functions.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Right, it's one single type of mapping per column, declared
>>>>>>>>>>> on CREATE/ALTER TABLE statements. Also, users can manually specify their
>>>>>>>>>>> own masking function in SELECT statements if they have permissions for
>>>>>>>>>>> seeing the clear data.
>>>>>>>>>>>
>>>>>>>>>>> For those cases where the data is automatically masked for an
>>>>>>>>>>> unprivileged user, I don't see the use of including different types of
>>>>>>>>>>> masking for the same column in the same result set. Instead, we might be
>>>>>>>>>>> interested in having different types of masking associated with different
>>>>>>>>>>> roles. We could do so with dedicated CREATE/DROP/LIST MASK statements,
>>>>>>>>>>> instead of using the CREATE/ALTER/DESCRIBE TABLE statements. That CREATE
>>>>>>>>>>> MASK statement would associate a masking function with a column and role.
>>>>>>>>>>> However, I'm not sure we need that type of granularity instead of the
>>>>>>>>>>> simplicity of attaching the masking to the column declaration. wdyt?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> My gut feeling likewise is that this adds complexity but little
>>>>>>>>>> value.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Henrik Ingo
>>>>>>>>>>
>>>>>>>>>> +358 40 569 7354 <358405697354>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> +---------------------------------------------------------------+
>>>>>>>> | Derek Chen-Becker                                             |
>>>>>>>> | GPG Key available at https://keybase.io/dchenbecker and       |
>>>>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>>>>>> +---------------------------------------------------------------+
>>>>>>>>
>>>>>>>>
