[ https://issues.apache.org/jira/browse/CASSANDRA-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702693#comment-17702693 ]
Andres de la Peña edited comment on CASSANDRA-17940 at 3/20/23 1:39 PM: ------------------------------------------------------------------------ Thanks for the review. I have been doing some profiling, as suggested by [~blerer]. It has shown that [this call to List#addAll|https://github.com/apache/cassandra/blob/641b906be5974776544a45b3f5ec3c81d7b11c53/src/java/org/apache/cassandra/cql3/functions/masking/ColumnMask.java#L123] on {{ColumnMask#mask}} was taking a toll on performance: !read_ddm_enabled_default_addall.png|width=627,height=687! I have made [some changes|https://github.com/apache/cassandra/pull/2193/commits/5495ec4de9c8139ba3ce163762ad66891901a985] to store the partial arguments as an array, use {{System.arraycopy}} when there are multiple arguments, and optimize for the case where there are no partial arguments. A [further improvement|https://github.com/apache/cassandra/pull/2193/commits/85adba9ded965516f67e0b00b23bade5f00e1991], at least for native functions, is exploiting the fact that all the partial arguments in a column mask are fixed and can be either be memoized or ignored if the function doesn't take such arguments. I have added a new {{NativeFunction#Masker}} class to represent this. The {{ColumnMask}} caches the {{{}Masker{}}}, so it only has to call {{Masker#mask(ByteBuffer)}} with the value of the masked column. That allows us to avoid playing with collections on every call to {{{}ColumnMask#mask{}}}. It also skips the unserializing of the partial arguments on very function call. After these two improvements, the flame graphs for running some queries look like: !select_query_with_ddm.jpg|width=816,height=365! I don't see any noticable difference between unpatched trunk and the feature branch with DDM disabled. Enabling DDM without masking any column doesn't seem to make a difference either. If we mask the queried columns with either {{mask_default}} or {{mask_replace}} the query times seem to increase a little bit. That increase seems to be caused not by the mask application, but by the use of {{SelectionWithProcessing}} instead of {{SimpleSelection}}. If we use {{mask_inner}} we see a more noticeable impact of masking. This is due to the masking function unserializing the column value and serializing its masked value, [here|https://github.com/apache/cassandra/blob/641b906be5974776544a45b3f5ec3c81d7b11c53/src/java/org/apache/cassandra/cql3/functions/masking/PartialMaskingFunction.java#L119-L121]. I'd say this is acceptable, although perhaps we could try to optimize that function by delegating its action to the type serializer. In any case, that seems like something to do in a separate ticket. was (Author: adelapena): Thanks for the review. I have been doing some profiling, as suggested by [~blerer]. It has shown that [this call to List#addAll|https://github.com/apache/cassandra/blob/641b906be5974776544a45b3f5ec3c81d7b11c53/src/java/org/apache/cassandra/cql3/functions/masking/ColumnMask.java#L123] on {{ColumnMask#mask}} was taking a toll on performance: !read_ddm_enabled_default_addall.png|width=627,height=687! I have made some changes to store the partial arguments as an array, use {{System.arraycopy}} when there are multiple arguments, and optimize for the case where there are no partial arguments. A further improvement, at least for native functions, is exploiting the fact that all the partial arguments in a column mask are fixed and can be either be memoized or ignored if the function doesn't take such arguments. I have added a new {{NativeFunction#Masker}} class to represent this. The {{ColumnMask}} caches the {{{}Masker{}}}, so it only has to call {{Masker#mask(ByteBuffer)}} with the value of the masked column. That allows us to avoid playing with collections on every call to {{{}ColumnMask#mask{}}}. It also skips the unserializing of the partial arguments on very function call. After these two improvements, the flame graphs for running some queries look like: !select_query_with_ddm.jpg|width=816,height=365! I don't see any noticable difference between unpatched trunk and the feature branch with DDM disabled. Enabling DDM without masking any column doesn't seem to make a difference either. If we mask the queried columns with either {{mask_default}} or {{mask_replace}} the query times seem to increase a little bit. That increase seems to be caused not by the mask application, but by the use of {{SelectionWithProcessing}} instead of {{SimpleSelection}}. If we use {{mask_inner}} we see a more noticeable impact of masking. This is due to the masking function unserializing the column value and serializing its masked value, [here|https://github.com/apache/cassandra/blob/641b906be5974776544a45b3f5ec3c81d7b11c53/src/java/org/apache/cassandra/cql3/functions/masking/PartialMaskingFunction.java#L119-L121]. I'd say this is acceptable, although perhaps we could try to optimize that function by delegating its action to the type serializer. In any case, that seems like something to do in a separate ticket. > CEP-20: Dynamic Data Masking > ---------------------------- > > Key: CASSANDRA-17940 > URL: https://issues.apache.org/jira/browse/CASSANDRA-17940 > Project: Cassandra > Issue Type: Epic > Components: Feature/Dynamic Data Masking > Reporter: Andres de la Peña > Assignee: Andres de la Peña > Priority: Normal > Fix For: 5.x > > Attachments: read_ddm_enabled_default_addall.png, > select_query_with_ddm.jpg, select_query_with_ddm_detail.jpg > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Implementation of dynamic data masking as described in > [CEP-20|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking]. > Dynamic data masking (DDM) allows to obscure sensitive information while > still allowing access to the masked columns, and without changing the stored > data. The CEP aims to provide DDM for Apache Cassandra using both native and > user-defined CQL functions. These functions can be either be used on their > own or be attached to specific columns in the table definition. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org