[jira] [Commented] (KUDU-1812) Redact user data that gets logged
[ https://issues.apache.org/jira/browse/KUDU-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765841#comment-15765841 ]

Adar Dembo commented on KUDU-1812:
----------------------------------

Given the nature of this problem, it's important for there to be a clear description of Kudu's redaction policy: for users, so they know what to expect, and for developers, who are meant to adhere to it. I'll try to do that below, based on a discussion I had with JD and Dan today.

As of today, Kudu takes a somewhat narrow definition of "sensitive user data": row data values. In the future, this definition could be broadened to include things like table and column names, but for the time being we're only considering row data values and data derived from row data values.

Row data values are obviously passed between server and client in client-facing operations (such as writes and scans), but are also passed between servers in server-to-server operations (such as log replication); it's important that we don't leak them in either case. Besides direct row data (such as a batch of INSERT, UPDATE, UPSERT, or DELETE operations), we must also consider data that may imply the existence of row data, such as scan predicates.

Server-side, Kudu will adhere to the following policy:
# There will be a new gflag to control whether sensitive user data will be redacted. This flag will exist for both masters and tservers, and will default to 'true'.
# When a Kudu server logs a message containing sensitive user data, the gflag's value must be consulted. If true, the sensitive data must be replaced with a "<redacted>" string. The rest of the message can remain the same.
# The same applies to errors returned by Kudu servers, should they embed sensitive user data.

Client-side, Kudu will adhere to the following policy:
# Sensitive user data may be returned in toString() (Java) or ToString() (C++) calls.
# All sensitive user data must be explicitly stripped from all LOG(), VLOG(), and slf4j log statements.
# All sensitive user data must be explicitly stripped from all thrown exceptions (Java) or Status messages (C++).

Taken together, these policies should ensure that the following never leak sensitive user data:
* A Kudu client implementation can log all errors returned by a server (assuming the gflag's value was 'true').
* An application can log all errors returned by a Kudu client.
* A log collection service can collect all Kudu server logs.

> Redact user data that gets logged
> ---------------------------------
>
>                 Key: KUDU-1812
>                 URL: https://issues.apache.org/jira/browse/KUDU-1812
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>
> There are many instances in the code base where we log user data and there is
> a class of users that do not want this behavior. As an example, we might be
> debugging an issue on the mailing list and the user has to scrub the logs
> they share by hand because they don't want it to leak.
> On the server-side, we should replace all those instances with some string
> like "redacted" and add a process flag to enable the logging of user data.
> On the client-side, it gets a bit more tricky. We can't use such flags so we
> need to strike a balance between removing unnecessary logging of user
> information and still keep the software usable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
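The server-side and client-side rules in the comment above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not Kudu's implementation: `g_redact_sensitive_data` is a plain global standing in for the proposed gflag (the thread does not name it), `kRedactedPlaceholder` is an assumed placeholder string, and `Redact`, `DuplicateKeyError`, and `ClientRow` are hypothetical names.

```cpp
#include <string>

// Stand-in for the proposed gflag. The thread does not name the flag, so this
// identifier is hypothetical. Redaction is on by default, as in the policy.
bool g_redact_sensitive_data = true;

// Assumed placeholder text; the issue only asks for "some string like redacted".
const char* const kRedactedPlaceholder = "<redacted>";

// Returns the placeholder when redaction is enabled, otherwise the value itself.
std::string Redact(const std::string& sensitive_value) {
  return g_redact_sensitive_data ? std::string(kRedactedPlaceholder)
                                 : sensitive_value;
}

// Server side: only the row value is replaced. The rest of the message,
// including the table name, is left unchanged.
std::string DuplicateKeyError(const std::string& table_name,
                              const std::string& row_value) {
  return "duplicate key in table " + table_name + ": " + Redact(row_value);
}

// Client side: there is no flag to consult. ToString() may expose row data,
// but text destined for a log line or a Status message must not.
struct ClientRow {
  std::string table_name;
  std::string value;

  // Allowed to contain row data: the caller asked for it explicitly.
  std::string ToString() const { return table_name + "(" + value + ")"; }

  // Safe to log or to embed in an error: mentions the table, never the value.
  std::string ErrorMessage() const {
    return "failed to write a row to table " + table_name;
  }
};
```

With the flag left at its default, a server error built by `DuplicateKeyError` carries only the placeholder, so a client that logs the error verbatim does not leak the row value.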
[jira] [Commented] (KUDU-1812) Redact user data that gets logged
[ https://issues.apache.org/jira/browse/KUDU-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763214#comment-15763214 ]

Dan Burkert commented on KUDU-1812:
-----------------------------------

Also working on predicates. Another avenue to search is calls to DebugString and ShortDebugString on protobuf messages.
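One way to handle such call sites is to route them through a wrapper that consults the redaction setting before printing a message. The sketch below is hypothetical and is not Kudu's API: `g_redact_sensitive_data` stands in for the server flag, `RedactedDebugString` is an invented name, and `FakeWriteRequest` stands in for a generated protobuf class so that the example compiles without the protobuf library. The template accepts any type that has a const `DebugString()` method.

```cpp
#include <string>

// Stand-in for the server-side redaction flag (hypothetical name).
bool g_redact_sensitive_data = true;

// Wraps msg.DebugString(). When redaction is enabled the whole message is
// replaced by a placeholder, because any field might hold row data.
template <typename Message>
std::string RedactedDebugString(const Message& msg) {
  if (g_redact_sensitive_data) {
    return "<redacted>";  // assumed placeholder text
  }
  return msg.DebugString();
}

// Stand-in for a generated protobuf message that carries row data.
struct FakeWriteRequest {
  std::string DebugString() const { return "row_key: \"alice\""; }
};
```

Redacting the whole message is coarse. A finer-grained approach would redact only the fields that carry row data, at the cost of having to mark those fields.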
[jira] [Commented] (KUDU-1812) Redact user data that gets logged
[ https://issues.apache.org/jira/browse/KUDU-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762920#comment-15762920 ]

Jean-Daniel Cryans commented on KUDU-1812:
------------------------------------------

Java client: https://gerrit.cloudera.org/#/c/5549/
[jira] [Commented] (KUDU-1812) Redact user data that gets logged
[ https://issues.apache.org/jira/browse/KUDU-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762741#comment-15762741 ]

Dan Burkert commented on KUDU-1812:
-----------------------------------

I've identified a few major places where we log row data, or data derived from row data:
* partition keys of individual rows
* partial rows
* encoded buffers / protobuf messages
* predicates

I'm working on adding a server side {{log_row_contents}} flag right now and auditing/removing all instances where we might be logging row partition keys in the C++ codebase.
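As an illustration of gating one of these call sites on the flag, the sketch below formats a scan predicate for logging. Assumptions: `g_log_row_contents` is a plain global modelling the {{log_row_contents}} flag rather than a real gflag, it is assumed to default to false, and `ColumnPredicate`, `PredicateForLog`, and the `<redacted>` placeholder are hypothetical. The column name is kept because only row data values, and data derived from them, are being treated as sensitive.

```cpp
#include <string>

// Models the log_row_contents flag as a plain global. The default of false
// (do not log row contents) is an assumption.
bool g_log_row_contents = false;

// Hypothetical, simplified scan predicate.
struct ColumnPredicate {
  std::string column_name;  // schema metadata, not treated as sensitive here
  std::string op;           // comparison operator, e.g. "=", "<", ">="
  std::string value;        // derived from row data, therefore sensitive
};

// Formats a predicate for a log line, hiding the value unless the flag is set.
std::string PredicateForLog(const ColumnPredicate& p) {
  const std::string shown =
      g_log_row_contents ? p.value : std::string("<redacted>");
  return p.column_name + " " + p.op + " " + shown;
}
```

The same pattern would apply to partition keys and partial rows: format the non-sensitive context as usual and substitute the placeholder for the value.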