On Tue, Jul 30, 2019 at 11:54:55AM -0700, Jeff Davis wrote:
> Logs are important to diagnose problems or monitor operations, but logs
> can contain sensitive information which is often unnecessary for these
> purposes. Redacting the sensitive information would enable easier
> access and simpler integration with analysis tools without compromising
> the sensitive information.


OK, that's a worthwhile goal. I assume by "sensitive data" you mean user
data, right?

> The challenge is that nobody wants to classify all of the log messages;
> and even if someone did that today, there would be never-ending work in
> the future to try to maintain that classification.
>
> My proposal is:
>
> * redact every '%s' in an ereport by having a special mode for
> snprintf.c (this is possible because we now own snprintf)
> * generate both redacted and unredacted messages (if redaction is
> enabled)
> * choose which destinations (stderr, eventlog, syslog, csvlog) get
> redacted or plain messages
> * emit_log_hook always has both redacted and plain messages available
> * allow specifying a custom redaction function, e.g. a function that
> hashes the string rather than completely redacting it
>
> I think '%s' in a log message is a pretty close match to the kind of
> information that might be sensitive. All data goes through type output
> functions (e.g. the conflicting datum for a unique constraint violation
> message), and most other things that a user might type would go through
> %s. A lot of other information useful in logs, like LSNs, %m's, PIDs,
> etc. would be preserved.
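
To make the proposed mechanism concrete, here is a minimal sketch in C of a formatter that redacts only %s arguments and leaves other conversions alone. It is a toy stand-in, not PostgreSQL's snprintf.c or ereport(): the names format_redacted, redact_fn, and redact_stars are hypothetical, and only %s, %d, and %% are handled.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: writes the replacement for `in` into `out`. */
typedef void (*redact_fn) (const char *in, char *out, size_t outsz);

/* Default policy in this sketch: hide the value completely. */
static void
redact_stars(const char *in, char *out, size_t outsz)
{
    (void) in;
    snprintf(out, outsz, "******");
}

/* Append `s` to `buf` at *pos, truncating silently when the buffer is full. */
static void
append(char *buf, size_t bufsz, size_t *pos, const char *s)
{
    while (*s && *pos + 1 < bufsz)
        buf[(*pos)++] = *s++;
    buf[*pos] = '\0';
}

/*
 * Toy formatter. If `redact` is non-NULL, every %s argument is passed
 * through it; %d arguments are always printed as-is. Passing NULL
 * produces the plain (unredacted) message.
 */
static void
format_redacted(char *buf, size_t bufsz, redact_fn redact, const char *fmt, ...)
{
    va_list     ap;
    size_t      pos = 0;
    char        tmp[64];

    assert(bufsz > 0);
    buf[0] = '\0';
    va_start(ap, fmt);
    for (const char *p = fmt; *p; p++)
    {
        if (*p != '%')
        {
            char        lit[2] = {*p, '\0'};

            append(buf, bufsz, &pos, lit);
            continue;
        }
        p++;
        if (*p == '\0')
            break;              /* trailing '%': stop */
        if (*p == 's')
        {
            const char *arg = va_arg(ap, const char *);

            if (redact)
            {
                redact(arg, tmp, sizeof tmp);
                append(buf, bufsz, &pos, tmp);
            }
            else
                append(buf, bufsz, &pos, arg);
        }
        else if (*p == 'd')
        {
            snprintf(tmp, sizeof tmp, "%d", va_arg(ap, int));
            append(buf, bufsz, &pos, tmp);
        }
        else if (*p == '%')
            append(buf, bufsz, &pos, "%");
        /* any other conversion is dropped in this toy */
    }
    va_end(ap);
}
```

Calling format_redacted twice with the same format and arguments, once with NULL and once with redact_stars, gives the "both redacted and unredacted messages" pair from the list above. The numeric position passed through %d survives in both.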


IMHO the crucial part here is 'might be sensitive'. How often is that
actually true? My guess is 99% of places using %s are not sensitive at
all, and are used for things like filenames, table/attribute names,
and so on. And redacting those parts will make the logs essentially
useless, because we'll get things like this:

   ERROR:  column "******" does not exist at character 10

   ERROR:  division by zero
   CONTEXT:  SQL function "******" during inlining

I'm not sure those are the logs I'd like to see on a production system
while investigating an issue.

> All object names would be redacted, but that's not as bad as it sounds:
>  (a) You can specify a custom redaction function that hashes rather
> than completely redacts. That allows you to see if different messages
> refer to the same object, and also map back to suspected objects if you
> really need to.
>  (b) The unredacted object names are still a part of ErrorData, so you
> can do something interesting with emit_log_hook.

Isn't hashing essentially an information leak, i.e. somewhat undesirable
for sensitive data?

>  (c) You still might have the unredacted logs in a more protected
> place, and can access them when you really need to.
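
Point (a) and the information-leak question are two sides of the same property, which a small sketch can show. The code below assumes a hashing redaction function built on 32-bit FNV-1a, picked only for brevity (it is not a cryptographic hash, and nothing in the thread says which hash would be used). The names redact_hash and guess_from_token are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 32-bit FNV-1a, used here only as a short, well-known example hash. */
static uint32_t
fnv1a(const char *s)
{
    uint32_t    h = 2166136261u;    /* FNV offset basis */

    for (; *s; s++)
    {
        h ^= (unsigned char) *s;
        h *= 16777619u;             /* FNV prime */
    }
    return h;
}

/* Hypothetical redaction function: replace the value with a hash token. */
static void
redact_hash(const char *in, char *out, size_t outsz)
{
    assert(outsz > 0);
    snprintf(out, outsz, "<%08x>", (unsigned) fnv1a(in));
}

/*
 * "Map back to suspected objects": hash each candidate name and compare
 * it with the token found in the log. Returns the index of the matching
 * candidate, or -1 if none matches. Anyone who can read the redacted
 * log and guess plausible values can do the same.
 */
static int
guess_from_token(const char *token, const char *const *candidates, int n)
{
    char        tmp[16];

    for (int i = 0; i < n; i++)
    {
        redact_hash(candidates[i], tmp, sizeof tmp);
        if (strcmp(tmp, token) == 0)
            return i;
    }
    return -1;
}
```

Equal inputs always produce equal tokens, which is what lets you correlate messages about the same object. The same determinism is what allows a dictionary-style guess against low-entropy values such as table names.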


The question is whether that's actually an acceptable solution for
deployments that do handle sensitive data ...

> A weakness of this proposal is that it could be confusing to use
> ereport() in combination with snprintf(). If using snprintf to build
> the format string, nothing would be redacted, so you'd have to be
> careful not to expand any %s that might be sensitive. If using snprintf
> to build up an argument, the entire argument would be redacted. The
> first case should not be common, because good coding generally avoids
> non-constant format strings. The second case is just over-redaction,
> which is not necessarily bad.
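
The two snprintf cases described above can be illustrated with a small C sketch. log_redacted is a hypothetical toy stand-in for a redacting ereport() that understands only %s; the message texts and helper names are made up for illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy stand-in for a redacting log call: copies `fmt` into `buf`,
 * replacing every %s argument with "******". Only %s is understood.
 */
static void
log_redacted(char *buf, size_t bufsz, const char *fmt, ...)
{
    va_list     ap;
    size_t      pos = 0;

    assert(bufsz > 0);
    va_start(ap, fmt);
    for (const char *p = fmt; *p && pos + 1 < bufsz; p++)
    {
        if (p[0] == '%' && p[1] == 's')
        {
            (void) va_arg(ap, const char *);    /* consume, never print */
            for (const char *m = "******"; *m && pos + 1 < bufsz; m++)
                buf[pos++] = *m;
            p++;                /* skip the 's' */
        }
        else
            buf[pos++] = *p;
    }
    va_end(ap);
    buf[pos] = '\0';
}

/*
 * Case 1: the value is expanded into the format string beforehand.
 * By the time log_redacted sees it there is no %s left, so nothing is
 * redacted. (A non-constant format string is poor practice anyway.)
 */
static void
log_via_format(char *out, size_t outsz, const char *value)
{
    char        fmt[128];

    snprintf(fmt, sizeof fmt, "duplicate key value \"%s\"", value);
    log_redacted(out, outsz, fmt);
}

/*
 * Case 2: the argument is pre-built with snprintf. The whole argument
 * is redacted, including the harmless attempt count.
 */
static void
log_via_argument(char *out, size_t outsz, const char *value, int attempts)
{
    char        arg[128];

    snprintf(arg, sizeof arg, "%s after %d attempts", value, attempts);
    log_redacted(out, outsz, "failed: %s", arg);
}
```

In the first helper the sensitive value reaches the output untouched; in the second the output is just "failed: ******", so the attempt count is lost along with the value.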

> One annoying case would be if some of the arguments to ereport() are
> used for things like the right number of commas or tabs -- redacting
> those would just make the message look horrible. I didn't find such
> cases but I'm pretty sure they exist. Another annoying case is time,
> which is useful for debugging, but formatted with %s so it gets
> redacted (I did find plenty of these cases).
>
> But I don't see a better solution. Right now, it's a pain to treat log
> files as sensitive things when there are so many ways they can help
> with smooth operations and so many tools available to analyze them.
> This proposal seems like a practical solution to enable better use of
> log files while protecting potentially-sensitive information.


Hmm. I wonder how difficult it would be to actually go through the
ereport calls and classify those that can leak sensitive data, and then
do redaction only for those. That's about the only alternative approach
I can think of.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
