[ https://issues.apache.org/jira/browse/KAFKA-7510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658427#comment-16658427 ]
Mr Kafka commented on KAFKA-7510:
---------------------------------

Some comments:
* Agree this should be handled consistently throughout the suite of Kafka tools, i.e. Connect etc. I only raised it on RecordCollectorImpl as this was where I noticed the issue of leaking data.
* This does not contradict https://issues.apache.org/jira/browse/KAFKA-6538; it only affects its implementation.
* While not every application is sensitive, we should do due diligence by default, especially given the markets Kafka is trying to work with. Data belongs in *log.dirs*, not in log4j output at ERROR level by default.
** The only way to suppress sensitive information today is to disable ERROR-level logs. Doing so would make it impossible to run any serious production deployment of KStreams in a heavily regulated environment without knowingly breaking some regulation, i.e. by leaking secret/sensitive information.
** There's no reason the ERROR-level log message cannot say "Enable TRACE logging to see failed message contents".
** By moving the output to TRACE level, a user has to actively enable dumping data to log4j; they have made a conscious choice and had to weigh their own operational requirements, so this becomes a feature switch. Further, with raw data dumped at DEBUG/TRACE level, a user can set up a separate log4j appender to handle this data: they can actively exclude it from going downstream, pipe it to its own file with further restricted access, etc. Likewise, as this has to be actively enabled, enhanced contextual information can be added.
** Keys/values are rendered via *toString*. If an application has large message sizes (large even being 1 MB) and also has high throughput, then under large numbers of errors KStreams has the potential to denial-of-service itself by eating all available drive space; in a production deployment the log4j output will likely be on the OS volume while data/RocksDB sit on a separate volume.
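The split the comment proposes could look roughly like the following minimal sketch. It is not Kafka's actual implementation: the class name, return value, and use of java.util.logging (with FINEST standing in for log4j's TRACE) are all illustrative assumptions. The point it demonstrates is that the ERROR line carries only non-sensitive context plus a hint, while the raw key/value are emitted only when the operator has opted in to a more verbose level.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class RedactedErrorLogging {
    private static final Logger log = Logger.getLogger("RecordCollector");

    // Hypothetical sketch: the ERROR message contains topic, timestamp, and
    // exception only; the payload appears solely at FINEST (~ log4j TRACE).
    static <K, V> String recordSendError(final K key, final V value,
                                         final Long timestamp, final String topic,
                                         final Exception exception) {
        final String errorLogMessage = String.format(
            "Error sending record to topic %s at timestamp %d: %s. "
            + "Enable TRACE logging to see failed message contents.",
            topic, timestamp, exception.toString());
        log.severe(errorLogMessage);
        // Payload dump is opt-in: emitted only if verbosity was raised.
        if (log.isLoggable(Level.FINEST)) {
            log.finest(String.format("Failed record key=%s value=%s", key, value));
        }
        return errorLogMessage;
    }

    public static void main(String[] args) {
        final String msg = recordSendError("user-42", "secret-payload", 1234L,
                "orders", new RuntimeException("broker down"));
        // The payload never reaches the default-level message.
        System.out.println(msg.contains("secret-payload") ? "LEAKED" : "REDACTED");
        // → prints "REDACTED"
    }
}
```

Because the payload line goes through its own level, a separate appender with restricted file permissions can capture it without it ever reaching downstream log shippers, which is the operational setup the comment describes.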
> KStreams RecordCollectorImpl leaks data to logs on error
> --------------------------------------------------------
>
>                 Key: KAFKA-7510
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7510
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Mr Kafka
>            Priority: Major
>              Labels: user-experience
>
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl leaks data
> on error as it dumps the *value* / message payload to the logs.
> This is problematic as it may contain personally identifiable information
> (PII) or other secret information in plain-text log files, which can then be
> propagated to other log systems, i.e. Splunk.
> I suggest the *key* and *value* fields be moved to debug level, as they are
> useful for some people, while error level contains the *errorMessage*,
> *timestamp*, *topic* and *stackTrace*.
> {code:java}
> private <K, V> void recordSendError(
>     final K key,
>     final V value,
>     final Long timestamp,
>     final String topic,
>     final Exception exception
> ) {
>     String errorLogMessage = LOG_MESSAGE;
>     String errorMessage = EXCEPTION_MESSAGE;
>     if (exception instanceof RetriableException) {
>         errorLogMessage += PARAMETER_HINT;
>         errorMessage += PARAMETER_HINT;
>     }
>     log.error(errorLogMessage, key, value, timestamp, topic, exception.toString());
>     sendException = new StreamsException(
>         String.format(
>             errorMessage,
>             logPrefix,
>             "an error caught",
>             key,
>             value,
>             timestamp,
>             topic,
>             exception.toString()
>         ),
>         exception);
> }{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)