[ https://issues.apache.org/jira/browse/KAFKA-7510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658427#comment-16658427 ]

Mr Kafka commented on KAFKA-7510:
---------------------------------

Some comments:
 * Agree this should be handled consistently throughout Kafka and its tools, 
i.e. Connect, etc. I only raised it against RecordCollectorImpl as that is 
where I noticed data being leaked.
 * This does not contradict https://issues.apache.org/jira/browse/KAFKA-6538; 
it only affects its implementation.
 * While not every application handles sensitive data, we should do due 
diligence by default, especially given the markets Kafka is trying to work in. 
By default, data belongs in *log.dirs*, not in log4j output at ERROR level.
 ** As it stands, the only way to suppress sensitive information is to disable 
ERROR-level logs entirely. Doing so would make it impossible to run any serious 
production deployment of KStreams in a heavily regulated environment without 
knowingly breaking some regulation, i.e. by leaking secret/sensitive 
information.
 ** There is no reason the ERROR-level message cannot say "Enable TRACE 
logging to see failed message contents" (see the sketch after this list).
 ** By moving the payload output to TRACE level, a user has to actively enable 
dumping data to log4j; they have made a conscious choice, weighed against their 
own operational requirements, so this becomes a feature switch. Further, with 
raw data dumped only at DEBUG/TRACE level, a user can set up a separate log4j 
appender to handle it: they can actively exclude it from going downstream, 
pipe it to its own file with restricted access, and so on (an example appender 
configuration follows below). Likewise, as this has to be actively enabled, 
enhanced contextual information can be added.
 ** Keys and values are rendered via *toString*. If an application has large 
messages (large even being 1 MB) and high throughput, then under a flood of 
errors KStreams has the potential to denial-of-service itself by eating all 
available drive space: in a production deployment the log4j output will likely 
live on the OS volume while the data / RocksDB state lives on a separate 
volume.
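
To make that concrete, here is a rough sketch of how recordSendError could split the logging. This is illustrative only, not a patch: the message strings are invented, and it assumes the surrounding RecordCollectorImpl members (log, logPrefix, sendException) visible in the snippet quoted below.
{code:java}
// Sketch only: ERROR level keeps the context (topic, timestamp, exception)
// and points the operator at TRACE; the raw key/value payload is emitted
// solely at TRACE level.
private <K, V> void recordSendError(
    final K key,
    final V value,
    final Long timestamp,
    final String topic,
    final Exception exception
) {
    String errorLogMessage = "Error sending record to topic {} (timestamp {}): {}. "
        + "Enable TRACE logging to see failed message contents.";
    if (exception instanceof RetriableException) {
        errorLogMessage += " You may increase producer `retries` to avoid this.";
    }
    // Context only at ERROR level; no key/value payload.
    log.error(errorLogMessage, topic, timestamp, exception.toString());
    // The payload only appears once an operator has consciously enabled TRACE.
    log.trace("Failed record contents: key {} value {}", key, value);
    sendException = new StreamsException(
        String.format("%sError sending record to topic %s (timestamp %s); "
            + "see preceding log output for details.", logPrefix, topic, timestamp),
        exception);
}
{code}
With the payload on its own level, an operator can then route it to a dedicated, access-restricted file and stop it propagating downstream. A sketch in log4j 1.x properties syntax follows; the appender name and file path are assumptions for illustration:
{code}
# Give RecordCollectorImpl its own appender and stop its output propagating
# to the root logger (and hence to downstream log shippers such as Splunk).
log4j.logger.org.apache.kafka.streams.processor.internals.RecordCollectorImpl=TRACE, payloadLog
log4j.additivity.org.apache.kafka.streams.processor.internals.RecordCollectorImpl=false

# Size-bounded rolling file, kept off the OS volume so a flood of errors
# cannot fill the system disk; restrict filesystem permissions separately.
log4j.appender.payloadLog=org.apache.log4j.RollingFileAppender
log4j.appender.payloadLog.File=/data/logs/kstreams-failed-records.log
log4j.appender.payloadLog.MaxFileSize=100MB
log4j.appender.payloadLog.MaxBackupIndex=10
log4j.appender.payloadLog.layout=org.apache.log4j.PatternLayout
log4j.appender.payloadLog.layout.ConversionPattern=%d %p %m%n
{code}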

 

> KStreams RecordCollectorImpl leaks data to logs on error
> --------------------------------------------------------
>
>                 Key: KAFKA-7510
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7510
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Mr Kafka
>            Priority: Major
>              Labels: user-experience
>
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl leaks data 
> on error, as it dumps the *value* / message payload to the logs.
> This is problematic as it may write personally identifiable information 
> (PII) or other secret information to plain-text log files, which can then be 
> propagated to other log systems, e.g. Splunk.
> I suggest the *key* and *value* fields be moved to DEBUG level, as they are 
> useful to some people, while ERROR level retains the *errorMessage*, 
> *timestamp*, *topic*, and *stackTrace*.
> {code:java}
> private <K, V> void recordSendError(
>     final K key,
>     final V value,
>     final Long timestamp,
>     final String topic,
>     final Exception exception
> ) {
>     String errorLogMessage = LOG_MESSAGE;
>     String errorMessage = EXCEPTION_MESSAGE;
>     if (exception instanceof RetriableException) {
>         errorLogMessage += PARAMETER_HINT;
>         errorMessage += PARAMETER_HINT;
>     }
>     // The key and value are rendered via toString() into the ERROR log here;
>     // this is the point where the payload leaks into plain-text log output.
>     log.error(errorLogMessage, key, value, timestamp, topic, exception.toString());
>     // The payload is also embedded in the StreamsException message below.
>     sendException = new StreamsException(
>         String.format(
>             errorMessage,
>             logPrefix,
>             "an error caught",
>             key,
>             value,
>             timestamp,
>             topic,
>             exception.toString()
>         ),
>         exception);
> }{code}



