[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252210#comment-14252210 ]
Cody Koeninger commented on SPARK-3146: --------------------------------------- (1) is important because MessageAndMetadata contains more information than just a key and value, namely topic, partition and offset. There's no reason to hide that information from users, and exposing it to them lets them fix all kinds of problems for themselves, namely SPARK-2388, manual offset management, and things I can't imagine yet. (2) Regarding "interceptor", maybe this is just a semantic difference. The function you're talking about _is_ flatMap, namely (Container[A], (A) => Container[B]) : Container[B] The important thing is as you note, that it runs before storage and offset checkpointing happens. I'd find it unfortunate if the api was exposed in terms of an "Interceptor" interface that users had to implement and then use a setter for, rather than just simply passing in a function to the constructor... but ultimately, what matters is that the functionality is exposed. > Improve the flexibility of Spark Streaming Kafka API to offer user the > ability to process message before storing into BM > ------------------------------------------------------------------------------------------------------------------------ > > Key: SPARK-3146 > URL: https://issues.apache.org/jira/browse/SPARK-3146 > Project: Spark > Issue Type: Improvement > Components: Streaming > Affects Versions: 1.0.2, 1.1.0 > Reporter: Saisai Shao > > Currently Spark Streaming Kafka API stores the key and value of each message > into BM for processing, potentially this may lose the flexibility for > different requirements: > 1. currently topic/partition/offset information for each message is discarded > by KafkaInputDStream. In some scenarios people may need this information to > better filter the message, like SPARK-2388 described. > 2. People may need to add timestamp for each message when feeding into Spark > Streaming, which can better measure the system latency. > 3. Checkpointing the partition/offsets or others... > So here we add a messageHandler in interface to give people the flexibility > to preprocess the message before storing into BM. In the meantime time this > improvement keep compatible with current API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org