[ 
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252210#comment-14252210
 ] 

Cody Koeninger commented on SPARK-3146:
---------------------------------------

(1) is important because MessageAndMetadata contains more information  than 
just a key and value, namely topic, partition and offset.  There's no reason to 
hide that information from users, and exposing it to them lets them fix all 
kinds of problems for themselves, namely SPARK-2388, manual offset management, 
and things I can't imagine yet.

(2) Regarding "interceptor", maybe this is just a semantic difference.  The 
function you're talking about _is_ flatMap, namely (Container[A], (A) => 
Container[B]) : Container[B] 

The important thing is as you note, that it runs before storage and offset 
checkpointing happens.  I'd find it unfortunate if the api was exposed in terms 
of an "Interceptor" interface that users had to implement and then use a setter 
for, rather than just simply passing in a function to the constructor... but 
ultimately, what matters is that the functionality is exposed.

> Improve the flexibility of Spark Streaming Kafka API to offer user the 
> ability to process message before storing into BM
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3146
>                 URL: https://issues.apache.org/jira/browse/SPARK-3146
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Saisai Shao
>
> Currently Spark Streaming Kafka API stores the key and value of each message 
> into BM for processing, potentially this may lose the flexibility for 
> different requirements:
> 1. currently topic/partition/offset information for each message is discarded 
> by KafkaInputDStream. In some scenarios people may need this information to 
> better filter the message, like SPARK-2388 described.
> 2. People may need to add timestamp for each message when feeding into Spark 
> Streaming, which can better measure the system latency.
> 3. Checkpointing the partition/offsets or others...
> So here we add a messageHandler in interface to give people the flexibility 
> to preprocess the message before storing into BM. In the meantime time this 
> improvement keep compatible with current API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to