[ 
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252742#comment-14252742
 ] 

Saisai Shao commented on SPARK-3146:
------------------------------------

Hi all, thanks a lot for your comments. My original purpose in this proposal 
was to solve SPARK-2388 and to make the solution more general. I'm not sure 
whether other receivers have such requirements. Also, is it reasonable to add 
this interceptor to receivers such as socket and file?

We can further improve the flexibility of the streaming API with these 
interceptors, for example to filter out unwanted messages or to track latency 
by timestamping each message once it is received. The problem is that such 
flexibility may lead users to abuse this API by processing the data as soon as 
it is received rather than through transformation functions, which potentially 
violates the semantics of Spark Streaming.
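
As a rough illustration of the interceptor idea, here is a sketch in plain 
Python rather than the actual Spark Streaming receiver code. The names 
Message, drop_empty, add_timestamp and apply_interceptors are hypothetical and 
not part of any Spark API. The block only shows how a filtering step and a 
timestamping step could run on each message before it is stored.

```python
import time
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Message:
    """Hypothetical received message; not a Spark or Kafka class."""
    key: str
    value: str
    received_at: Optional[float] = None  # filled in by add_timestamp


# An interceptor returns a (possibly modified) message, or None to drop it.
Interceptor = Callable[[Message], Optional[Message]]


def drop_empty(msg: Message) -> Optional[Message]:
    """Filter step: drop messages with an empty value."""
    return msg if msg.value else None


def add_timestamp(msg: Message) -> Optional[Message]:
    """Timestamp step: record the wall-clock time at reception."""
    return replace(msg, received_at=time.time())


def apply_interceptors(
    msgs: Iterable[Message], interceptors: List[Interceptor]
) -> List[Message]:
    """Run every interceptor, in order, on each message before storing it."""
    kept: List[Message] = []
    for msg in msgs:
        current: Optional[Message] = msg
        for intercept in interceptors:
            current = intercept(current)
            if current is None:  # an interceptor dropped the message
                break
        if current is not None:
            kept.append(current)
    return kept
```

Anything heavier than this kind of light filtering or tagging would still 
belong in the usual transformation functions, which is exactly the concern 
raised above.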

I think we should probably redesign the whole thing to make it both flexible 
and meaningful.


> Improve the flexibility of Spark Streaming Kafka API to offer user the 
> ability to process message before storing into BM
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3146
>                 URL: https://issues.apache.org/jira/browse/SPARK-3146
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Saisai Shao
>
> Currently the Spark Streaming Kafka API stores the key and value of each 
> message into BM for processing, which may limit flexibility for different 
> requirements:
> 1. Currently topic/partition/offset information for each message is discarded 
> by KafkaInputDStream. In some scenarios people may need this information to 
> better filter the messages, as described in SPARK-2388.
> 2. People may need to add a timestamp to each message when feeding it into 
> Spark Streaming, to better measure the system latency.
> 3. Checkpointing the partition/offsets or others...
> So here we add a messageHandler to the interface to give people the 
> flexibility to preprocess each message before storing it into BM. In the 
> meantime, this improvement keeps compatibility with the current API.
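
A minimal sketch of the messageHandler idea from the description, written in 
plain Python and not taken from the actual patch. KafkaRecord, default_handler 
and receive are hypothetical names used only for illustration. The assumption 
is that the handler sees the full record, including topic, partition and 
offset, and that the default handler returns only (key, value) so existing 
callers see the same behavior as before.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class KafkaRecord:
    """Hypothetical record carrying the metadata that is currently discarded."""
    topic: str
    partition: int
    offset: int
    key: str
    value: str


def default_handler(record: KafkaRecord) -> Tuple[str, str]:
    """Default behavior: keep only key and value, as the current API does."""
    return (record.key, record.value)


def receive(
    records: Iterable[KafkaRecord],
    message_handler: Callable[[KafkaRecord], T] = default_handler,
) -> List[T]:
    """Apply the handler to each record; the results are what would be stored."""
    return [message_handler(r) for r in records]


# Usage: keep the offset information alongside the value.
records = [KafkaRecord("events", 0, 42, "k", "v")]
with_offsets = receive(
    records, lambda r: (r.topic, r.partition, r.offset, r.value)
)
```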


