[ https://issues.apache.org/jira/browse/SPARK-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-4174:
-----------------------------
    Component/s: Streaming

> Streaming: Optionally provide notifications to Receivers when DStream has been generated
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-4174
>                 URL: https://issues.apache.org/jira/browse/SPARK-4174
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>
> Receivers that read from message queues such as ActiveMQ or Kafka can
> replay messages if required. Using the HDFS WAL mechanism for such systems
> hurts efficiency, because we incur an unnecessary HDFS write when we can
> recover the data from the queue anyway.
>
> We can fix this by notifying the receiver when the RDD is generated from
> the blocks. We need to consider the case where a receiver fails before the
> RDD is generated and comes back on a different executor by the time the RDD
> is generated. Either way, this is likely to cause duplicates rather than
> data loss, so we may be OK.
>
> I am thinking of something along the lines of accepting a callback function
> that gets called when the RDD is generated. We can keep the function locally
> in a map of batch id -> function, and call it when the RDD is generated (we
> can inform the ReceiverSupervisorImpl via Akka when the driver generates the
> RDD). Of course, this is just an early thought; I will work on a design doc
> for this one.
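
The "map of batch id -> function" idea in the description can be sketched roughly as follows. This is only an illustration in plain Python, not Spark code: `BatchCallbackRegistry`, `register`, and `on_rdd_generated` are hypothetical names, and the driver-to-receiver notification (Akka in the proposal) is reduced to a direct method call.

```python
from typing import Callable, Dict, List


class BatchCallbackRegistry:
    """Hypothetical receiver-side registry; not part of Spark's API."""

    def __init__(self) -> None:
        # batch id -> callback to run once the RDD for that batch exists
        self._callbacks: Dict[int, Callable[[int], None]] = {}

    def register(self, batch_id: int, callback: Callable[[int], None]) -> None:
        """Remember what to do (e.g. ack the queue) for this batch."""
        self._callbacks[batch_id] = callback

    def on_rdd_generated(self, batch_id: int) -> bool:
        """Stand-in for the driver's notification; fires the callback once.

        Returns False when no callback is known for the batch, e.g. because
        the receiver restarted on another executor and lost its local map.
        """
        callback = self._callbacks.pop(batch_id, None)
        if callback is None:
            return False
        callback(batch_id)
        return True


# Usage: record which batches would be acknowledged to the message queue.
acked: List[int] = []
registry = BatchCallbackRegistry()
registry.register(1, acked.append)
registry.register(2, acked.append)

fired = registry.on_rdd_generated(1)     # known batch: callback runs
unknown = registry.on_rdd_generated(99)  # unknown batch: nothing to call
```

In this sketch a notification for an unknown batch simply finds no callback, so nothing is acknowledged and the queue would replay those messages. That matches the expectation in the description that a receiver restart leads to duplicates rather than data loss.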