[ 
https://issues.apache.org/jira/browse/SPARK-7139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-7139:
---------------------------------
    Description: 
The receiver API allows arbitrary metadata to be attached to each block. However, 
that metadata is not saved in the WAL as part of the block information in 
the driver. 
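
For context, a custom receiver can attach such metadata (for example a Kafka 
offset or a Kinesis sequence number) when it stores a block, using the 
`store(dataBuffer, metadata)` variant of the receiver API. A minimal sketch; the 
receiver class, the records, and the metadata value below are illustrative, not 
taken from Spark:

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative custom receiver; the class name, records and metadata value
// are hypothetical. The point is the `store(dataBuffer, metadata)` call.
class OffsetTrackingReceiver
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // A real receiver would do this in a background thread as data arrives.
    val records = ArrayBuffer("record-1", "record-2")
    val lastOffset: Long = 42L  // e.g. a Kafka offset or Kinesis sequence number

    // This metadata is what SPARK-7139 wants persisted in the driver-side
    // WAL along with the rest of the block information.
    store(records, lastOffset)
  }

  def onStop(): Unit = { }
}
```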

To fix this, the following needs to be done. 

1. Forward the metadata to the ReceivedBlockTracker in the driver.
2. ReceivedBlockTracker saves the metadata and recovers it on restart. 

However, there is one tricky thing. The ReceivedBlockTracker WAL is enabled only 
when `spark.streaming.receiver.writeAheadLog.enable = true`, which means the 
driver-side WAL is enabled only when the receiver WAL is enabled. This is not 
desired, as one may want to save and recover block metadata (especially 
information like Kafka offsets or Kinesis sequence numbers) that can be used to 
recover the data without actually writing the data itself to the receiver WAL. 
So we have to always enable the tracker WAL (a configuration and flow sketch 
follows after step 3 below). 

3. Always enable the ReceivedBlockTracker WAL. However, make sure that the 
WriteAheadLogBackedBlockRDD skips the block lookup after a restart, as the 
blocks themselves are obviously gone.
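
To make the configuration dependency and the intended flow concrete, here is a 
minimal sketch. The config key is the real one mentioned above; the tracker and 
block-info classes are simplified, hypothetical stand-ins, not Spark's actual 
ReceivedBlockTracker / ReceivedBlockInfo / WriteAheadLogBackedBlockRDD classes:

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.SparkConf

object WalMetadataSketch {

  // Today this single flag gates both the receiver-side WAL and the
  // driver-side (ReceivedBlockTracker) WAL; the proposal is to always
  // enable the tracker WAL regardless of this setting.
  val conf = new SparkConf()
    .setAppName("wal-metadata-sketch")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")

  // Hypothetical, simplified stand-ins for the driver-side bookkeeping;
  // not Spark's actual ReceivedBlockTracker / ReceivedBlockInfo classes.
  case class BlockRecord(
      streamId: Int,
      blockId: String,
      metadata: Option[Any])  // e.g. a Kafka offset or Kinesis sequence number

  class SimpleBlockTracker(writeToWal: BlockRecord => Unit) {
    private val records = ArrayBuffer.empty[BlockRecord]

    // Steps 1-2: metadata forwarded from the receiver is kept with the
    // block info and written to the WAL, so it survives a driver failure.
    def addBlock(record: BlockRecord): Unit = {
      writeToWal(record)
      records += record
    }

    // On restart, replaying the WAL restores the metadata (offsets, sequence
    // numbers) even though the block data itself is gone, which is why the
    // block lookup has to be skipped after recovery (step 3).
    def recoverFromWal(replayed: Seq[BlockRecord]): Unit = {
      records ++= replayed
    }
  }
}
```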

  was:
The receiver API allows arbitrary metadata to be attached to each block. However, 
that metadata is not saved in the WAL as part of the block information in 
the driver. 

To fix this, the following needs to be done. 

1. Forward the metadata to the ReceivedBlockTracker in the driver.
2. ReceivedBlockTracker saves the metadata and recovers it on restart. 

However, there is one tricky thing. The ReceivedBlockTracker WAL is enabled only 
when `spark.streaming.receiver.writeAheadLog.enable = true`, which means the 
driver-side WAL is enabled only when the receiver WAL is enabled. This is not 
desired, as one may want to save and recover block metadata (especially 
information like Kafka offsets or Kinesis sequence numbers) that can be used to 
recover the data without actually writing the data itself to the receiver WAL. 
So we have to always enable the tracker WAL. 

3. Always enable the ReceivedBlockTracker WAL. However, make sure that the 
blockIds are not recovered, as they will not be useful after a driver restart 
(the blocks are gone!). 


> Allow received block metadata to be saved to WAL and recovered on driver 
> failure
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-7139
>                 URL: https://issues.apache.org/jira/browse/SPARK-7139
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>            Priority: Critical
>
> The receiver API allows arbitrary metadata to be attached to each block. 
> However, that metadata is not saved in the WAL as part of the block 
> information in the driver. 
> To fix this, the following needs to be done. 
> 1. Forward the metadata to the ReceivedBlockTracker in the driver.
> 2. ReceivedBlockTracker saves the metadata and recovers it on restart. 
> However, there is one tricky thing. The ReceivedBlockTracker WAL is enabled 
> only when `spark.streaming.receiver.writeAheadLog.enable = true`, which means 
> the driver-side WAL is enabled only when the receiver WAL is enabled. This is 
> not desired, as one may want to save and recover block metadata (especially 
> information like Kafka offsets or Kinesis sequence numbers) that can be used 
> to recover the data without actually writing the data itself to the receiver 
> WAL. So we have to always enable the tracker WAL. 
> 3. Always enable the ReceivedBlockTracker WAL. However, make sure that the 
> WriteAheadLogBackedBlockRDD skips the block lookup after a restart, as the 
> blocks themselves are obviously gone.



