GitHub user gaborgsomogyi opened a pull request: https://github.com/apache/spark/pull/20620
[SPARK-23438][DSTREAMS] Fix DStreams data loss with WAL when driver crashes ## What changes were proposed in this pull request? There is a race condition introduced in SPARK-11141 which could cause data loss. The problem is that ReceivedBlockTracker.insertAllocatedBatch function assumes that all blocks from streamIdToUnallocatedBlockQueues allocated to the batch and clears the queue. In this PR only the allocated blocks will be removed from the queue which will prevent data loss. ## How was this patch tested? Additional unit test + manually. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gaborgsomogyi/spark SPARK-23438 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20620.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20620 ---- commit 152fec431218161e538c377a6cb82753100dc70b Author: Gabor Somogyi <gabor.g.somogyi@...> Date: 2018-02-09T08:30:19Z [SPARK-23438][DSTREAMS] Fix DStreams data loss with WAL when driver crashes ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org