Have you read/watched the materials linked from https://github.com/koeninger/kafka-exactly-once ?
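The short version of what those materials describe: you get exactly-once by storing your Kafka offsets transactionally in the same data store as your results, so output and offsets commit or roll back together. A minimal sketch in Scala, assuming a spark-streaming-kafka-0-10 direct stream bound to `stream` and hypothetical JDBC tables `results` and `kafka_offsets` (none of this is verbatim from the repo):

    import java.sql.DriverManager
    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

    // Sketch only: write results and the Kafka offset range for each partition
    // in one transaction. A replayed task sees the already-advanced offset row
    // and skips its writes, so re-execution does not duplicate output.
    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        val conn = DriverManager.getConnection(jdbcUrl) // hypothetical URL
        try {
          conn.setAutoCommit(false)
          // Lock this partition's offset row; skip if already past this range.
          val check = conn.prepareStatement(
            "SELECT until_offset FROM kafka_offsets WHERE topic = ? AND part = ? FOR UPDATE")
          check.setString(1, osr.topic)
          check.setInt(2, osr.partition)
          val rs = check.executeQuery()
          val committed = if (rs.next()) rs.getLong(1) else 0L
          if (committed < osr.untilOffset) {
            val insert = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
            iter.foreach { record =>
              insert.setString(1, record.value())
              insert.executeUpdate()
            }
            val advance = conn.prepareStatement(
              "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND part = ?")
            advance.setLong(1, osr.untilOffset)
            advance.setString(2, osr.topic)
            advance.setInt(3, osr.partition)
            advance.executeUpdate()
          }
          conn.commit()
        } finally {
          conn.close()
        }
      }
    }

The per-partition offset row is what makes replays idempotent; the repo linked above walks through the same idea in more detail.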
On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> You need to do the bookkeeping of what has been processed yourself. This
> may mean roughly the following (of course, the devil is in the details):
> write down in ZooKeeper which part of the processing job has been done and
> for which dataset all the data has been created (do not keep the data
> itself in ZooKeeper). Once you start a processing job, check in ZooKeeper
> whether it has already been processed; if it has not, remove all staging
> data and process; if it has, terminate.
>
> As I said, the details depend on your job and require some careful
> thinking, but exactly-once can be achieved with Spark (and potentially
> ZooKeeper or something similar, such as Redis). Of course, at the same
> time, consider whether you also need in-order delivery, etc.
>
> On 5 Dec 2016, at 08:59, Michal Šenkýř <bina...@gmail.com> wrote:
>> Hello John,
>>
>>> 1. When a task completes an operation, it notifies the driver. The
>>> driver may not receive the message due to the network and think the
>>> task is still running. Then the child stage won't be scheduled?
>>
>> Spark's fault-tolerance policy is: if there is a problem processing a
>> task, or an executor is lost, run the task (and any dependent tasks)
>> again. Spark attempts to minimize the number of tasks it has to
>> recompute, so usually only a small part of the data is recomputed.
>>
>> So in your case, the driver simply schedules the task on another
>> executor and continues to the next stage when it receives the data.
>>
>>> 2. How does Spark guarantee that the downstream task receives the
>>> shuffle data completely? In fact, I can't find a checksum for blocks
>>> in Spark. For example, an upstream task may shuffle 100 MB of data,
>>> but the downstream task may receive only 99 MB due to the network.
>>> Can Spark verify that the data was received completely based on its
>>> size?
>>
>> Spark uses compression with checksumming for shuffle data, so it should
>> know when the data is corrupt and initiate a recomputation.
>>
>> As for the question in your subject line:
>> All of this means that Spark supports at-least-once processing. There is
>> no way that I know of to ensure exactly-once. You can try to minimize
>> more-than-once situations by updating your offsets as soon as possible,
>> but that does not eliminate the problem entirely.
>>
>> Hope this helps,
>>
>> Michal Senkyr
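For reference, here is a minimal sketch of the ZooKeeper bookkeeping Jörn describes above, using Apache Curator. The `/processed/<job>/<dataset>` path layout and the helper names are invented for illustration:

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.retry.ExponentialBackoffRetry

    // Sketch only: keep a small "done" marker per (job, dataset) in
    // ZooKeeper, never the data itself.
    val client = CuratorFrameworkFactory.newClient(
      "zk-host:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    def markerPath(jobId: String, datasetId: String): String =
      s"/processed/$jobId/$datasetId"

    // On startup: if the marker exists, terminate; otherwise remove any
    // staging data left over from a failed attempt and (re)process.
    def alreadyProcessed(jobId: String, datasetId: String): Boolean =
      client.checkExists().forPath(markerPath(jobId, datasetId)) != null

    // Only after all output for the dataset has been written successfully:
    def markProcessed(jobId: String, datasetId: String): Unit =
      client.create().creatingParentsIfNeeded()
        .forPath(markerPath(jobId, datasetId), Array.empty[Byte])

The careful thinking Jörn mentions goes into the window between writing the last output and creating the marker; cleaning staging data when the marker is absent is what keeps a crash in that window from producing duplicates.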
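And on Michal's point about updating offsets as soon as possible: with the spark-streaming-kafka-0-10 direct stream you can commit exactly the ranges you just processed, which narrows (but, as he says, does not eliminate) the replay window. A sketch, with `ssc`, `topics`, and `kafkaParams` assumed to be set up elsewhere:

    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    stream.foreachRDD { rdd =>
      // Capture the exact offset ranges of this batch before transforming it.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... write rdd to the sink here ...
      // Commit only after the output action succeeds; a crash before this
      // line means the batch is replayed (at-least-once).
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }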