Hi,

Sorry for the slight delay. As discussed in this thread, I have modified the Kafka-Spark-Consumer (https://github.com/dibbhatt/kafka-spark-consumer) code to have a dedicated Receiver for every topic partition. You can see an example of how to create a union of these receivers in consumer.kafka.client.Consumer.java, and a rough sketch of the idea just below.
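For anyone who doesn't want to open the repository right away, here is a minimal sketch of the pattern in plain Spark Streaming Java API terms. PartitionReceiver is a hypothetical stand-in for the library's actual per-partition receiver, and "my-topic" and the partition count are placeholders; only receiverStream() and union() are standard Spark 1.x API calls:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.receiver.Receiver;

// Hypothetical stand-in for the library's per-partition receiver:
// each instance is pinned to one (topic, partition) pair.
class PartitionReceiver extends Receiver<byte[]> {
  private final String topic;
  private final int partition;

  PartitionReceiver(String topic, int partition) {
    super(StorageLevel.MEMORY_AND_DISK_2());
    this.topic = topic;
    this.partition = partition;
  }

  @Override
  public void onStart() {
    // A real receiver would open a Kafka connection for (topic, partition)
    // here and call store(...) for every message it fetches.
  }

  @Override
  public void onStop() {
    // A real receiver would close its Kafka connection here.
  }
}

public class UnionReceivers {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("per-partition-receivers");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));

    int numPartitions = 3; // partitions of the topic being consumed
    List<JavaDStream<byte[]>> streams = new ArrayList<JavaDStream<byte[]>>();

    // One dedicated receiver per topic partition.
    for (int p = 0; p < numPartitions; p++) {
      streams.add(jssc.receiverStream(new PartitionReceiver("my-topic", p)));
    }

    // Union the per-partition streams into a single DStream for processing.
    JavaDStream<byte[]> unified =
        jssc.union(streams.get(0), streams.subList(1, streams.size()));

    unified.print();
    jssc.start();
    jssc.awaitTermination();
  }
}

The point of the union is that downstream transformations see one logical stream, while each partition still gets its own dedicated receiver on the cluster.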
Thanks to Chris for suggesting this change.

Regards,
Dibyendu

On Mon, Sep 1, 2014 at 2:55 AM, RodrigoB <rodrigo.boav...@aspect.com> wrote:
> Just a comment on the recovery part.
>
> Is it correct to say that the current Spark Streaming recovery design does
> not consider re-computations (upon metadata lineage recovery) that depend
> on blocks of data of the received stream?
> https://issues.apache.org/jira/browse/SPARK-1647
>
> Just to illustrate a real use case (mine):
> - We have object states with a Duration field per state, which is
> incremented on every batch interval and reset to 0 upon incoming
> state-changing events. Suppose there is at least one such event since the
> last data checkpoint. This leads to an inconsistency upon driver recovery:
> the Duration field gets incremented from the data-checkpoint version up to
> the recovery moment, but the state-change event is never re-processed, so
> in the end we have the old state with the wrong Duration value.
> To make things worse, imagine we're dumping the Duration increases
> somewhere, which means we're spreading the problem across our system.
> Re-computation awareness is something I've commented on in another thread
> and would rather treat separately:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-td12568.html#a13205
>
> Re-computations do occur, but the only RDDs that are recovered are the
> ones from the data checkpoint. This is what we've seen. That is not enough
> by itself to ensure recovery of computed data, and this partial recovery
> leads to inconsistency in some cases.
>
> Roger - I share the same question with you - I'm just not sure whether the
> replicated data really gets persisted on every batch. The execution
> lineage is checkpointed, but if big chunks of data are being consumed by
> the Receiver node on, say, a per-second basis, then persisting them to
> HDFS every second could be a big challenge for JVM performance - maybe
> that is the reason it's not really implemented, assuming it isn't.
>
> Dibyendu made a great effort with the offset-controlling code, but general
> state-consistent recovery feels to me like another big issue to address.
>
> I plan to dive into the Streaming code and try to at least contribute some
> ideas. Any further insight from anyone on the dev team would be much
> appreciated.
>
> Thanks,
> Rod
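As an aside for readers following the thread: Rodrigo's Duration scenario maps naturally onto updateStateByKey-style state, and writing it out makes the recovery problem concrete. This is a minimal sketch, assuming hypothetical String keys and events; only updateStateByKey and the Guava Optional it takes in the Spark 1.x Java API are real calls:

import java.util.List;

import com.google.common.base.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class DurationState {
  // Duration is modelled here as a Long counting batch intervals.
  public static JavaPairDStream<String, Long> track(
      JavaPairDStream<String, String> eventsByKey) {
    return eventsByKey.updateStateByKey(
        new Function2<List<String>, Optional<Long>, Optional<Long>>() {
          @Override
          public Optional<Long> call(List<String> events, Optional<Long> state) {
            if (!events.isEmpty()) {
              // A state-changing event arrived in this batch: reset to 0.
              return Optional.of(0L);
            }
            // No event this batch: the duration grows by one interval.
            return Optional.of(state.or(0L) + 1L);
          }
        });
  }
}

Upon driver recovery, this state is restored from the last data checkpoint and the increment branch re-runs for every recomputed batch interval; but if the received blocks carrying the reset event were lost, the reset branch never fires again, leaving the old state with an inflated Duration - exactly the inconsistency described above.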