My main complaint about the WAL mechanism in the new reliable Kafka receiver is that you have to enable checkpointing, and for some reason, even when spark.cleaner.ttl is set to a reasonable value, only the metadata is cleaned periodically. In my tests, using a folder on my filesystem as the checkpoint directory, the receivedMetaData folder stays almost constant in size, but the receivedData folder keeps growing; spark.cleaner.ttl was set to 300 seconds.
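For reference, this is roughly the setup I am testing with -- a minimal sketch assuming the Spark 1.2-era KafkaUtils.createStream API; the checkpoint path, ZooKeeper quorum, group id, and topic name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object WalCheckpointTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("reliable-kafka-wal-test")
      // Required for the reliable receiver: persist received blocks to a WAL.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      // The TTL discussed above; in my tests it only bounds receivedMetaData.
      .set("spark.cleaner.ttl", "300")

    val ssc = new StreamingContext(conf, Seconds(10))
    // The checkpoint directory is where receivedData/ and receivedMetaData/ live.
    ssc.checkpoint("/tmp/spark-checkpoint") // placeholder path

    val stream = KafkaUtils.createStream(
      ssc,
      "localhost:2181",   // ZooKeeper quorum (placeholder)
      "wal-test-group",   // consumer group id (placeholder)
      Map("events" -> 1), // topic -> number of consumer threads
      StorageLevel.MEMORY_AND_DISK_SER)

    stream.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}

With this setup, receivedData under /tmp/spark-checkpoint grows without bound while receivedMetaData does not.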
2014-12-03 10:13 GMT+00:00 Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com>:

> Hi,
>
> Yes, as Jerry mentioned, SPARK-3129
> (https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature,
> which solves the driver-failure problem. The way 3129 is designed, it
> solves the driver-failure problem in a way that is agnostic of the stream
> source (Kafka, Flume, etc.). But 3129 alone is not a complete solution for
> data loss: you also need a reliable receiver, one that prevents data loss
> on receiver failure.
>
> The Low Level Consumer (https://github.com/dibbhatt/kafka-spark-consumer),
> for which this email thread was started, has solved that problem with the
> Kafka low-level API.
>
> And SPARK-4062, as Jerry mentioned, also recently solved the same problem
> using the Kafka high-level API.
>
> On the Kafka high-level consumer API approach, I would like to mention that
> Kafka 0.8 has an issue, described in this wiki
> (https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design),
> where consumer re-balance sometimes fails; that is one of the key reasons
> Kafka is rewriting the consumer API in Kafka 0.9.
>
> I know a few folks have already faced this re-balancing issue while using
> the Kafka high-level API, and if you ask my opinion, we at Pearson are
> still using the Low Level Consumer, as it seems to be more robust and
> performant. We have been using it for a few months without any issue...
> and I may also be a little biased :)
>
> Regards,
> Dibyendu
>
>
> On Wed, Dec 3, 2014 at 7:04 AM, Shao, Saisai <saisai.s...@intel.com> wrote:
>
>> Hi Rod,
>>
>> The purpose of introducing the WAL mechanism in Spark Streaming as a
>> general solution is to let all receivers benefit from it.
>>
>> As you said, external sources like Kafka have their own checkpoint
>> mechanism, so instead of storing data in the WAL, we could store only
>> metadata in the WAL and recover from the last committed offsets. But this
>> requires a sophisticated Kafka receiver built on the low-level API, and we
>> would need to handle rebalance and fault tolerance ourselves. So for now,
>> instead of implementing a whole new receiver, we chose to implement a
>> simple one: though its performance is not as good, it is much easier to
>> understand and maintain.
>>
>> The design rationale and implementation of the reliable Kafka receiver can
>> be found in SPARK-4062 (https://issues.apache.org/jira/browse/SPARK-4062).
>> Improving the reliable Kafka receiver along the lines you mentioned is on
>> our schedule.
>>
>> Thanks
>> Jerry
>>
>>
>> -----Original Message-----
>> From: RodrigoB [mailto:rodrigo.boav...@aspect.com]
>> Sent: Wednesday, December 3, 2014 5:44 AM
>> To: u...@spark.incubator.apache.org
>> Subject: Re: Low Level Kafka Consumer for Spark
>>
>> Dibyendu,
>>
>> Just to make sure I will not be misunderstood: my concerns refer to the
>> upcoming Spark solution, not yours. I wanted to gather the perspective of
>> someone who implemented recovery with Kafka a different way.
>>
>> Thanks,
>> Rod
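To make Jerry's point about the reliable receiver concrete: the ordering it has to enforce is that received data is made durable (via the WAL) before the consumer offsets are committed. Below is a rough sketch of that pattern against the Kafka 0.8 high-level consumer API -- not the actual SPARK-4062 receiver; storeToWal is a hypothetical stand-in for Receiver.store(), and the connection settings are placeholders:

import java.util.Properties
import kafka.consumer.{Consumer, ConsumerConfig}

// Sketch only: shows the commit-after-store ordering, not a full Spark Receiver.
object ReliableReceiveSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("zookeeper.connect", "localhost:2181") // placeholder
    props.put("group.id", "wal-test-group")          // placeholder
    // Key point: the receiver, not Kafka, decides when offsets are committed.
    props.put("auto.commit.enable", "false")

    val connector = Consumer.create(new ConsumerConfig(props))
    val streams = connector.createMessageStreams(Map("events" -> 1))
    for (msg <- streams("events").head) {
      storeToWal(msg.message()) // block until the data is durable (the WAL write)
      connector.commitOffsets   // only then advance the consumer group's offset
    }
  }

  // Hypothetical stand-in for Receiver.store(); in Spark this is the write
  // that the WAL makes durable before the offset commit above.
  def storeToWal(bytes: Array[Byte]): Unit = ()
}

If the process dies between storeToWal and commitOffsets, the messages are simply re-fetched from the last committed offset on restart, so nothing is lost (at the cost of possible duplicates).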