My main complaint about the WAL mechanism in the new reliable Kafka receiver
is that you have to enable checkpointing, and, for some reason, even when
spark.cleaner.ttl is set to a reasonable value, only the metadata is
cleaned periodically. In my tests, using a folder on my filesystem as the
checkpoint directory with spark.cleaner.ttl set to 300 seconds, the
receivedMetaData folder stays almost constant in size while the
receivedData folder keeps growing.
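
For reference, this is roughly the setup I am testing with (a minimal
sketch; the app name and checkpoint path are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("reliable-kafka-test")                             // placeholder
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // turn the WAL on
    .set("spark.cleaner.ttl", "300")                               // cleaner at 300s

  val ssc = new StreamingContext(conf, Seconds(10))
  // receivedData and receivedMetaData end up under this directory
  ssc.checkpoint("file:///tmp/spark-checkpoint")                   // placeholder path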

2014-12-03 10:13 GMT+00:00 Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com>:

> Hi,
>
> Yes, as Jerry mentioned, SPARK-3129 (
> https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature,
> which solves the driver-failure problem. The way SPARK-3129 is designed,
> it solves driver failure independently of the stream source (Kafka,
> Flume, etc.). But 3129 alone does not give you a complete solution for
> data loss: you also need a reliable receiver that prevents data loss on
> receiver failure.
>
> The Low Level Consumer (https://github.com/dibbhatt/kafka-spark-consumer),
> for which this email thread was started, solves that problem using the
> Kafka low-level API.
>
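> For anyone who wants to try it, the wiring looks roughly like this (a
> sketch only; the ReceiverLauncher entry point and the property names are
> from the project README as I remember it and may differ per version):
>
>   import java.util.Properties
>   import org.apache.spark.storage.StorageLevel
>   // assumed package path; check the kafka-spark-consumer README
>   import consumer.kafka.client.ReceiverLauncher
>
>   val props = new Properties()
>   props.put("zookeeper.hosts", "zk1,zk2")    // ZooKeeper ensemble
>   props.put("zookeeper.port", "2181")
>   props.put("kafka.topic", "events")         // topic to consume
>   props.put("kafka.consumer.id", "my-group") // id used for offset commits
>
>   // typically one receiver per Kafka partition
>   val stream = ReceiverLauncher.launch(ssc, props, 3,
>     StorageLevel.MEMORY_AND_DISK_SER)
>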
> And SPARK-4062, which Jerry also mentioned, recently solved the same
> problem using the Kafka high-level API.
>
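> With the WAL switched on, that receiver is created through the usual
> high-level-API entry point; a minimal sketch (the ZooKeeper quorum, group
> id, and topic below are placeholders):
>
>   import org.apache.spark.streaming.kafka.KafkaUtils
>
>   // returns a DStream of (key, message) pairs; with
>   // spark.streaming.receiver.writeAheadLog.enable=true Spark picks the
>   // reliable receiver from SPARK-4062 under the hood
>   val messages = KafkaUtils.createStream(ssc,
>     "zk1:2181,zk2:2181",  // ZooKeeper quorum
>     "my-consumer-group",  // Kafka consumer group
>     Map("events" -> 1))   // topic -> number of receiver threads
>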
> On the Kafka high-level consumer API approach, I would like to mention
> that Kafka 0.8 has a known issue, described in this wiki (
> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design),
> where consumer re-balance sometimes fails; that is one of the key reasons
> the consumer API is being rewritten for Kafka 0.9.
>
> I know a few folks have already faced this re-balancing issue while using
> the Kafka high-level API. If you ask my opinion, we at Pearson are still
> using the Low Level Consumer, as it seems more robust and performant; we
> have been using it for a few months without any issue... and also, I may
> be a little biased :)
>
> Regards,
> Dibyendu
>
>
>
> On Wed, Dec 3, 2014 at 7:04 AM, Shao, Saisai <saisai.s...@intel.com>
> wrote:
>
>> Hi Rod,
>>
>> The purpose of introducing the WAL mechanism in Spark Streaming as a
>> general solution is to let all receivers benefit from it.
>>
>> As you said, external sources like Kafka have their own checkpoint
>> mechanism, so instead of storing the data in the WAL we could store only
>> metadata and recover by re-reading from the last committed offsets. But
>> that requires a sophisticated Kafka receiver built on the low-level API,
>> and we would need to handle rebalance and fault tolerance ourselves. So
>> for now, instead of implementing a whole new receiver, we chose to
>> implement a simple one: the performance is not as good, but it is much
>> easier to understand and maintain.
>>
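>> To illustrate the alternative, here is a conceptual sketch (not actual
>> Spark code; the types and helper are hypothetical) of the metadata-only
>> idea: only offset ranges go into the WAL, and the payload is re-read
>> from Kafka itself on recovery.
>>
>>   // hypothetical type, for illustration only
>>   case class OffsetRange(topic: String, partition: Int,
>>                          fromOffset: Long, untilOffset: Long)
>>
>>   // stands in for a low-level (SimpleConsumer) fetch of one range;
>>   // Kafka retains the data, so the WAL never stores the messages
>>   def refetch(r: OffsetRange): Unit =
>>     println(s"re-reading ${r.topic}/${r.partition} " +
>>       s"[${r.fromOffset}, ${r.untilOffset})")
>>
>>   // on recovery, replay only the committed ranges found in the WAL
>>   def recover(committed: Seq[OffsetRange]): Unit =
>>     committed.foreach(refetch)
>>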
>> The design rationale and implementation of the reliable Kafka receiver
>> can be found in SPARK-4062 (https://issues.apache.org/jira/browse/SPARK-4062).
>> Improving the reliable Kafka receiver along the lines you mentioned is on
>> our roadmap.
>>
>> Thanks
>> Jerry
>>
>>
>> -----Original Message-----
>> From: RodrigoB [mailto:rodrigo.boav...@aspect.com]
>> Sent: Wednesday, December 3, 2014 5:44 AM
>> To: u...@spark.incubator.apache.org
>> Subject: Re: Low Level Kafka Consumer for Spark
>>
>> Dibyendu,
>>
>> Just to make sure I am not misunderstood: my concerns refer to the
>> upcoming Spark solution, not yours. I wanted to gather the perspective of
>> someone who implemented recovery with Kafka in a different way.
>>
>> Tnks,
>> Rod
>>
>
