Hi Dibyendu,

What are your thoughts on keeping this solution (or not), considering that Spark Streaming v1.2 will have built-in recoverability of the received data?

https://issues.apache.org/jira/browse/SPARK-1647

I'm concerned about the added complexity of this solution and the performance overhead of writing large amounts of data into HDFS on a small batch interval.

https://docs.google.com/document/d/1vTCB5qVfyxQPlHuv8rit9-zjdttlgaSrMgfCDQlCJIM/edit?pli=1#
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n20181/spark_streaming_v.png>

I think the whole solution is well designed and thought out, but I wonder whether it really fits all needs with checkpoint-based technologies like Flume or Kafka, since it hides the management of the offset from the user code. If, instead of saving the received data into HDFS, the ReceiverHandler saved some metadata (such as the offset, in the case of Kafka) specified by the custom receiver passed into the StreamingContext, then upon driver restart that metadata could be used by the custom receiver to recover the point from which it should start receiving data once more.

Anyone's comments will be greatly appreciated.

Thanks,
Rod
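
The offset-metadata idea above could be sketched roughly as follows. This is a minimal, Spark-free Python sketch under stated assumptions: `OffsetCheckpoint` and its file-based store are hypothetical stand-ins for whatever durable metadata store the ReceiverHandler would actually write to, and the per-partition offset dict stands in for the receiver-specified metadata:

```python
import json
import os
import tempfile


class OffsetCheckpoint:
    """Hypothetical durable store for per-partition offsets.

    Stands in for the small piece of metadata the ReceiverHandler
    would persist instead of the received data itself.
    """

    def __init__(self, path):
        self.path = path

    def save(self, offsets):
        # Write to a temp file and rename atomically, so a crash
        # mid-write leaves the previous checkpoint intact.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(offsets, f)
        os.replace(tmp, self.path)

    def restore(self):
        # On driver restart, recover the point from which the custom
        # receiver should resume consuming; empty if no checkpoint yet.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)
```

The point of the sketch is the size of what gets persisted: a few bytes of offsets per batch rather than the full received payload into HDFS, with recovery delegated to the source's own replay capability (Kafka re-reads from the saved offset).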



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p20181.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
