Can you go ahead and open a Jira ticket with that explanation?
Is there a reason you need to use receivers instead of the direct stream?
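For reference, the lossy path described below can be sketched in plain Scala with no Spark dependency. Everything here is a stand-in: `writeToLog` models the WAL write, and passing `walSucceeds = false` models the HDFS timeout. The point is only to show that a failed write skips the queue update, so the block info never reaches the queue:

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

object WalSketch {
  case class BlockInfo(streamId: Int, blockId: String)

  private val queues = mutable.Map[Int, mutable.Queue[BlockInfo]]()

  private def getReceivedBlockQueue(streamId: Int): mutable.Queue[BlockInfo] =
    queues.getOrElseUpdate(streamId, mutable.Queue[BlockInfo]())

  // Stand-in for the write-ahead-log write; a `false` models a timeout.
  private def writeToLog(info: BlockInfo, succeed: Boolean): Boolean = succeed

  def addBlock(info: BlockInfo, walSucceeds: Boolean): Boolean =
    try {
      val writeResult = writeToLog(info, walSucceeds)
      if (writeResult) {
        // Only a successful WAL write reaches this point; on failure the
        // block is never queued, so it is silently dropped.
        synchronized { getReceivedBlockQueue(info.streamId) += info }
      }
      writeResult
    } catch {
      case NonFatal(_) => false
    }

  def queuedCount(streamId: Int): Int = getReceivedBlockQueue(streamId).size
}
```

With one successful and one failed write, only the first block ends up queued, which matches the 988-of-1000 symptom described below.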
On Tue, Jul 26, 2016 at 4:45 AM, Andy Zhao <andyrao1...@gmail.com> wrote:
> Hi guys,
>
> I wrote a Spark Streaming program which consumes 1000 messages from one
> topic of Kafka, does some transformation, and writes the result back to
> another topic. But I only found 988 messages in the second topic. I checked
> the log and confirmed all messages were received by the receivers. But I
> found an HDFS write timeout message printed by the BatchedWriteAheadLog class.
>
> I checked the source code and found code like this:
>
> /** Add received block. This event will get written to the write ahead
>  *  log (if enabled). */
> def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
>   try {
>     val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
>     if (writeResult) {
>       synchronized {
>         getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
>       }
>       logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
>         s"block ${receivedBlockInfo.blockStoreResult.blockId}")
>     } else {
>       logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
>         s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
>     }
>     writeResult
>   } catch {
>     case NonFatal(e) =>
>       logError(s"Error adding block $receivedBlockInfo", e)
>       false
>   }
> }
>
>
> It seems that ReceiverTracker tries to write the block info to HDFS, but
> the write operation times out, which causes writeToLog to return false.
> The line "getReceivedBlockQueue(receivedBlockInfo.streamId) +=
> receivedBlockInfo" is then skipped, so the block info is lost.
>
> The Spark version I am using is 1.6.1, and I did not turn on
> spark.streaming.receiver.writeAheadLog.enable.
>
> I want to know whether this is the designed behaviour.
>
> Thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-lost-data-when-ReceiverTracker-writes-Blockinfo-to-hdfs-timeout-tp27410.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>