Cody Koeninger-2 wrote
> In fact, you're using the 2 arg form of reduceByKey to shrink it down to
> 1 partition
>
> reduceByKey(sumFunc, 1)
>
> But you started with 4 kafka partitions? So they're definitely no longer
> 1:1
True. I added the second arg because we were seeing multiple threads
attempting to update the same offset. Setting it to 1 prevented that, but it
doesn't fix the core issue.

Cody Koeninger-2 wrote
>> This is what I'm suggesting, in pseudocode
>>
>> rdd.mapPartitionsWithIndex { case (i, iter) =>
>>   offset = offsets(i)
>>   result = yourReductionFunction(iter)
>>   transaction {
>>     save(result)
>>     save(offset)
>>   }
>> }.foreach { (_: Nothing) => () }
>>
>> where yourReductionFunction is just normal scala code.

I'll give this a try. Thanks, Cody.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/practical-usage-of-the-new-exactly-once-supporting-DirectKafkaInputDStream-tp11916p11928.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
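For anyone following along, the quoted pseudocode can be fleshed out into a small, self-contained Scala sketch. This is not the actual Spark job: Spark and Kafka are simulated here (the names `ExactlyOnceSketch`, `processPartition`, and `sumFunc` are made up for illustration), so only the pattern is shown — reduce each partition with plain Scala, then save the result and that partition's offset in a single transaction.

```scala
import scala.collection.mutable

// Hypothetical sketch: partitions, offsets, and the downstream store are
// simulated in memory so the pattern is runnable without Spark or Kafka.
object ExactlyOnceSketch {
  // Simulated durable store; in a real job both maps would be tables in
  // the same database so they can be updated in one transaction.
  val results = mutable.Map[Int, Int]()
  val offsets = mutable.Map[Int, Long]()

  // Stand-in for a DB transaction: both writes commit together or not at all.
  def transaction(body: => Unit): Unit = synchronized(body)

  // "yourReductionFunction" from the pseudocode: plain Scala, no shuffle,
  // so the 1:1 mapping to Kafka partitions is preserved.
  def sumFunc(iter: Iterator[Int]): Int = iter.sum

  // Body of the mapPartitionsWithIndex closure: reduce the partition,
  // then commit the result together with the partition's ending offset.
  def processPartition(i: Int, iter: Iterator[Int], untilOffset: Long): Unit = {
    val result = sumFunc(iter)
    transaction {
      results(i) = result
      offsets(i) = untilOffset
    }
  }

  def main(args: Array[String]): Unit = {
    // Three simulated Kafka partitions; offsets here are just message counts.
    val partitions = Seq(Seq(1, 2, 3), Seq(10, 20), Seq(5))
    partitions.zipWithIndex.foreach { case (data, i) =>
      processPartition(i, data.iterator, untilOffset = data.size.toLong)
    }
    println(s"results = ${results.toSeq.sortBy(_._1)}")
    println(s"offsets = ${offsets.toSeq.sortBy(_._1)}")
  }
}
```

Because each partition commits its result and offset atomically, a replay after failure either finds the offset already committed or redoes both writes together, which is the exactly-once property the pseudocode is after. A real implementation would also compare the stored offset against the incoming one before writing, to skip partitions that were already processed.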