Cody Koeninger wrote:
> In fact, you're using the 2-arg form of reduceByKey to shrink it down to
> 1 partition:
>
> reduceByKey(sumFunc, 1)
>
> But you started with 4 Kafka partitions, so they're definitely no longer
> 1:1.
True. I added the second arg because we were seeing multiple threads
attempting to update the same offset. Setting it to 1 prevented that, but
it doesn't fix the core issue.
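
For the record, here is my understanding of why the 1:1 mapping breaks
(a sketch only, assuming the direct stream's HasOffsetRanges and our own
sumFunc):

  import org.apache.spark.streaming.kafka.HasOffsetRanges

  // Before any shuffle, RDD partition i still corresponds to Kafka
  // partition i, so indexing offsetRanges by partition index is valid.
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Either form of reduceByKey introduces a shuffle, after which
  // output partition i no longer corresponds to offsets(i).
  rdd.reduceByKey(sumFunc)    // hash-partitions by key
  rdd.reduceByKey(sumFunc, 1) // shuffles everything into one partition
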
Cody Koeninger wrote:
>> This is what I'm suggesting, in pseudocode:
>>
>>   rdd.mapPartitionsWithIndex { case (i, iter) =>
>>     offset = offsets(i)
>>     result = yourReductionFunction(iter)
>>     transaction {
>>       save(result)
>>       save(offset)
>>     }
>>   }.foreach { (_: Nothing) => () }
>>
>> where yourReductionFunction is just normal Scala code.
>>
I'll give this a try. Thanks, Cody.
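
Fleshing that out into something closer to what I'll actually run; this
is only a sketch, assuming the direct stream's HasOffsetRanges and a JDBC
store, where jdbcUrl, reducePartition, saveResult and saveOffset stand in
for our own code:

  import java.sql.DriverManager
  import org.apache.spark.streaming.kafka.HasOffsetRanges

  stream.foreachRDD { rdd =>
    // Grab offset ranges before any shuffle, while RDD partitions
    // are still 1:1 with Kafka partitions.
    val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    rdd.mapPartitionsWithIndex { case (i, iter) =>
      val offset = offsets(i)
      // reducePartition is the "normal Scala code" part: a plain
      // fold over this partition's records (placeholder for our logic).
      val result = reducePartition(iter)
      // jdbcUrl, saveResult, saveOffset are placeholders for our own
      // connection string and DAO code.
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        conn.setAutoCommit(false)
        // Result and offset commit or roll back together, which is
        // what gives exactly-once semantics for this partition.
        saveResult(conn, result)
        saveOffset(conn, offset.topic, offset.partition, offset.untilOffset)
        conn.commit()
      } finally {
        conn.close() // a real version would handle rollback/retries
      }
      Iterator.empty
    }.foreach { (_: Nothing) => () } // force evaluation; nothing to collect
  }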