Yes, partitionBy will shuffle unless it happens to be partitioning
with the exact same partitioner the parent RDD had.
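
A minimal sketch of that rule (app name, master and data are placeholders, not from the thread):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionByDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partitionBy-demo").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

    // The parent RDD has no partitioner yet, so this partitionBy shuffles.
    val byHash = pairs.partitionBy(new HashPartitioner(3))

    // Partitioning again with an equal partitioner is a no-op:
    // partitionBy hands back the parent RDD unchanged, so no shuffle.
    val again = byHash.partitionBy(new HashPartitioner(3))
    assert(again eq byHash)

    sc.stop()
  }
}
```
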
On Wed, Oct 12, 2016 at 8:34 AM, Samy Dindane wrote:
Hey Cody,

I ended up choosing a different way to do things, which is using Kafka to
commit my offsets.
It works fine, except it stores the offsets in ZK instead of a Kafka topic
(I'm investigating this right now).
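
For what it's worth, with the spark-streaming-kafka-0-10 integration, commitAsync stores offsets in Kafka's __consumer_offsets topic rather than ZooKeeper. A minimal sketch of that pattern, with the broker address, topic and group id as placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object CommitOffsetsToKafka {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("commit-demo").setMaster("local[2]"), Seconds(5))

    // Placeholder connection settings.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "demo-group",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("some-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch here, then commit ...
      // This commits to Kafka itself, not to ZooKeeper.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```
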
I understand your explanations, thank you, but I have one question: when you say:

> Repartition almost always involves a shuffle.
> Let me see if I can explain the recovery stuff...
> Say you start with two kafka partitions, topic-0 and topic-1.
> You shuffle those across 3 spark partitions; we'll label them A, B and C.
> Your job has written
> fileA: results for A, offset ranges
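
For the no-shuffle case, the "results for A, offset ranges" idea can be sketched as follows; the output path and file layout are invented for illustration, and `stream` is the direct stream from the previous sketch:

```scala
import java.io.PrintWriter
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { records =>
    // Before any shuffle, spark partition i corresponds to offsetRanges(i).
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)

    // Write the results and the offset range that produced them into the
    // same (hypothetical) file, so a restart can tell what is already done.
    val out = new PrintWriter(
      s"/tmp/results-${osr.topic}-${osr.partition}-${osr.fromOffset}")
    try {
      records.foreach(r => out.println(r.value))
      out.println(s"# offsets ${osr.fromOffset} to ${osr.untilOffset}")
    } finally out.close()
  }
}
```

Once the data is shuffled across the three spark partitions A, B and C, that one-to-one mapping is gone: a single output file can contain records from both topic-0 and topic-1, which is exactly what complicates recovery.
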
On 10/10/2016 8:14 PM, Cody Koeninger wrote:
Glad it was helpful :)

As far as executors, my expectation is that if you have multiple
executors running, and one of them crashes, the failed task will be
resubmitted on a different executor. That is typically what I observe
in spark apps; if that's not what you're seeing I'd try to get help on
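
One related knob: how many attempts a failing task gets before the whole job is failed is controlled by spark.task.maxFailures (the default on a cluster is 4). A one-line sketch:

```scala
import org.apache.spark.SparkConf

// Allow each task up to 8 attempts before failing the job
// (8 is an example value; the cluster default is 4).
val conf = new SparkConf().set("spark.task.maxFailures", "8")
```
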
I just noticed that you're the author of the code I linked in my previous
email. :) It's helpful.
When using `foreachPartition` or `mapPartitions`, I noticed I can't ask Spark
to write the data to disk using `df.write()`; I need to use the iterator
to do so, which means losing the
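
A sketch of the contrast being described here, with paths and schema invented for illustration: `df.write()` drives the whole write from the driver, while inside `foreachPartition` there is only an iterator to consume:

```scala
import java.io.PrintWriter
import org.apache.spark.TaskContext
import org.apache.spark.sql.{Row, SparkSession}

object WriteContrast {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-contrast").master("local[2]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Driver-side API: Spark plans and executes the write for you.
    df.write.mode("overwrite").parquet("/tmp/out-parquet")

    // Inside foreachPartition there is no DataFrame, only Iterator[Row];
    // each record has to be written by hand.
    df.rdd.foreachPartition { rows: Iterator[Row] =>
      val out = new PrintWriter(s"/tmp/out-part-${TaskContext.get.partitionId}")
      try rows.foreach(r => out.println(r.mkString(",")))
      finally out.close()
    }

    spark.stop()
  }
}
```
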
What is it you're actually trying to accomplish?
On Mon, Oct 10, 2016 at 5:26 AM, Samy Dindane wrote:
> I managed to make a specific executor crash by using
> TaskContext.get.partitionId and throwing an exception for a specific
> executor.
>
> The issue I have now is that the
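
The crash trick from that last message looks roughly like this; the partition number and error message are invented, and note that throwing only fails that one task (which Spark retries), while taking down the executor JVM itself would need something like System.exit:

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object FailOnePartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("fail-demo").setMaster("local[2]"))

    sc.parallelize(1 to 100, numSlices = 4).foreachPartition { nums =>
      // Fail only the task handling partition 2, so recovery behaviour
      // for a single failing task can be observed.
      if (TaskContext.get.partitionId == 2)
        throw new RuntimeException("simulated failure on partition 2")
      nums.foreach(n => println(n))
    }

    sc.stop()
  }
}
```
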