I think the shuffle is unavoidable given that the input partitions
(probably Hadoop input splits in your case) are not arranged the way a
cogroup job needs them. But maybe you can try:
1) co-partition your data for cogroup:
val par = new HashPartitioner(128)
val big =
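A rough, self-contained sketch of what that co-partitioning could look like (the RDD names, the toy data, and the 128-partition count are placeholders I made up, not taken from your job):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("copartition-sketch"))

    // Placeholder key-value inputs; in your job these would come from
    // your real data sources.
    val bigRdd   = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val smallRdd = sc.parallelize(Seq((1, "x"), (3, "y")))

    // Partition both sides with the same partitioner so matching keys
    // land in the same partition.
    val par   = new HashPartitioner(128)
    val big   = bigRdd.partitionBy(par).persist()
    val small = smallRdd.partitionBy(par).persist()

    // Because big and small now share a partitioner, cogroup can reuse
    // the existing partitioning instead of shuffling both sides again.
    val grouped = big.cogroup(small)
    grouped.take(5).foreach(println)

    sc.stop()
  }
}

The point is just that partitionBy with a shared partitioner (plus persist) pays the shuffle cost once up front, so later cogroup/join calls on those RDDs don't re-shuffle.
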
Hi,
I'm new to Spark and I wanted to understand a few things conceptually so that I
can optimize my Spark job. I have a large text file (~14G, 200k lines). This
file is available on each worker node of my Spark cluster. The job I run calls
sc.textFile(...).flatMap(...). The function that I