Re: Text file and shuffle

2014-05-18 Thread Han JU
I think the shuffle is unavoidable, given that the input partitions (probably Hadoop input splits in your case) are not arranged the way a cogroup job needs them. But maybe you can try: 1) co-partition your data for cogroup: val par = new HashPartitioner(128) val big =
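
The preview above is cut off, so the rest of the original snippet is not known. As a minimal sketch of the co-partitioning idea in point 1, assuming two small in-memory pair RDDs (the data, the object name CoPartitionSketch, and the local master setting are all hypothetical and not from the original message), both sides can be given the same partitioner before the cogroup:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required on Spark releases before 1.3

// Hypothetical stand-alone example, not the code from the original message.
object CoPartitionSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, purely for illustration.
    val conf = new SparkConf().setAppName("copartition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // One partitioner instance shared by both RDDs (128 partitions, as in the message).
    val par = new HashPartitioner(128)

    // Hypothetical key/value data standing in for the real inputs.
    // partitionBy shuffles each RDD once; cache() keeps the partitioned layout
    // around so that later jobs can reuse it.
    val big = sc.parallelize(Seq(1 -> "a", 2 -> "b", 1 -> "c")).partitionBy(par).cache()
    val small = sc.parallelize(Seq(1 -> "x", 3 -> "y")).partitionBy(par).cache()

    // Both parents carry the same partitioner, so cogroup can use narrow
    // dependencies on them instead of shuffling either side again.
    val grouped = big.cogroup(small)

    grouped.collect().foreach { case (key, (left, right)) =>
      println(s"$key -> ${left.toList}, ${right.toList}")
    }

    sc.stop()
  }
}
```

Note that partitionBy is itself a shuffle, so this only pays off when the partitioned RDDs are reused across several cogroups or joins.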

Text file and shuffle

2014-05-17 Thread Puneet Lakhina
Hi, I'm new to Spark and I want to understand a few things conceptually so that I can optimize my Spark job. I have a large text file (~14 GB, 200k lines). This file is available on each worker node of my Spark cluster. The job I run calls sc.textFile(...).flatMap(...). The function that I