Hi,

I'm new to Spark and I want to understand a few things conceptually so that I
can optimize my Spark job. I have a large text file (~14 GB, 200k lines). This
file is available on each worker node of my Spark cluster. The job calls
sc.textFile(...).flatMap(...); the function I pass into flatMap splits each
line of the file into a key and a value. I also have another text file which is
smaller (~1.5 GB) but has many more lines, because it has more than one value
per key spread across multiple lines. I call the same textFile and flatMap
functions on this other file and then call groupByKey so that all values for a
key are available as a list.

Having done this, I then cogroup these two RDDs. A simplified sketch of the whole job is below, and my questions follow it.
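
Roughly, the job looks like this (in Scala; the file paths and the parseLargeLine / parseSmallLine functions are just placeholders for my actual split logic):

  // sc is the SparkContext; each parse function returns zero or more (key, value) pairs
  val largePairs = sc.textFile("/data/large.txt")        // ~14 GB, 200k lines
    .flatMap(line => parseLargeLine(line))               // emits (key, value) pairs

  val smallPairs = sc.textFile("/data/small.txt")        // ~1.5 GB, many more lines
    .flatMap(line => parseSmallLine(line))               // emits (key, value) pairs

  val smallGrouped = smallPairs.groupByKey()             // all values for a key as one collection

  val grouped = largePairs.cogroup(smallGrouped)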

1. Is this sequence of steps the best way to achieve what I want, i.e. a join
across the two data sets?
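
For instance, I don't know whether I should skip the groupByKey on the smaller file and join the two pair RDDs directly instead, something like:

  val joined = largePairs.join(smallPairs)   // join instead of groupByKey + cogroup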

2. I have an 8-node cluster (25 GB of memory per node). The flatMap over the
large file spawns roughly 400 tasks, whereas the flatMap over the small file
spawns only about 30. The large file's flatMap takes about 2-3 minutes, and
during that time it seems to do about 3 GB of shuffle write. I want to
understand whether this shuffle write is something I can avoid. From what I
have read, the shuffle write is a disk write; is that correct? Also, is the
reason for the shuffle write that the partitioner for the flatMap ends up
having to redistribute the data across the cluster?
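
For example, I was wondering whether explicitly partitioning both RDDs with the same partitioner before the cogroup would change anything, something along the lines of:

  import org.apache.spark.HashPartitioner

  val part = new HashPartitioner(8)                      // e.g. one partition per node; just a guess
  val largeByKey = largePairs.partitionBy(part)          // shuffled once, up front
  val smallGrouped = smallPairs.groupByKey(part)         // grouped with the same partitioner
  val grouped = largeByKey.cogroup(smallGrouped, part)

or whether the data has to be shuffled across the cluster at least once regardless.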

Please let me know if I haven't provided enough information. I'm new to Spark,
so if you see anything fundamental that I don't understand, please feel free to
just point me to a link that provides more detailed information.

Thanks,
Puneet
