I think the shuffle is unavoidable, given that the input partitions
(probably Hadoop input splits in your case) are not arranged the way a
cogroup job needs them. But maybe you can try:

  1) co-partition your data for the cogroup:

    import org.apache.spark.HashPartitioner

    val par = new HashPartitioner(128)
    val big = sc.textFile(...).map(...).partitionBy(par)
    val small = sc.textFile(...).map(...).partitionBy(par)
    ...

  See discussion in
https://groups.google.com/forum/#!topic/spark-users/gUyCSoFo5RI
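
  Once both RDDs are partitioned by par, the subsequent cogroup can reuse
that partitioning instead of shuffling again. A minimal sketch, continuing
the snippet above (the parsing is still elided):

    // Both sides share the same partitioner, so cogroup needs no extra
    // shuffle beyond the one already paid in partitionBy.
    val grouped = big.cogroup(small)  // RDD[(K, (Iterable[V], Iterable[W]))]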

  2) since you have 25 GB of memory on each node, you can use a broadcast
variable in Spark to distribute the smaller dataset to every node and join
against it on the map side, avoiding the shuffle entirely (a sketch follows
below).
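
  For illustration, a minimal sketch of that broadcast approach. It assumes
a hypothetical parseLine that turns a line into a (key, value) pair; the
paths are placeholders, and it only works if the grouped small dataset fits
in driver memory:

    import org.apache.spark.SparkContext._  // pair RDD ops (older Spark)

    // Sketch only: parseLine and the paths are hypothetical.
    def parseLine(line: String): (String, String) = {
      val Array(k, v) = line.split("\t", 2)
      (k, v)
    }

    // Group the small dataset by key and pull it to the driver.
    val smallMap: Map[String, Iterable[String]] =
      sc.textFile("hdfs:///small.txt").map(parseLine).groupByKey()
        .collect().toMap

    // Ship one read-only copy to each node instead of shuffling.
    val smallBc = sc.broadcast(smallMap)

    // Map-side join: the big RDD is processed in place, no shuffle write.
    val joined = sc.textFile("hdfs:///big.txt").map(parseLine).map {
      case (k, v) => (k, (v, smallBc.value.get(k)))
    }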



2014-05-18 4:41 GMT+02:00 Puneet Lakhina <puneet.lakh...@gmail.com>:

> Hi,
>
> I'm new to Spark and I wanted to understand a few things conceptually so
> that I can optimize my Spark job. I have a large text file (~14G, 200k
> lines). This file is available on each worker node of my Spark cluster. The
> job I run calls sc.textFile(...).flatMap(...). The function that I pass
> into flatMap splits each line of the file into a key and a value. I also
> have another text file which is smaller in size (~1.5G) but has many more
> lines, because it has more than one value per key, spread across multiple
> lines. I call the same textFile and flatMap functions on the other file
> and then call groupByKey to have all values for a key available as a list.
>
> Having done this, I then cogroup these two RDDs. I have the following
> questions:
>
> 1. Is this sequence of steps the best way to achieve what I want, i.e. a
> join across the two data sets?
>
> 2. I have an 8-node cluster (25 GB memory each). The large file's flatMap
> spawns about 400 tasks whereas the small file's flatMap spawns only about
> 30. The large file's flatMap takes about 2-3 minutes, and during this time
> it seems to do about 3G of shuffle write. I want to understand whether this
> shuffle write is something I can avoid. From what I have read, the shuffle
> write is a disk write. Is that correct? Also, is the reason for the shuffle
> write the fact that the partitioner after the flatMap ends up having to
> redistribute the data across the cluster?
>
> Please let me know if I haven't provided enough information. I'm new to
> Spark, so if you see anything fundamental that I don't understand, please
> feel free to point me to a link with more detailed information.
>
> Thanks,
> Puneet




-- 
*JU Han*

Data Engineer @ Botify.com

+33 0619608888
