Profiling tasks
Hi,

Is it possible to increase the logging to get more details on what exactly the tasks are doing? I have a slow operation for which I'm trying to find out where the time is being spent. The operation is a cogroup() followed by a count(). In the logs on each worker node, all I see is the fetch of map outputs that are not local.

Thanks,
Puneet

--
Regards,
Puneet
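One way to get more per-task detail is to raise the log level in Spark's log4j configuration on each node. A minimal sketch, assuming a conf/log4j.properties created from the log4j.properties.template that ships with Spark (the appender settings below follow that template):

    # conf/log4j.properties on each node
    # Raise the root level from INFO to DEBUG for per-task detail.
    log4j.rootCategory=DEBUG, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

DEBUG output is verbose across the whole cluster, so it is usually worth reverting to INFO once the slow stage has been identified.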
Re: Broadcast Variables
To answer my own question: that does seem to be the right way. I was concerned about whether the data behind a broadcast variable would end up getting serialized if I used it as an instance variable of the function. I realized that doesn't happen, because the broadcast variable's value is marked as transient:

1. Http - https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala
2. Torrent - https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala

--
Regards,
Puneet
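In code, the pattern being confirmed here might look like the sketch below, assuming the Spark 1.x Java API (where Function is an interface). The class name LookupFunction and the Map types are hypothetical; the point is that only the lightweight Broadcast handle is serialized with the function, and the value is fetched on the worker via value():

    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.broadcast.Broadcast;
    import java.util.Map;

    // Named function that holds the Broadcast handle, not the Map itself.
    // The handle is what gets serialized with the closure; the underlying
    // value is transient and is fetched on each executor when needed.
    class LookupFunction implements Function<String, String> {
      private final Broadcast<Map<String, String>> lookup;

      public LookupFunction(Broadcast<Map<String, String>> lookup) {
        this.lookup = lookup;
      }

      public String call(String line) {
        // Read the broadcast value on the worker side.
        String v = lookup.value().get(line);
        return v != null ? v : line;
      }
    }

    // Usage:
    //   Broadcast<Map<String, String>> broadcastVar = sc.broadcast(val);
    //   sc.textFile("...").map(new LookupFunction(broadcastVar));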
Broadcast Variables
Hi,

I'm confused about the right way to use broadcast variables from Java. My code looks something like this:

    Map<K, V> val = ... // build the Map to be broadcast
    Broadcast<Map<K, V>> broadcastVar = sc.broadcast(val);

    sc.textFile(...).map(new SomeFunction());
    // SomeFunction needs to use broadcastVar

My question is: should I pass broadcastVar to SomeFunction as a constructor parameter that it can keep around as an instance variable, i.e.

    sc.textFile(...).map(new SomeFunction(broadcastVar));

    class SomeFunction extends Function {
      public SomeFunction(Broadcast<Map<K, V>> var) {
        this.var = var;
      }

      public T call() {
        // Do something with var
      }
    }

Is the above the right way to use broadcast variables when not using anonymous inner classes as functions?

--
Regards,
Puneet
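For comparison, the anonymous-inner-class variant mentioned above might look like the following sketch (the String types are placeholders). One caveat: an anonymous class declared inside an instance method also captures the enclosing object, which must then be serializable, so it is safest to build such functions in a static method or use a named class as above:

    final Broadcast<Map<String, String>> broadcastVar = sc.broadcast(val);

    JavaRDD<String> out = sc.textFile("...").map(
        new Function<String, String>() {
          public String call(String line) {
            // Captures the final broadcastVar from the enclosing scope;
            // only the handle travels with the serialized closure.
            String v = broadcastVar.value().get(line);
            return v != null ? v : line;
          }
        });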
Text file and shuffle
Hi,

I'm new to Spark and I wanted to understand a few things conceptually so that I can optimize my Spark job.

I have a large text file (~14 GB, 200k lines). This file is available on each worker node of my Spark cluster. The job I run calls sc.textFile(...).flatMap(...). The function that I pass into flatMap splits each line from the file into a key and a value.

I have another text file which is smaller in size (~1.5 GB) but has many more lines, because it has more than one value per key, spread across multiple lines. I call the same textFile and flatMap functions on the other file and then call groupByKey to have all values for a key available as a list. Having done this, I then cogroup these two RDDs.

I have the following questions:

1. Is this sequence of steps the best way to achieve what I want, i.e. a join across the two data sets?

2. I have an 8-node cluster (25 GB memory each). The large file's flatMap spawns about 400-odd tasks, whereas the small file's flatMap only spawns about 30-odd tasks. The large file's flatMap takes about 2-3 minutes, and during this time it seems to do about 3 GB of shuffle write. I want to understand whether this shuffle write is something I can avoid. From what I have read, the shuffle write is a disk write. Is that correct? Also, is the reason for the shuffle write the fact that the partitioner for flatMap ends up having to redistribute the data across the cluster?

Please let me know if I haven't provided enough information. I'm new to Spark, so if you see anything fundamental that I don't understand, please feel free to just point me to a link with more detailed information.

Thanks,
Puneet
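For concreteness, the pipeline described above might be sketched as follows, assuming the Spark 1.x Java API; the file paths, tab delimiter, and parse logic are placeholders, and mapToPair is used as a one-to-one simplification of the flatMap described. Note that cogroup itself collects all values per key from both sides, which is where the shuffle happens:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // Hypothetical parse step: split each line into a (key, value) pair.
    PairFunction<String, String, String> parse =
        new PairFunction<String, String, String>() {
          public Tuple2<String, String> call(String line) {
            String[] parts = line.split("\t", 2);  // placeholder delimiter
            return new Tuple2<String, String>(parts[0], parts[1]);
          }
        };

    JavaPairRDD<String, String> large =
        sc.textFile("/path/to/large.txt").mapToPair(parse);
    JavaPairRDD<String, String> small =
        sc.textFile("/path/to/small.txt").mapToPair(parse);

    // cogroup groups all values per key from both RDDs; this is the
    // step that redistributes (shuffles) data across the cluster.
    JavaPairRDD<String, Tuple2<Iterable<String>, Iterable<String>>> joined =
        large.cogroup(small);

    long numKeys = joined.count();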