I am not sure, but the RDD paper mentions the use of broadcast variables. Sometimes you need a local variable in many map-reduce jobs and do not want to copy it to all the worker nodes multiple times; in that case a broadcast variable is a good choice.
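A minimal sketch of that pattern (the SparkContext sc, the stop-word set, and the input path are illustrative assumptions, not code from this thread):

    import org.apache.spark.SparkContext

    // Assumes an existing SparkContext named sc.
    val stopWords = Set("a", "an", "the")               // hypothetical driver-local data
    val bcStopWords = sc.broadcast(stopWords)           // shipped to each worker node once

    val words = sc.textFile("hdfs://host/corpus.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .filter(w => !bcStopWords.value.contains(w))      // tasks read the broadcast copy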
2013/11/7 Walrus theCat <walrusthe...@gmail.com>

Shangyu,

Thanks for the tip re: the flag! Maybe the broadcast variable is only for "complex" data structures?

On Sun, Nov 3, 2013 at 7:58 PM, Shangyu Luo <lsy...@gmail.com> wrote:

I have met the 'Too many open files' problem before. One solution is adding 'ulimit -n 100000' to the spark-env.sh file. As for the local variable, I do not think it is the problem: I have written programs that pass local variables as function parameters, and they work.

2013/11/3 Walrus theCat <walrusthe...@gmail.com>

Hi Shangyu,

I appreciate your ongoing correspondence. To clarify, my solution didn't work, and I didn't expect it to. I was digging through the logs and found a series of exceptions (in only one of the workers):

    13/11/03 17:51:05 INFO client.DefaultHttpClient: Retrying connect
    13/11/03 17:51:05 INFO http.AmazonHttpClient: Unable to execute HTTP request: Too many open files
    java.net.SocketException: Too many open files
    ...
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:808)
    ...

I don't know why, because I do close those streams, but I'll look into it.

As an aside, I reference a spark.util.Vector from a parallelized context (in an RDD.map operation), as in the logistic regression example that ships with Spark, and it seems to work out. In the following, from the examples, note that 'w' is not a broadcast variable and 'points' is an RDD:

    var w = Vector(D, _ => 2 * rand.nextDouble - 1)
    println("Initial w: " + w)

    for (i <- 1 to ITERATIONS) {
      println("On iteration " + i)
      val gradient = points.map { p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      }.reduce(_ + _)
      w -= gradient
    }

On Sun, Nov 3, 2013 at 10:47 AM, Shangyu Luo <lsy...@gmail.com> wrote:

Hi Walrus,
Thank you for sharing your solution to your problem. I think I have met a similar problem before (i.e., one machine is working while the others are idle); I just wait for a long time and the program continues processing. I am not sure how your program filters an RDD by a locally stored set. If the set is a parameter of a function, I assume it would be copied to all the worker nodes. But it is good that you solved your problem with a broadcast variable, and the running time seems reasonable!

2013/11/3 Walrus theCat <walrusthe...@gmail.com>

Hi Shangyu,

Thanks for responding. This is a refactor of other code that isn't completely scalable because it pulls data to the driver; this version keeps everything on the cluster. I left it running for 7 hours, and the log just froze. I checked Ganglia, and only one machine's CPU seemed to be doing anything. The last log output left my code at a spot where it filters an RDD by a locally stored set. No error was thrown. I thought that was OK based on the example code, but just in case, I changed the set to a broadcast variable. The un-refactored code (which pulls all the data to the driver from time to time) runs in minutes. I've never had the log go non-responsive before, and I was wondering if anyone knew of any heuristics I could check.
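A minimal sketch of the two variants discussed here, closure capture versus broadcast (an existing SparkContext sc, an RDD[String] named ids, and loadKeepSet() are illustrative assumptions, not the actual code):

    // Hypothetical driver-side set used to filter an RDD.
    val keep: Set[String] = loadKeepSet()

    // Closure capture: keep is serialized into every task that uses it.
    val viaCapture = ids.filter(id => keep.contains(id))

    // Broadcast: keep is shipped to each worker node once and reused.
    val bcKeep = sc.broadcast(keep)
    val viaBroadcast = ids.filter(id => bcKeep.value.contains(id))

Both produce the same result; the broadcast version avoids re-shipping a large set with every task.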
Thank you

On Sat, Nov 2, 2013 at 2:55 PM, Shangyu Luo <lsy...@gmail.com> wrote:

Yes, I think so. The running time depends on what work you are doing and how large it is.

2013/11/1 Walrus theCat <walrusthe...@gmail.com>

That's what I thought, too. So is it not "hanging", just recalculating for a very long time? The log stops updating and it just gives the output I posted. Any suggestions as to parameters to change, or other data to collect, would be appreciated.

Thank you, Shangyu.

On Fri, Nov 1, 2013 at 11:31 AM, Shangyu Luo <lsy...@gmail.com> wrote:

I think the missing parents may not be abnormal. From my understanding, when a Spark task cannot find its parent, it can use some metadata to find the parent's result or recalculate the parent's value. Imagine, in a loop, a Spark task trying to find some value from the last iteration's result.

2013/11/1 Walrus theCat <walrusthe...@gmail.com>

Are there heuristics to check when the scheduler says it is "missing parents" and just hangs?

On Thu, Oct 31, 2013 at 4:56 PM, Walrus theCat <walrusthe...@gmail.com> wrote:

Hi,

I'm not sure what's going on here. My code seems to be working thus far (map at SparkLR:90 completed). What can I do to help the scheduler out here?

Thanks

    13/10/31 02:10:13 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(10, 211)
    13/10/31 02:10:13 INFO scheduler.DAGScheduler: Stage 10 (map at SparkLR.scala:90) finished in 0.923 s
    13/10/31 02:10:13 INFO scheduler.DAGScheduler: looking for newly runnable stages
    13/10/31 02:10:13 INFO scheduler.DAGScheduler: running: Set(Stage 11)
    13/10/31 02:10:13 INFO scheduler.DAGScheduler: waiting: Set(Stage 9, Stage 8)
    13/10/31 02:10:13 INFO scheduler.DAGScheduler: failed: Set()
    13/10/31 02:10:16 INFO scheduler.DAGScheduler: Missing parents for Stage 9: List(Stage 11)
    13/10/31 02:10:16 INFO scheduler.DAGScheduler: Missing parents for Stage 8: List(Stage 9)
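One sketch of a mitigation for the recomputation behavior described above (sc, the input path, and parsePoint are illustrative assumptions): caching the RDD that is reused across iterations keeps the scheduler from recomputing its stages each time, and checkpointing can truncate a long lineage.

    // Assumes an existing SparkContext sc.
    val points = sc.textFile("hdfs://host/points.txt")  // hypothetical path
      .map(parsePoint)                                  // hypothetical parser
      .cache()                                          // keep in memory across iterations

    // Optionally truncate the lineage so a lost partition is not
    // recomputed all the way from the original input:
    sc.setCheckpointDir("hdfs://host/checkpoints")      // hypothetical path
    points.checkpoint()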