Hello, I did not understand your question very well. However, I can tell you that calling .collect() on an RDD gathers all of its data on the driver node, so you should only use it when the RDD is very small. Your function validate_hostname depends on a DataFrame, and a DataFrame cannot be referenced from code running on a worker node; that is why the filter fails, because the closure you pass to filter is serialized and executed on the executors, where data_frame is not available. The first case works because, after collect(), the map is an ordinary Scala collection method executed on the driver, not an RDD transformation. In cases like this you could use broadcast variables, but my intuition is that, in general, you are taking the wrong approach to the problem.
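For example, a rough sketch of the broadcast-variable approach could look like the following (the names LogLine, lines_rdd and hostnames_df are made up here to stand in for your case class, RDD and DataFrame; the idea is to collect the small lookup data once on the driver and let the filter closure capture only a plain Set, never the DataFrame itself):

import org.apache.spark.sql.SparkSession

case class LogLine(hostname: String, message: String)

object BroadcastFilterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-filter-sketch").getOrCreate()
    import spark.implicits._

    // Stand-ins for your RDD of case classes and your lookup DataFrame.
    val lines_rdd = spark.sparkContext.parallelize(Seq(
      LogLine("host-a", "ok"),
      LogLine("host-x", "unknown host")
    ))
    val hostnames_df = Seq("host-a", "host-b").toDF("hostname")

    // Collect the (assumed small) lookup column on the driver and broadcast it,
    // so each executor gets a serializable Set instead of a DataFrame reference.
    val validHostnames = spark.sparkContext.broadcast(
      hostnames_df.select("hostname").as[String].collect().toSet
    )

    // The closure now only captures the broadcast value, so it runs fine on workers.
    val validCount = lines_rdd
      .filter(line => validHostnames.value.contains(line.hostname))
      .count()

    println(s"valid lines: $validCount")
  }
}

If the list of valid hostnames is not small, a join between two DataFrames would probably be a better fit than filtering an RDD, but that depends on what you are ultimately trying to compute.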
Best Regards,
Matteo Cossu

On 15 January 2018 at 12:56, <abdul.h.huss...@bt.com> wrote:
> Hi,
>
> My Spark app is mapping lines from a text file to case classes stored within an RDD.
>
> When I run the following code on this RDD:
>
> .collect.map(line => if(validate_hostname(line, data_frame)) line).foreach(println)
>
> it correctly calls the method validate_hostname, passing the case class and another data_frame defined within the main method. Unfortunately, the above map only returns a TraversableLike collection, so I can't do transformations and joins on this data structure. I therefore tried to apply a filter on the RDD with the following code:
>
> .filter(line => validate_hostname(line, data_frame)).count()
>
> Unfortunately, the above method of filtering the RDD does not pass the data_frame, so I get a NullPointerException, though it correctly passes the case class, which I print within the method.
>
> Where am I going wrong?
>
> Regards,
> Abdul Haseeb Hussain