Hello,
I did not fully understand your question.
However, I can tell you that calling .collect() on an RDD gathers all of
its data onto the driver node. For this reason, you should use it only
when the RDD is very small.
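For example (a minimal sketch, assuming a SparkContext named sc and a
hypothetical input file):

    val rdd = sc.textFile("hosts.txt")  // hypothetical input
    val all = rdd.collect()             // pulls every partition into driver memory
    val sample = rdd.take(10)           // bounded alternative when a sample is enough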
Your function "validate_hostname" depends on a DataFrame. A DataFrame
cannot be referenced from code running on a worker node, which is why
that operation fails. The first case works only because, after
.collect(), "map" is a plain Scala collection method executed on the
driver, not an RDD transformation.
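To make the distinction concrete, here is a sketch (using your names, but
with hypothetical types):

    // Runs on the driver: collect() brings the data local first, so the
    // closure executes where data_frame actually lives.
    rdd.collect().map(line => validate_hostname(line, data_frame))

    // Runs on the executors: the closure is serialized and shipped to the
    // workers, where the DataFrame reference is not valid -- hence the
    // NullPointerException.
    rdd.filter(line => validate_hostname(line, data_frame)).count()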
In cases like this you could use a broadcast variable, but my intuition
is that, in general, you are taking the wrong approach to the problem.
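If the lookup data behind data_frame is small, one possible sketch is to
collect it once on the driver and broadcast the plain values (the column
name "hostname" and the helper extractHost are hypothetical here):

    // A DataFrame itself cannot be shipped to the workers, so collect the
    // small lookup data on the driver first.
    val validHosts: Set[String] =
      data_frame.select("hostname").collect().map(_.getString(0)).toSet

    // Broadcast the plain Scala set; each executor gets one read-only copy.
    val validHostsB = sc.broadcast(validHosts)

    // The closure now captures only the broadcast handle, which is safe.
    rdd.filter(line => validHostsB.value.contains(extractHost(line))).count()

Often, though, the more idiomatic solution is to turn the RDD into a
DataFrame as well and express the check as a join.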

Best Regards,

Matteo Cossu


On 15 January 2018 at 12:56, <abdul.h.huss...@bt.com> wrote:

> Hi,
>
>
>
> My Spark app is mapping lines from a text file to case classes stored
> within an RDD.
>
>
>
> When I run the following code on this rdd:
>
> .collect.map(line => if(validate_hostname(line, data_frame)) line).foreach(println)
>
>
>
> It correctly calls the method validate_hostname, passing the case class
> and another data_frame defined within the main method. Unfortunately the
> above map only returns a TraversableLike collection, so I can't do
> transformations and joins on that data structure. I therefore tried to
> apply a filter on the rdd with the following code:
>
> .filter(line => validate_hostname(line, data_frame)).count()
>
>
>
> Unfortunately, when filtering the rdd this way the data_frame is not
> passed through, so I get a NullPointerException, even though the case
> class is passed correctly (I print it within the method).
>
>
>
> Where am I going wrong?
>
>
>
> Regards,
>
> Abdul Haseeb Hussain
>
