If I understand your requirement correctly, use broadcast variables to
replicate the small amount of data you want to reuse across all nodes.
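
For example, a minimal sketch (the lookup dict, column name and UDF name
here are just placeholders, not from your code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Small, read-only data you want available on every executor.
lookup = {"a": "1", "b": "2"}
bc_lookup = spark.sparkContext.broadcast(lookup)

@udf(returnType=StringType())
def enrich(value):
    # bc_lookup.value is the read-only copy shipped to each executor;
    # every invocation of the UDF sees the same data.
    return bc_lookup.value.get(value, "missing")

# df = df.select(enrich("some_column"))

Note that broadcast variables are read-only on the executors; if you need
to accumulate state across records you'd need a different approach.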



On Mon, Jan 22, 2018 at 9:24 PM David Rosenstrauch <daro...@gmail.com>
wrote:

> This seems like an easy thing to do, but I've been banging my head against
> the wall for hours trying to get it to work.
>
> I'm processing a Spark dataframe (in Python).  What I want to do is, as
> I'm processing it, hold some data from one record in local variables in
> memory, and then use those values later while processing a subsequent
> record.  But I can't see any way to do this.
>
> I tried using:
>
> dataframe.select(a_custom_udf_function('some_column'))
>
> ... and then reading/writing to local variables in the udf function, but I
> can't get this to work properly.
>
> My next guess would be to use dataframe.foreach(a_custom_function) and try
> to save data to local variables in there, but I have a suspicion that may
> not work either.
>
>
> What's the correct way to do something like this in Spark?  In Hadoop I
> would just go ahead and declare local variables, and read and write to them
> in my map function as I like.  (Although with the knowledge that a) the
> same map function would get repeatedly called for records with many
> different keys, and b) there would be many different instances of my code
> spread across many machines, and so each map function running on an
> instance would only see a subset of the records.)  But in Spark it seems to
> be extraordinarily difficult to create local variables that can be read
> from / written to across different records in the dataframe.
>
> Perhaps there's something obvious I'm missing here?  If so, any help would
> be greatly appreciated!
>
> Thanks,
>
> DR
>
>
