Re: How to hold some data in memory while processing rows in a DataFrame?
That sounds like it might fit the bill. I'll take a look - thanks!

DR

On Mon, Jan 22, 2018 at 11:26 PM, vermanurag wrote:
> Looking at description of problem, window functions may solve your issue. It
> allows operation over a window that can include records before/after the
> particular record.
Re: How to hold some data in memory while processing rows in a DataFrame?
Thanks, but broadcast variables won't achieve what I'm looking to do. I'm not trying to just share a one-time set of data across the cluster. Rather, I'm trying to set up a small cache of info that's constantly being updated based on the records in the dataframe.

DR

On Mon, Jan 22, 2018 at 10:41 PM, naresh Goud wrote:
> If I understand your requirement correctly:
> use broadcast variables to replicate across all nodes the small amount of
> data you want to reuse.
> [...]
Re: How to hold some data in memory while processing rows in a DataFrame?
Looking at description of problem, window functions may solve your issue. It allows operation over a window that can include records before/after the particular record.

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
Re: How to hold some data in memory while processing rows in a DataFrame?
If I understand your requirement correctly: use broadcast variables to replicate across all nodes the small amount of data you want to reuse.

On Mon, Jan 22, 2018 at 9:24 PM David Rosenstrauch wrote:
> This seems like an easy thing to do, but I've been banging my head against
> the wall for hours trying to get it to work.
> [...]
How to hold some data in memory while processing rows in a DataFrame?
This seems like an easy thing to do, but I've been banging my head against the wall for hours trying to get it to work.

I'm processing a Spark dataframe (in Python). What I want to do is, as I'm processing it, hold some data from one record in some local variables in memory, and then use those values later while I'm processing a subsequent record. But I can't see any way to do this.

I tried using:

    dataframe.select(a_custom_udf_function('some_column'))

... and then reading/writing to local variables in the udf function, but I can't get this to work properly.

My next guess would be to use dataframe.foreach(a_custom_function) and try to save data to local variables in there, but I have a suspicion that may not work either.

What's the correct way to do something like this in Spark? In Hadoop I would just go ahead and declare local variables, and read and write to them in my map function as I like. (Although with the knowledge that a) the same map function would get repeatedly called for records with many different keys, and b) there would be many different instances of my code spread across many machines, so each map function running on an instance would only see a subset of the records.) But in Spark it seems to be extraordinarily difficult to create local variables that can be read from / written to across different records in the dataframe.

Perhaps there's something obvious I'm missing here? If so, any help would be greatly appreciated!

Thanks,

DR