Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-23 Thread David Rosenstrauch
That sounds like it might fit the bill.  I'll take a look - thanks!

DR

On Mon, Jan 22, 2018 at 11:26 PM, vermanurag 
wrote:

> Looking at the description of the problem, window functions may solve your
> issue. They allow operations over a window that can include records
> before/after the particular record.
>


Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-23 Thread David Rosenstrauch
Thanks, but broadcast variables won't achieve what I'm looking to do.  I'm
not trying to just share a one-time set of data across the cluster.
Rather, I'm trying to set up a small cache of info that's constantly being
updated based on the records in the dataframe.

DR

On Mon, Jan 22, 2018 at 10:41 PM, naresh Goud 
wrote:

> If I understand your requirement correctly, use broadcast variables to
> replicate across all nodes the small amount of data you want to reuse.
>
>
>
> On Mon, Jan 22, 2018 at 9:24 PM David Rosenstrauch 
> wrote:
>
>> This seems like an easy thing to do, but I've been banging my head
>> against the wall for hours trying to get it to work.
>>
>> I'm processing a Spark DataFrame (in Python).  What I want to do is this: as
>> I'm processing it, I want to hold some data from one record in some local
>> variables in memory, and then use those values later while I'm processing a
>> subsequent record.  But I can't see any way to do this.
>>
>> I tried using:
>>
>> dataframe.select(a_custom_udf_function('some_column'))
>>
>> ... and then reading/writing to local variables in the udf function, but
>> I can't get this to work properly.
>>
>> My next guess would be to use dataframe.foreach(a_custom_function) and
>> try to save data to local variables in there, but I have a suspicion that
>> may not work either.
>>
>>
>> What's the correct way to do something like this in Spark?  In Hadoop I
>> would just go ahead and declare local variables, and read and write to them
>> in my map function as I like.  (Although with the knowledge that a) the
>> same map function would get repeatedly called for records with many
>> different keys, and b) there would be many different instances of my code
>> spread across many machines, and so each map function running on an
>> instance would only see a subset of the records.)  But in Spark it seems to
>> be extraordinarily difficult to create local variables that can be read
>> from / written to across different records in the dataframe.
>>
>> Perhaps there's something obvious I'm missing here?  If so, any help
>> would be greatly appreciated!
>>
>> Thanks,
>>
>> DR
>>
>>


Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-22 Thread vermanurag
Looking at the description of the problem, window functions may solve your
issue. They allow operations over a window that can include records
before/after the particular record.
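
A minimal sketch of the idea in PySpark (the DataFrame and column names here
are made up for illustration): lag() over an ordered window puts the previous
record's value onto the current row, so you don't need to keep your own cache.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: (user_id, ts, value)
events = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, 12.5), ("b", 1, 3.0), ("b", 2, 4.5)],
    ["user_id", "ts", "value"])

# Order records by ts within each user_id; lag() exposes the previous
# record's value on the current row.
w = Window.partitionBy("user_id").orderBy("ts")

(events
 .withColumn("prev_value", F.lag("value").over(w))
 .withColumn("delta", F.col("value") - F.col("prev_value"))
 .show())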







Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-22 Thread naresh Goud
If I understand your requirement correctly, use broadcast variables to
replicate across all nodes the small amount of data you want to reuse.
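
A rough sketch of what I mean (all names here are made up): a small lookup
dict is broadcast once and then read inside a UDF on every node.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Small, read-only lookup table you want available on every node.
country_names = {"US": "United States", "IN": "India"}
bc_lookup = spark.sparkContext.broadcast(country_names)

# Executors read bc_lookup.value; the dict is shipped once per node
# rather than re-sent with every task.
to_country_name = F.udf(
    lambda code: bc_lookup.value.get(code, "unknown"), StringType())

df = spark.createDataFrame([("US",), ("IN",), ("FR",)], ["code"])
df.withColumn("country", to_country_name("code")).show()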



On Mon, Jan 22, 2018 at 9:24 PM David Rosenstrauch 
wrote:

> This seems like an easy thing to do, but I've been banging my head against
> the wall for hours trying to get it to work.
>
> I'm processing a Spark DataFrame (in Python).  What I want to do is this: as
> I'm processing it, I want to hold some data from one record in some local
> variables in memory, and then use those values later while I'm processing a
> subsequent record.  But I can't see any way to do this.
>
> I tried using:
>
> dataframe.select(a_custom_udf_function('some_column'))
>
> ... and then reading/writing to local variables in the udf function, but I
> can't get this to work properly.
>
> My next guess would be to use dataframe.foreach(a_custom_function) and try
> to save data to local variables in there, but I have a suspicion that may
> not work either.
>
>
> What's the correct way to do something like this in Spark?  In Hadoop I
> would just go ahead and declare local variables, and read and write to them
> in my map function as I like.  (Although with the knowledge that a) the
> same map function would get repeatedly called for records with many
> different keys, and b) there would be many different instances of my code
> spread across many machines, and so each map function running on an
> instance would only see a subset of the records.)  But in Spark it seems to
> be extraordinarily difficult to create local variables that can be read
> from / written to across different records in the dataframe.
>
> Perhaps there's something obvious I'm missing here?  If so, any help would
> be greatly appreciated!
>
> Thanks,
>
> DR
>
>


How to hold some data in memory while processing rows in a DataFrame?

2018-01-22 Thread David Rosenstrauch
This seems like an easy thing to do, but I've been banging my head against
the wall for hours trying to get it to work.

I'm processing a Spark DataFrame (in Python).  What I want to do is this: as
I'm processing it, I want to hold some data from one record in some local
variables in memory, and then use those values later while I'm processing a
subsequent record.  But I can't see any way to do this.

I tried using:

dataframe.select(a_custom_udf_function('some_column'))

... and then reading/writing to local variables in the udf function, but I
can't get this to work properly.
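
To be concrete, here's roughly the kind of thing I tried (names are made up):
the UDF reads and updates a plain Python dict, hoping that later records see
the value left behind by earlier ones.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# The local "cache" I hoped to update from one record and read on the next.
last_seen = {"value": None}

def remember_previous(value):
    prev = last_seen["value"]      # read what the previous record stored
    last_seen["value"] = value     # stash this record's value for the next one
    return prev

remember_udf = F.udf(remember_previous, DoubleType())

df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["some_column"])

# Each task gets its own pickled copy of last_seen, so an update made while
# processing one row isn't reliably visible when the next row is processed.
df.select(remember_udf("some_column").alias("prev_value")).show()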

My next guess would be to use dataframe.foreach(a_custom_function) and try
to save data to local variables in there, but I have a suspicion that may
not work either.


What's the correct way to do something like this in Spark?  In Hadoop I
would just go ahead and declare local variables, and read and write to them
in my map function as I like.  (Although with the knowledge that a) the
same map function would get repeatedly called for records with many
different keys, and b) there would be many different instances of my code
spread across many machines, and so each map function running on an
instance would only see a subset of the records.)  But in Spark it seems to
be extraordinarily difficult to create local variables that can be read
from / written to across different records in the dataframe.

Perhaps there's something obvious I'm missing here?  If so, any help would
be greatly appreciated!

Thanks,

DR