Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mike Metzger
I've not done this in Scala yet, but in PySpark I've run into a similar issue where having too many dataframes cached caused memory problems. Calling unpersist by itself did not free the memory; setting the variable to None is what allowed all the references to be cleared and the memory
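
A minimal sketch of the pattern described in this message, assuming the loop keeps only one cached dataframe alive at a time. The names iterate_until_stable, step and count_changes are hypothetical and not from the thread; the only methods assumed on the dataframe are persist() and unpersist(), which PySpark DataFrames provide.

```python
def iterate_until_stable(df, step, count_changes, max_iters=100):
    """Replace `df` with `step(df)` until `count_changes` reports 0.

    Hypothetical helper, not part of Spark:
      step(df)                -> new dataframe derived from df
      count_changes(old, new) -> number of rows that changed; in Spark
                                 this would be an action such as count()
    """
    current = df.persist()
    for _ in range(max_iters):
        # Mark the new dataframe for caching before running any action.
        candidate = step(current).persist()
        # persist() is lazy in Spark; the action inside count_changes is
        # what would actually populate the cache for `candidate`.
        changed = count_changes(current, candidate)
        previous, current = current, candidate
        # Release the old cache, then drop the Python reference as well,
        # since unpersist alone reportedly did not free the memory.
        previous.unpersist()
        previous = None
        if changed == 0:
            break
    return current
```

With a real SparkSession, step would typically be a transformation such as a withColumn or join, and count_changes a filter followed by count(). Whether dropping the reference is required in addition to unpersist may depend on the Spark version, so treat this as something to measure rather than a rule.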

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mungeol Heo
First of all, thank you for your comments. Actually, what I mean by "update" is generating a new data frame with modified data. A more detailed while loop would be something like the one below.
var continue = 1
var dfA = "a data frame"
dfA.persist
while (continue > 0) {
  val temp = "modified dfA"

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Thakrar, Jayesh
Yes, iterating over a dataframe and making changes is not uncommon. Of course, RDDs, dataframes and datasets are immutable, but the optimizer can potentially help dampen the impact of creating a new rdd, df or ds. Also, the use-case you cited is

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Xi Shen
I think most of the "big data" tools, like Spark and Hive, are not designed to edit data. They are only designed to query data. I wonder in what scenario you would need to update a large volume of data repeatedly. On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot wrote: > If my

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Divya Gehlot
If my understanding of your query is correct: in Spark, DataFrames are immutable, so you can't update a DataFrame in place. You have to create a new DataFrame that reflects the update. Thanks, Divya On 17 October 2016 at 09:50, Mungeol Heo wrote: > Hello, everyone. > > As

Is spark a right tool for updating a dataframe repeatedly

2016-10-16 Thread Mungeol Heo
Hello, everyone. As I mentioned in the title, I wonder whether Spark is the right tool for updating a data frame repeatedly until there is no more data to update. For example: while (there was an update) { update data frame A } If it is the right tool, then what is the best practice for this