First of all, thank you for your comments. What I mean by "update" is generating a new data frame with modified data. In more detail, the while loop will look something like this:
    var continue = 1
    var dfA = "a data frame"
    dfA.persist

    while (continue > 0) {
      val temp = "modified dfA"
      temp.persist
      temp.count
      dfA.unpersist

      dfA = "modified temp"
      dfA.persist
      dfA.count
      temp.unpersist

      if ("dfA is not modified") {
        continue = 0
      }
    }

The problem is that this eventually causes an OOM. Also, the number of skipped stages increases with every iteration, although I am not sure whether that is what causes the OOM. Maybe I need to check the source code of one of the Spark ML algorithms. Again, thank you all.

On Mon, Oct 17, 2016 at 10:54 PM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
> Yes, iterating over a dataframe and making changes is not uncommon.
>
> Of course RDDs, dataframes and datasets are immutable, but there is some
> optimization in the optimizer that can potentially help to dampen the
> effect/impact of creating a new RDD, DF or DS.
>
> Also, the use case you cited is similar to what is done in regression,
> clustering and other algorithms, i.e. you iterate, making a change to a
> dataframe/dataset until the desired condition is met.
>
> E.g. see this -
> https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression
> and the setting of the iteration ceiling:
>
>     // instantiate the base classifier
>     val classifier = new LogisticRegression()
>       .setMaxIter(params.maxIter)
>       .setTol(params.tol)
>       .setFitIntercept(params.fitIntercept)
>
> Now the impact of that depends on a variety of things.
>
> E.g. if the data is completely contained in memory and there is no spill
> over to disk, it might not be a big issue (of course there will still be
> memory, CPU and network overhead/latency).
>
> If you are looking at storing the data on disk (e.g. as part of a checkpoint
> or explicit storage), then there can be substantial I/O activity.
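For what it's worth, one common way to keep the query plan (and the growing list of skipped stages) from accumulating across iterations is to truncate the lineage each round with a checkpoint. The sketch below is not the original code: it assumes Spark 2.1+ (where Dataset.checkpoint is available), a SparkSession named `spark`, and a hypothetical `step` function standing in for "modified dfA".

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Checkpoint files must go somewhere durable; this path is just an example.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// Hypothetical transformation standing in for "modified dfA":
// increment values below 10, leave the rest unchanged.
def step(df: DataFrame): DataFrame =
  df.withColumn("value",
    when(col("value") < 10, col("value") + 1).otherwise(col("value")))

var dfA: DataFrame = spark.range(5).toDF("id").withColumn("value", lit(0))
var changed = true

while (changed) {
  // Eager checkpoint: materializes the data and cuts the lineage, so the
  // plan does not grow with every iteration the way persist alone allows.
  val next = step(dfA).checkpoint()

  // Stop when an iteration no longer modifies anything.
  changed = next.except(dfA).count() > 0
  dfA = next
}
```

This keeps each iteration's plan small at the cost of writing the intermediate data to the checkpoint directory, which is the I/O trade-off Jayesh mentions below.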
>
> From: Xi Shen <davidshe...@gmail.com>
> Date: Monday, October 17, 2016 at 2:54 AM
> To: Divya Gehlot <divya.htco...@gmail.com>, Mungeol Heo <mungeol....@gmail.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: Is spark a right tool for updating a dataframe repeatedly
>
> I think most of the "big data" tools, like Spark and Hive, are not designed
> to edit data. They are only designed to query data. I wonder in what
> scenario you need to update a large volume of data repeatedly.
>
> On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot <divya.htco...@gmail.com> wrote:
>
> If my understanding of your query is correct:
> in Spark, dataframes are immutable; you can't update a dataframe.
> You have to create a new dataframe to update the current one.
>
> Thanks,
> Divya
>
> On 17 October 2016 at 09:50, Mungeol Heo <mungeol....@gmail.com> wrote:
>
> Hello, everyone.
>
> As I mentioned in the title, I wonder whether Spark is the right tool for
> updating a data frame repeatedly until there is no more data to update.
>
> For example:
>
>     while (there was an update) {
>       update data frame A
>     }
>
> If it is the right tool, then what is the best practice for this kind of
> work?
> Thank you.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
> Thanks,
> David S.