Yes, iterating over a dataframe and making changes is not uncommon. Of course RDDs, DataFrames, and Datasets are immutable, but there are optimizations in the optimizer that can help dampen the cost of deriving a new RDD/DataFrame/Dataset on each pass. Also, the use case you cited is similar to what is done in regression, clustering, and other iterative algorithms: you keep deriving a new dataframe/dataset until a desired condition is met. E.g. see this - https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression - and the setting of the iteration ceiling (snippet quoted further below).
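That iterate-until-done shape is the same as the while-loop in Mungeol's question below. A minimal sketch of the pattern, with a made-up needsUpdate flag column and update rule (Dataset.checkpoint() assumes Spark 2.1+ and a configured checkpoint directory; paths and names are illustrative only):

// Sketch only: iterate, deriving a new DataFrame each pass, until nothing
// is left to update or an iteration ceiling is hit.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("iterative-update").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // required before checkpoint()

var df: DataFrame = spark.read.parquet("/path/to/input") // hypothetical source
  .withColumn("needsUpdate", col("value") > lit(1.0))    // hypothetical flag

val maxIter = 20 // iteration ceiling, like setMaxIter in the MLlib example
var iter = 0
var pending = df.filter(col("needsUpdate")).count()

while (pending > 0 && iter < maxIter) {
  // "Updating" means deriving a new immutable DataFrame from the old one.
  df = df
    .withColumn("value",
      when(col("needsUpdate"), col("value") / 2).otherwise(col("value")))
    .withColumn("needsUpdate", col("value") > lit(1.0))
  // Periodically truncate the growing lineage; this is where the disk I/O
  // cost discussed below comes in.
  if (iter % 5 == 0) df = df.checkpoint()
  pending = df.filter(col("needsUpdate")).count()
  iter += 1
}

Note that each pass only builds a longer logical plan over the same immutable inputs; the periodic checkpoint truncates that lineage at the price of writing to disk.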
For reference, the classifier setup from that example, with the iteration ceiling set via setMaxIter:

// instantiate the base classifier
val classifier = new LogisticRegression()
  .setMaxIter(params.maxIter)
  .setTol(params.tol)
  .setFitIntercept(params.fitIntercept)

Now the impact of that depends on a variety of things. E.g. if the data is completely contained in memory and there is no spill-over to disk, it might not be a big issue (of course there will still be memory, CPU, and network overhead/latency). If you are storing the data on disk (e.g. as part of a checkpoint or explicit storage), then there can be substantial I/O activity.

From: Xi Shen <davidshe...@gmail.com>
Date: Monday, October 17, 2016 at 2:54 AM
To: Divya Gehlot <divya.htco...@gmail.com>, Mungeol Heo <mungeol....@gmail.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: Is spark a right tool for updating a dataframe repeatedly

I think most of the "big data" tools, like Spark and Hive, are not designed to edit data. They are only designed to query data. I wonder in what scenario you need to update a large volume of data repeatedly.

On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot <divya.htco...@gmail.com> wrote:

If my understanding of your query is correct: in Spark, DataFrames are immutable, so you can't update a DataFrame in place. You have to create a new DataFrame to "update" the current one.

Thanks,
Divya

On 17 October 2016 at 09:50, Mungeol Heo <mungeol....@gmail.com> wrote:

Hello, everyone.

As I mentioned in the title, I wonder whether Spark is the right tool for updating a data frame repeatedly until there is no more data to update. For example:

while (there was an update) {
  update data frame A
}

If it is the right tool, then what is the best practice for this kind of work?

Thank you.

--
Thanks,
David S.