Yes, iterating over a dataframe and making changes is not uncommon.
Of course RDDs, DataFrames and Datasets are immutable, but there are some 
optimizations in the optimizer that can potentially help to dampen the 
effect/impact of creating a new RDD, DataFrame or Dataset.
Also, the use-case you cited is similar to what is done in regression, 
clustering and other iterative algorithms, i.e. you iterate, making a change 
to a DataFrame/Dataset, until the desired condition is met.
E.g. see the linear regression example and the setting of the iteration 
ceiling here - 
https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression

// instantiate the base classifier
val classifier = new LogisticRegression()
  .setMaxIter(params.maxIter)
  .setTol(params.tol)
  .setFitIntercept(params.fitIntercept)

Now the impact of that depends on a variety of things.
E.g. if the data is completely contained in memory and there is no spill-over 
to disk, it might not be a big issue (of course there will still be memory, CPU 
and network overhead/latency).
If you are looking at storing the data on disk (e.g. as part of a checkpoint or 
explicit storage), then there can be substantial I/O activity.
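To make that concrete, here is a minimal sketch of such a loop. The column 
names ("value", "needsUpdate"), the update rule and the convergence test are 
all hypothetical, and it assumes an active SparkSession with a checkpoint 
directory configured. Dataset.checkpoint (available in newer Spark versions) 
is used to truncate the growing lineage; on older versions, persisting plus 
an explicit action serves a similar purpose:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Iteratively "update" a DataFrame until no rows need updating
// or an iteration ceiling is hit (mirroring setMaxIter above).
def iterateUntilStable(start: DataFrame, maxIter: Int = 20): DataFrame = {
  var df = start
  var iter = 0
  var remaining = df.filter(col("needsUpdate")).count()
  while (remaining > 0 && iter < maxIter) {
    // Each pass produces a *new* DataFrame; the old one is discarded.
    df = df
      .withColumn("value",
        when(col("needsUpdate"), col("value") + 1).otherwise(col("value")))
      .withColumn("needsUpdate", col("value") < 10)
    // Checkpoint periodically to truncate the growing lineage,
    // at the cost of I/O to the checkpoint directory on disk.
    if (iter % 5 == 4) df = df.checkpoint()
    remaining = df.filter(col("needsUpdate")).count()
    iter += 1
  }
  df
}
```

Note that the count() in each pass forces an action per iteration, which is 
exactly where the per-pass memory/CPU/network cost (and, with checkpointing, 
the I/O cost) shows up.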



From: Xi Shen <davidshe...@gmail.com>
Date: Monday, October 17, 2016 at 2:54 AM
To: Divya Gehlot <divya.htco...@gmail.com>, Mungeol Heo <mungeol....@gmail.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: Is spark a right tool for updating a dataframe repeatedly

I think most of the "big data" tools, like Spark and Hive, are not designed to 
edit data; they are designed to query it. I wonder in what scenario you need 
to update a large volume of data repeatedly.


On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot 
<divya.htco...@gmail.com> wrote:
If my understanding of your query is correct:
In Spark, DataFrames are immutable, so you can't update a DataFrame in place.
You have to create a new DataFrame to "update" the current one.
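In code form (a sketch, assuming an active SparkSession `spark`), an "update" 
is really the creation of a new DataFrame:

```scala
import org.apache.spark.sql.functions._

val df = spark.range(5).toDF("id")
// withColumn does not modify df; it returns a brand-new DataFrame.
val updated = df.withColumn("id", col("id") * 2)
// df still holds the original values 0..4; updated holds 0, 2, 4, 6, 8.
```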


Thanks,
Divya


On 17 October 2016 at 09:50, Mungeol Heo 
<mungeol....@gmail.com> wrote:
Hello, everyone.

As I mentioned in the title, I wonder whether Spark is the right tool for
updating a data frame repeatedly until there is no more data to
update.

For example.

while (there was an update in the last pass) {
  update data frame A
}

If it is the right tool, then what is the best practice for this kind of work?
Thank you.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--

Thanks,
David S.
