If you run an action without persisting, most of the intermediate results
will be gone by the next iteration.
What I would do is persist on every iteration, then after some number of
iterations (say 5) write the dataframe to disk and reload it. At that point you
should call unpersist on the earlier dataframes to free the memory, as the
cached data is no longer needed.
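
A minimal sketch of the approach described above, assuming the dataframes
arrive as an iterable of Spark dataframes with the same schema. The function
name, the `scratch_dir` location, the Parquet format, and the default of 5 are
illustrative assumptions, not something specified in this thread:

```python
def union_with_periodic_reload(spark, frames, scratch_dir, every=5):
    """Union Spark dataframes, writing to disk and reloading every `every` unions.

    `spark` is an existing SparkSession; `scratch_dir` is a hypothetical
    writable location (local path, HDFS, etc.).
    """
    result = None
    cached = []  # persisted intermediates that still hold memory
    for i, df in enumerate(frames, start=1):
        result = df if result is None else result.union(df)
        result.persist()  # lazy: the cache is filled by the next action
        cached.append(result)
        if i % every == 0:
            path = "{}/step_{}".format(scratch_dir, i)
            # Writing is an action, so it materializes the union so far.
            result.write.mode("overwrite").parquet(path)
            # The cached intermediates are no longer needed once the data is on disk.
            for old in cached:
                old.unpersist()
            cached = []
            # The reloaded dataframe's plan starts from the files on disk.
            result = spark.read.parquet(path)
    return result
```

Parquet is used here only because it round-trips the schema; any format that
can be written and read back would serve the same purpose.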
Thanks,
Assaf.
From: Saatvik Shah [mailto:[email protected]]
Sent: Tuesday, June 20, 2017 8:50 PM
To: Mendelson, Assaf
Cc: [email protected]
Subject: Re: Merging multiple Pandas dataframes
Hi Assaf,
Thanks for the suggestion on checkpointing - I'll need to read up more on that.
My current implementation seems to crash with a GC memory limit exceeded
error when I keep calling persist across a large number of files.
Thus, I was also thinking about the constant calls to persist. Since all my
operations are Spark transformations (unions of a large number of Spark
dataframes created from Pandas dataframes), the entire process of building the
large Spark dataframe is essentially one huge transformation. Is it necessary
to call persist between unions? Shouldn't I instead wait for all the unions to
complete and call persist once at the end?
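
For reference, the alternative asked about here (union everything first,
persist once at the end) could look roughly like the sketch below. The function
name is hypothetical, and it assumes at least one Pandas dataframe and an
existing SparkSession passed in as `spark`. Note that it does nothing to
shorten the query plan, which still grows with the number of unions:

```python
from functools import reduce


def union_all_then_persist(spark, pandas_frames):
    """Convert each Pandas dataframe, union them all, then persist once.

    Assumes `pandas_frames` is non-empty and every frame has the same schema.
    """
    spark_frames = [spark.createDataFrame(pdf) for pdf in pandas_frames]
    # union is a lazy transformation; nothing is computed at this point.
    combined = reduce(lambda left, right: left.union(right), spark_frames)
    combined.persist()
    combined.count()  # run one action so the cache is actually populated
    return combined
```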
On Tue, Jun 20, 2017 at 2:52 AM, Mendelson, Assaf <[email protected]> wrote:
Note that depending on the number of iterations, the query plan for the
dataframe can become long and this can cause slowdowns (or even crashes).
A possible solution would be to checkpoint (or simply save and reload the
dataframe) every once in a while. When reloading from disk, the newly loaded
dataframe's lineage is just the disk...
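
A rough sketch of the checkpoint variant, assuming a Spark version that
provides `DataFrame.checkpoint` (2.1 or later). The function name, the
`checkpoint_dir` value, and the interval are illustrative assumptions:

```python
def union_with_checkpoints(spark, frames, checkpoint_dir, every=5):
    """Union Spark dataframes, checkpointing every `every` unions to cut the plan."""
    # A checkpoint directory must be set before checkpoint() is called.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    result = None
    for i, df in enumerate(frames, start=1):
        result = df if result is None else result.union(df)
        if i % every == 0:
            # checkpoint() is eager by default and returns a new dataframe
            # whose plan no longer includes the earlier unions.
            result = result.checkpoint()
    return result
```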
Thanks,
Assaf.
-----Original Message-----
From: saatvikshah1994 [mailto:[email protected]]
Sent: Tuesday, June 20, 2017 2:22 AM
To: [email protected]
Subject: Merging multiple Pandas dataframes
Hi,
I am iteratively receiving files, each of which can only be opened as a Pandas
dataframe. I convert the first such file to a Spark dataframe using the
'createDataFrame' utility function. From the next file onward, I convert each
one and union it into the first Spark dataframe (the schema always stays the
same). After each union, I persist the result in memory
(MEMORY_AND_DISK_ONLY level). Once all the files have been converted into a
single Spark dataframe, I coalesce it, following some tips from this Stack
Overflow post
(https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions).
Any suggestions for optimizing this process further?
--
Saatvik Shah,
1st Year,
Masters in the School of Computer Science,
Carnegie Mellon University
https://saatvikshah1994.github.io/