Note that, depending on the number of iterations, the query plan for the
dataframe can become very long, which can cause slowdowns (or even crashes).
A possible solution is to checkpoint the dataframe (or simply save it to disk
and reload it) every once in a while. When reloading from disk, the newly
loaded dataframe's lineage is just the read from disk, so the plan starts over.
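As a rough sketch of the periodic checkpoint idea (not tested against your
setup): the helper name `union_with_periodic_checkpoint` and the `every`
parameter are made up for illustration. It assumes a Spark version where
`DataFrame.checkpoint()` is available and that a checkpoint directory has
already been set with `spark.sparkContext.setCheckpointDir(...)`.

```python
def union_with_periodic_checkpoint(spark_dfs, every=10):
    """Union Spark dataframes one by one, cutting the lineage every `every` inputs.

    Assumption: spark.sparkContext.setCheckpointDir(...) was called beforehand,
    because DataFrame.checkpoint() needs a checkpoint directory.
    Returns None if `spark_dfs` is empty.
    """
    result = None
    for i, df in enumerate(spark_dfs, start=1):
        # The first dataframe starts the result; later ones are unioned in.
        result = df if result is None else result.union(df)
        if i % every == 0:
            # checkpoint() materializes the data and returns a dataframe
            # whose plan no longer contains the earlier unions.
            result = result.checkpoint()
    return result


# Hypothetical usage (needs a running SparkSession named `spark` and an
# iterable of Pandas dataframes named `pandas_frames`):
#
#   spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
#   merged = union_with_periodic_checkpoint(
#       (spark.createDataFrame(pdf) for pdf in pandas_frames), every=10)
```

The value of `every` is a trade-off: checkpointing too often adds I/O, too
rarely lets the plan grow again. Writing to Parquet and reading it back in
place of `checkpoint()` should have a similar effect on the lineage.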
Thanks,
Assaf.
-----Original Message-----
From: saatvikshah1994 [mailto:[email protected]]
Sent: Tuesday, June 20, 2017 2:22 AM
To: [email protected]
Subject: Merging multiple Pandas dataframes
Hi,
I am iteratively receiving files, each of which can only be opened as a Pandas
dataframe. I convert the first file I receive to a Spark dataframe using the
'createDataFrame' utility function. From the next file onward, I convert each
one and union it into the first Spark dataframe (the schema always stays the
same). After each union, I persist the result in memory (MEMORY_AND_DISK
level). Once all the files have been converted into a single Spark dataframe,
I coalesce it, following some tips from this Stack Overflow post
(https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions).
Any suggestions for optimizing this process further?
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]