Re: Merging multiple Pandas dataframes

2017-06-22 Thread Saatvik Shah
ikshah1...@gmail.com] Sent: Tuesday, June 20, 2017 8:50 PM To: Mendelson, Assaf Cc: user@spark.apache.org Subject: Re: Merging multiple Pandas dataframes Hi Assaf, Thanks for the suggestion on checkpointing - I'll need to read up more

RE: Merging multiple Pandas dataframes

2017-06-21 Thread Mendelson, Assaf
, Assaf. From: Saatvik Shah [mailto:saatvikshah1...@gmail.com] Sent: Tuesday, June 20, 2017 8:50 PM To: Mendelson, Assaf Cc: user@spark.apache.org Subject: Re: Merging multiple Pandas dataframes Hi Assaf, Thanks for the suggestion on checkpointing - I'll need to read up more on that. My

Re: Merging multiple Pandas dataframes

2017-06-20 Thread Saatvik Shah
Hi Assaf, Thanks for the suggestion on checkpointing - I'll need to read up more on that. My current implementation seems to be crashing with a GC memory limit exceeded error if I'm keeping multiple persist calls for a large number of files. Thus, I was also thinking about the constant calls to

RE: Merging multiple Pandas dataframes

2017-06-20 Thread Mendelson, Assaf
Note that depending on the number of iterations, the query plan for the dataframe can become long and this can cause slowdowns (or even crashes). A possible solution would be to checkpoint (or simply save and reload the dataframe) every once in a while. When reloading from disk, the newly loaded
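
For readers following the thread, the periodic-checkpoint idea above could look roughly like the sketch below. It is not code from either poster. It assumes that "merging" means a row-wise union of DataFrames with the same schema, that Spark 2.1 or later is in use (where DataFrame.checkpoint() is available), and that a checkpoint directory has already been set with SparkContext.setCheckpointDir. The function name merge_with_checkpoints and its parameters every and combine are made up for illustration.

```python
def merge_with_checkpoints(frames, every=10, combine=None):
    """Fold DataFrames into one, truncating the query plan periodically.

    Hypothetical helper: `every` and `combine` are illustrative names,
    not part of any Spark API.
    """
    if combine is None:
        # Assumption: "merging" is a row-wise union of same-schema frames.
        def combine(left, right):
            return left.union(right)

    merged = None
    for count, frame in enumerate(frames, start=1):
        merged = frame if merged is None else combine(merged, frame)
        if count % every == 0:
            # checkpoint() materializes the data and returns a DataFrame
            # whose plan no longer carries the earlier merge steps.
            merged = merged.checkpoint()
    return merged


# Usage sketch (needs a running SparkSession called `spark`; the path
# below is a placeholder, not one mentioned in the thread):
#   spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
#   result = merge_with_checkpoints(list_of_dataframes, every=20)
```

The same loop could write the intermediate result out and read it back instead of calling checkpoint(), which is the "save and reload" variant mentioned in the message. How often to cut the plan is a tuning choice; the default of 10 here is arbitrary.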