Hi, I'm running PySpark on my local PC in standalone mode.
After applying a window function to a DataFrame, I ran a groupBy query on it. The groupBy turned out to be very slow (10+ minutes on a small data set). I then cached the DataFrame and re-ran the same query, but it remained just as slow. I could also hear noise from the hard drive, so I assume the PC is busy reading from and writing to disk. Is this a sign that the DataFrame has spilled to disk? What's the best way to monitor what's happening, and how can I avoid it? Thanks!