Two DataFrames do not share cache storage in Spark, so it is immaterial how the two DataFrames are related to each other: each cached DataFrame consumes memory for the data it holds, independently of any other cached DataFrame. So for your A1 and B1 (each holding half of the columns of A and B) you would need extra memory roughly equivalent to half the memory of A and B.
You can check the storage that a DataFrame is consuming in the Spark UI's Storage tab: http://host:4040/storage/

On Thu, Sep 24, 2015 at 5:37 AM, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:
> I have A and B DataFrames.
> A has columns a11, a12, a21, a22
> B has columns b11, b12, b21, b22
>
> I persist them in cache:
> 1. A.cache()
> 2. B.cache()
>
> Then I persist the subsets in cache later:
>
> 3. DataFrame A1 (a11, a12).cache()
>
> 4. DataFrame B1 (b11, b12).cache()
>
> 5. DataFrame AB1 (a11, a12, b11, b12).cache()
>
> Can you please tell me what happens for caching cases 3, 4, and 5 after A
> and B are cached?
> How much more memory do I need compared with caching 1 and 2 only?
>
> Thanks
>
> Jingyu