I have been trying to find out why I am getting strange behavior when running a certain Spark job. The job will error out if I place an action (a .show(1) call) either right after caching the DataFrame or right before writing the DataFrame back to HDFS. There is a very similar post to
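For context, here is a minimal sketch of the job shape being described; the input/output paths, the column name, and the exact spots where .show(1) is inserted are all hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-demo").getOrCreate()

    // Hypothetical input; any columnar source behaves the same way.
    val df = spark.read.parquet("hdfs:///data/logs")

    val cached = df.cache()
    cached.show(1)  // action right after caching -- one reported failure point

    val result = cached.filter("a11 IS NOT NULL")
    result.show(1)  // action right before the write -- the other failure point
    result.write.parquet("hdfs:///data/out")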
I have two DataFrames, A and B.
A has columns a11, a12, a21, a22.
B has columns b11, b12, b21, b22.
I persist them in cache:
1. A.cache()
2. B.cache()
Then I persist the subsets in cache later (a rough sketch of these steps follows the list):
3. DataFrame A1 (a11, a12).cache()
4. DataFrame B1 (b11, b12).cache()
5. DataFrame AB1
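A minimal sketch of that sequence in Scala, assuming A and B come from hypothetical Parquet inputs and AB1 combines the two subsets (the join key is made up here):

    val A = spark.read.parquet("hdfs:///data/A")  // columns a11, a12, a21, a22
    val B = spark.read.parquet("hdfs:///data/B")  // columns b11, b12, b21, b22

    A.cache()  // step 1
    B.cache()  // step 2

    val A1 = A.select("a11", "a12").cache()  // step 3: cached subset of A
    val B1 = B.select("b11", "b12").cache()  // step 4: cached subset of B

    // step 5: AB1 built from the two subsets; the join condition is hypothetical
    val AB1 = A1.join(B1, A1("a11") === B1("b11"))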
Two DataFrames do not share cache storage in Spark. Hence it is immaterial how the two DataFrames are related to each other; both of them are going to consume memory based on the data that they hold. So for your A1 and B1 you would need extra memory, roughly equivalent to half the memory of A and B (since each subset keeps two of the four columns).
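One way to see this for yourself (a sketch; the data and column name are made up) is to cache both a parent and a subset derived from it, then list what Spark is actually holding. Each cached plan appears as its own entry, and the Storage tab of the Spark UI shows the same picture:

    val A  = spark.range(1000000).toDF("a11")  // hypothetical parent
    val A1 = A.select("a11")                   // subset derived from A

    A.cache().count()   // materialize the parent's cache
    A1.cache().count()  // materialize the subset's cache -- a separate entry

    // Two distinct persisted RDDs back the two cache entries
    spark.sparkContext.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"cached RDD $id: ${rdd.getStorageLevel}")
    }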
Thanks Hemant,
I will generate a total report (dfA) with many columns from log data. After the total report (dfA) is done, I will generate many detail reports (dfA1-dfAi) based on subsets of the total report; those detail reports use aggregate and window functions, according to different rules.
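A sketch of that pattern, assuming a cached total report with hypothetical columns (user, ts, bytes), one aggregate detail report, and one window-function detail report:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val dfA = spark.read.parquet("hdfs:///reports/total").cache()  // total report

    // Hypothetical detail report 1: an aggregate per user
    val dfA1 = dfA.groupBy("user").agg(sum("bytes").as("total_bytes")).cache()

    // Hypothetical detail report 2: top-10 rows per user via a window rank
    val w = Window.partitionBy(col("user")).orderBy(col("bytes").desc)
    val dfA2 = dfA
      .select("user", "ts", "bytes")
      .withColumn("rank", row_number().over(w))
      .where(col("rank") <= 10)
      .cache()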