Caching dataframes and overwrite

2017-11-21 Thread Michael Artz
I have been trying to find out why I am getting strange behavior when running a certain Spark job. The job will error out if I place an action (a .show(1) call) either right after caching the DataFrame or right before writing the DataFrame back to HDFS. There is a very similar post to

caching DataFrames

2015-09-23 Thread Zhang, Jingyu
I have DataFrames A and B. A has columns a11, a12, a21, a22; B has columns b11, b12, b21, b22. I persist them in cache: 1. A.cache(), 2. B.cache(). Then I persist subsets in cache later: 3. DataFrame A1 (a11, a12).cache(), 4. DataFrame B1 (b11, b12).cache(), 5. DataFrame AB1

Re: caching DataFrames

2015-09-23 Thread Hemant Bhanawat
Two DataFrames do not share cache storage in Spark, hence it is immaterial how the two DataFrames are related to each other. Both of them are going to consume memory based on the data that they hold. So for your A1 and B1 you would need extra memory equivalent to half the memory of
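The point about independent cache storage can be illustrated with a small sketch (hypothetical data): a cached subset keeps its own storage level, and unpersisting the parent does not affect it.

```python
# Sketch: a parent DataFrame and a cached subset have independent storage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("separate-cache").getOrCreate()

A = spark.createDataFrame([(1, 2)], ["a11", "a12"]).cache()
A1 = A.select("a11").cache()

A.count()   # materialize A's cache
A1.count()  # materialize A1's cache (separate storage from A)

# Unpersisting the parent does not free the subset's cached blocks.
A.unpersist()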

Re: caching DataFrames

2015-09-23 Thread Zhang, Jingyu
Thanks Hemant. I will generate a total report (dfA) with many columns from log data. After the report (dfA) is done, I will generate many detail reports (dfA1-dfAi) based on subsets of the total report (dfA); those detail reports use aggregate and window functions, according to different rules.