I'm trying to re-read, but I keep getting cached data (which is a bit
confusing). To re-read I'm issuing:

spark.read.format("delta").load("/data").groupBy(col("event_hour")).count
The cache seems to be global, influencing new dataframes as well. So the
question is: how can I re-read without losing the cached data (without
using unpersist())? As I mentioned, with SQL it's possible - I can create
a cached view, so when I access the original table I get live data, and
when I access the view I get cached data (rough sketch below the quoted
thread).

BR,
Tomas

On Fri, 17 May 2019, 8:57 pm Sean Owen, <sro...@gmail.com> wrote:

> A cached DataFrame isn't supposed to change, by definition.
> You can re-read each time or consider setting up a streaming source on
> the table which provides a result that updates as new data comes in.
>
> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos <tomas.barta...@gmail.com>
> wrote:
> >
> > Hello,
> >
> > I have a cached dataframe:
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
> >
> > I would like to access the "live" data for this data frame without
> > deleting the cache (using unpersist()). Whatever I do, I always get
> > the cached data on subsequent queries. Even adding a new column to the
> > query doesn't help:
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
> >
> > I'm able to work around this using a cached SQL view, but I couldn't
> > find a pure DataFrame solution.
> >
> > Thank you,
> > Tomas
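
P.S. For reference, a rough sketch of the cached-view workaround I
mentioned, assuming a spark-shell session where spark is in scope (the
view name hourly_counts is just a placeholder):

  import org.apache.spark.sql.functions.col

  // Register the aggregation as a temp view and cache the view,
  // instead of calling .cache on the DataFrame itself:
  spark.read.format("delta").load("/data")
    .groupBy(col("event_hour")).count()
    .createOrReplaceTempView("hourly_counts")
  spark.sql("CACHE TABLE hourly_counts")

  // Queries against the view serve the cached snapshot:
  spark.sql("SELECT * FROM hourly_counts").show()

  // Re-reading the path directly gave me live data (whether this holds
  // may depend on how your Spark version matches cached plans):
  spark.read.format("delta").load("/data")
    .groupBy(col("event_hour")).count()
    .show()

This way the cached snapshot stays available through the view, and a fresh
read of the path is not substituted by the cache.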