Re: Access to live data of cached dataFrame
When you cache a DataFrame, you actually cache its logical plan. That's why re-creating the DataFrame doesn't work: Spark finds that the logical plan is already cached and picks up the cached data. You need to uncache the DataFrame, or go back to the SQL way:

  df.createTempView("abc")
  spark.table("abc").cache()
  df.show                   // returns latest data
  spark.table("abc").show   // returns cached data

On Mon, May 20, 2019 at 3:33 AM Tomas Bartalos wrote:
> So the question is how should I re-read without losing the cached data
> (without using unpersist)?
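The two behaviors described in this reply (the cache is keyed on the logical plan, and unpersist releases it) can be sketched with a built-in range source standing in for the Delta table. Everything here is illustrative: the `event_hour` column is synthesized from `id`, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession

object UncacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("uncache-demo")
      .getOrCreate()

    // Stand-in for the Delta-backed aggregation in the thread.
    val df = spark.range(100).selectExpr("id % 10 AS event_hour")
      .groupBy("event_hour").count()

    df.cache()
    df.count() // materialize the cache

    // A freshly built, identical query has the same canonical logical plan,
    // so Spark serves it from the cache:
    val again = spark.range(100).selectExpr("id % 10 AS event_hour")
      .groupBy("event_hour").count()
    assert(again.queryExecution.executedPlan.toString.contains("InMemoryTableScan"))

    // After unpersist, the same query reads the source again:
    df.unpersist(blocking = true)
    val fresh = spark.range(100).selectExpr("id % 10 AS event_hour")
      .groupBy("event_hour").count()
    assert(!fresh.queryExecution.executedPlan.toString.contains("InMemoryTableScan"))

    spark.stop()
  }
}
```

Inspecting `executedPlan` is one way to check whether a query was rewritten to use the in-memory relation; `df.explain()` shows the same thing interactively.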
Re: Access to live data of cached dataFrame
I'm trying to re-read, however I'm getting cached data (which is a bit confusing). For the re-read I'm issuing:

  spark.read.format("delta").load("/data").groupBy(col("event_hour")).count

The cache seems to be global, influencing new dataframes as well.

So the question is: how should I re-read without losing the cached data (without using unpersist)?

As I mentioned, with SQL it's possible - I can create a cached view, so when I access the original table I get live data, and when I access the view I get cached data.

BR,
Tomas

On Fri, 17 May 2019, 8:57 pm Sean Owen, wrote:
> A cached DataFrame isn't supposed to change, by definition.
> You can re-read each time or consider setting up a streaming source on
> the table which provides a result that updates as new data comes in.
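The "global" behavior described above can be inspected through the Catalog API. A small sketch, assuming a local SparkSession and an illustrative temp view name:

```scala
import org.apache.spark.sql.SparkSession

object CatalogCacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalog-cache-demo")
      .getOrCreate()

    spark.range(10).createOrReplaceTempView("events")
    spark.table("events").cache()
    spark.table("events").count() // materialize the cache

    // The cached data lives in a session-wide cache manager, not inside any
    // one DataFrame object - which is why new DataFrames can pick it up.
    assert(spark.catalog.isCached("events"))

    spark.catalog.clearCache() // drops every cached plan globally
    assert(!spark.catalog.isCached("events"))

    spark.stop()
  }
}
```

`spark.catalog.isCached` only takes a table or view name, which is another reason the view-based workaround is convenient: it gives the cached plan a name you can query and manage.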
Re: Access to live data of cached dataFrame
A cached DataFrame isn't supposed to change, by definition.
You can re-read each time, or consider setting up a streaming source on the table, which provides a result that updates as new data comes in.

On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote:
>
> Hello,
>
> I have a cached dataframe:
>
>   spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
>
> I would like to access the "live" data for this data frame without
> deleting the cache (using unpersist()). Whatever I do, I always get the
> cached data on subsequent queries. Even adding a new column to the query
> doesn't help:
>
>   spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
>
> I'm able to work around this using a cached SQL view, but I couldn't find
> a pure DataFrame solution.
>
> Thank you,
> Tomas
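The streaming suggestion can be sketched with the built-in "rate" source standing in for the Delta table (a real setup would call readStream on the table itself); the memory sink and query name here are illustrative, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

object StreamingCountsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("streaming-counts-demo")
      .getOrCreate()

    // "rate" emits (timestamp, value) rows continuously; it stands in for
    // spark.readStream.format("delta").load("/data") in the thread.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val counts = stream.groupBy(window(col("timestamp"), "10 seconds")).count()

    val query = counts.writeStream
      .outputMode("complete")     // aggregations support complete mode
      .format("memory")           // in-memory sink for interactive inspection
      .queryName("live_counts")
      .start()

    Thread.sleep(5000)            // let a few micro-batches run (demo only)
    spark.table("live_counts").show() // reflects the latest processed data

    query.stop()
    spark.stop()
  }
}
```

Each time you query the `live_counts` table, it shows the most recently committed results, which gives the "live view of changing data" effect without touching the caching machinery at all.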