Re: Access to live data of cached dataFrame

2019-05-21 Thread Wenchen Fan
When you cache a DataFrame, you actually cache its logical plan. That's why
re-creating the DataFrame doesn't work: Spark sees that the logical plan is
already cached and serves the cached data.

You need to uncache the DataFrame, or go back to the SQL way:
df.createTempView("abc")
spark.table("abc").cache()
df.show // returns latest data.
spark.table("abc").show // returns cached data.


On Mon, May 20, 2019 at 3:33 AM Tomas Bartalos wrote:

> I'm trying to re-read, however I'm getting cached data (which is a bit
> confusing). For the re-read I'm issuing:
> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count
>
> The cache seems to be global, also influencing new DataFrames.
>
> So the question is: how should I re-read without losing the cached data
> (without using unpersist)?
>
> As I mentioned, with SQL it's possible - I can create a cached view, so when
> I access the original table I get live data, and when I access the view I get
> cached data.
>
> BR,
> Tomas
>
> On Fri, 17 May 2019, 8:57 pm Sean Owen,  wrote:
>
>> A cached DataFrame isn't supposed to change, by definition.
>> You can re-read each time or consider setting up a streaming source on
>> the table which provides a result that updates as new data comes in.
>>
>> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote:
>> >
>> > Hello,
>> >
>> > I have a cached dataframe:
>> >
>> >
>> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
>> >
>> > I would like to access the "live" data for this DataFrame without
>> > deleting the cache (using unpersist()). Whatever I do, I always get the
>> > cached data on subsequent queries. Even adding a new column to the query
>> > doesn't help:
>> >
>> >
>> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
>> >
>> >
>> > I'm able to work around this using a cached SQL view, but I couldn't find
>> > a pure DataFrame solution.
>> >
>> > Thank you,
>> > Tomas
>>
>


Re: Access to live data of cached dataFrame

2019-05-19 Thread Tomas Bartalos
I'm trying to re-read, however I'm getting cached data (which is a bit
confusing). For the re-read I'm issuing:
spark.read.format("delta").load("/data").groupBy(col("event_hour")).count

The cache seems to be global, also influencing new DataFrames.

So the question is: how should I re-read without losing the cached data
(without using unpersist)?

As I mentioned, with SQL it's possible - I can create a cached view, so when I
access the original table I get live data, and when I access the view I get
cached data.
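
For reference, a rough sketch of that SQL-side workaround. It assumes the Delta
data has been registered as a table; `events` and `events_cached` are
hypothetical names used only for illustration:

// Create and eagerly cache a view over the aggregation.
spark.sql(
  "CACHE TABLE events_cached AS " +
  "SELECT event_hour, count(*) AS cnt FROM events GROUP BY event_hour")

spark.table("events").show()         // original table -> live data
spark.table("events_cached").show()  // cached view    -> cached data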

BR,
Tomas

On Fri, 17 May 2019, 8:57 pm Sean Owen,  wrote:

> A cached DataFrame isn't supposed to change, by definition.
> You can re-read each time or consider setting up a streaming source on
> the table which provides a result that updates as new data comes in.
>
> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote:
> >
> > Hello,
> >
> > I have a cached dataframe:
> >
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
> >
> > I would like to access the "live" data for this DataFrame without
> > deleting the cache (using unpersist()). Whatever I do, I always get the
> > cached data on subsequent queries. Even adding a new column to the query
> > doesn't help:
> >
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
> >
> >
> > I'm able to work around this using a cached SQL view, but I couldn't find a
> > pure DataFrame solution.
> >
> > Thank you,
> > Tomas
>


Re: Access to live data of cached dataFrame

2019-05-17 Thread Sean Owen
A cached DataFrame isn't supposed to change, by definition.
You can re-read each time or consider setting up a streaming source on
the table which provides a result that updates as new data comes in.
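
A rough sketch of that streaming approach, assuming Delta Lake's streaming
source and a memory sink; the query name "live_counts" is just an illustrative
choice:

import org.apache.spark.sql.functions.col

val liveCounts = spark.readStream
  .format("delta")
  .load("/data")
  .groupBy(col("event_hour"))
  .count()

liveCounts.writeStream
  .format("memory")                  // keeps the latest result as an in-memory table
  .queryName("live_counts")
  .outputMode("complete")
  .start()

spark.table("live_counts").show()    // reflects new data as each micro-batch completes

The memory sink is only suitable for small results; for production you would
write the aggregation to a real sink instead.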

On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos  wrote:
>
> Hello,
>
> I have a cached dataframe:
>
> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
>
> I would like to access the "live" data for this DataFrame without deleting
> the cache (using unpersist()). Whatever I do, I always get the cached data on
> subsequent queries. Even adding a new column to the query doesn't help:
>
> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
>
>
> I'm able to work around this using a cached SQL view, but I couldn't find a pure
> DataFrame solution.
>
> Thank you,
> Tomas
