Re: Refreshing a persisted RDD

2017-05-19 Thread Sudhir Menon


-- 
Suds
snappydata.io
503-724-1481 (c)
Real time operational analytics at the speed of thought
Check out the SnappyData iSight cloud for AWS


Re: Refreshing a persisted RDD

2017-05-03 Thread Tathagata Das
Yes, you will have to recreate the streaming DataFrame along with the
static DataFrame, and restart the query. There isn't currently a feasible way
to do this without a query restart. But restarting a query WITHOUT restarting
the whole application + Spark cluster is reasonably fast. If your
application can tolerate 10-second latencies, then stopping and restarting
a query within the same Spark application is a reasonable solution.
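
A minimal sketch of that stop-and-restart loop, spark-shell style. It assumes
the streaming "account" view from the messages below is already registered;
the S3 paths, the parquet sink, and the loadBlacklist helper are hypothetical
stand-ins for the details elided in the thread.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

var blacklist: DataFrame = null

// Hypothetical helper: drop the old cached copy, re-read the blacklist,
// cache it, and re-register the "blacklist" view.
def loadBlacklist(spark: SparkSession): Unit = {
  if (blacklist != null) blacklist.unpersist()
  blacklist = spark.read.csv("s3a://my-bucket/blacklist.csv")  // hypothetical path
  blacklist.persist()
  blacklist.createOrReplaceTempView("blacklist")
}

// Build a fresh streaming query against whatever "blacklist" currently holds;
// "account" is the streaming view registered as in the messages below.
def startQuery(spark: SparkSession): StreamingQuery = {
  val joined = spark.sql(
    "select * from account inner join blacklist " +
      "on account.accountName = blacklist.accountName")
  joined.writeStream
    .format("parquet")                                    // hypothetical sink
    .option("path", "s3a://my-bucket/out")                // hypothetical path
    .option("checkpointLocation", "s3a://my-bucket/chk")  // hypothetical path
    .start()
}

loadBlacklist(spark)
var query = startQuery(spark)

// Later, when the blacklist changes: stop the query, reload the static side, restart.
query.stop()            // returns once the running query has terminated
loadBlacklist(spark)
query = startQuery(spark)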



Re: Refreshing a persisted RDD

2017-05-03 Thread Lalwani, Jayesh
Thanks, TD, for answering this question on the Spark mailing list.

A follow-up: let's say we are joining a cached dataframe with a streaming
dataframe, and we recreate the cached dataframe. Do we have to recreate the
streaming dataframe too?

One possible solution that we have is:

val dfBlackList = spark.read.csv(….) // batch dataframe; assume that this dataframe has a single column named AccountName
dfBlackList.createOrReplaceTempView("blacklist")
val dfAccount = spark.readStream.kafka(…..) // assume for brevity's sake that we have parsed the kafka payload and have a dataframe here with multiple columns, one of them called accountName
dfAccount.createOrReplaceTempView("account")
val dfBlackListedAccount = spark.sql("select * from account inner join blacklist on account.accountName = blacklist.accountName")
dfBlackListedAccount.writeStream(…..).start() // boom, started

Now, some time later, while the query is running, we do:

val dfRefreshedBlackList = spark.read.csv(….)
dfRefreshedBlackList.createOrReplaceTempView("blacklist")

Now, will dfBlackListedAccount pick up the newly created blacklist, or will it
continue to hold a reference to the old dataframe? And what if we had done RDD
operations instead of using Spark SQL to join the dataframes?
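
For context, here is a self-contained version of that setup. The Kafka and CSV
details elided above are replaced by hypothetical placeholders: the broker,
topic, paths, sink, and the value-to-accountName cast are stand-ins, not the
real parsing logic.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("blacklist-join").getOrCreate()

// Static side: blacklist CSV with a single AccountName column (hypothetical path).
val dfBlackList = spark.read.option("header", "true").csv("s3a://my-bucket/blacklist.csv")
dfBlackList.createOrReplaceTempView("blacklist")

// Streaming side: accounts from Kafka. The real payload parsing is elided in the
// thread; casting the Kafka value to a string column stands in for it here.
val dfAccount = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical broker
  .option("subscribe", "accounts")                   // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS accountName")
dfAccount.createOrReplaceTempView("account")

// Stream-static inner join through the registered views.
val dfBlackListedAccount = spark.sql(
  "select * from account inner join blacklist " +
    "on account.accountName = blacklist.accountName")

val query = dfBlackListedAccount.writeStream
  .format("console")   // hypothetical sink for the sketch
  .start()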



Re: Refreshing a persisted RDD

2017-05-03 Thread Tathagata Das
If you want to always get the latest data in files, it's best to always
recreate the DataFrame.
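
In batch terms, that advice amounts to something like the following sketch
(the path is hypothetical), rather than unpersisting and re-persisting the
same DataFrame object:

// First load, cached.
val dfOld = spark.read.csv("/data/blacklist.csv")   // hypothetical path
dfOld.persist()
dfOld.show()

// ... the file changes on disk ...

// Refresh by recreating the DataFrame from the source, not by re-persisting dfOld.
dfOld.unpersist()
val dfNew = spark.read.csv("/data/blacklist.csv")   // same hypothetical path
dfNew.persist()
dfNew.show()   // reads the file fresh, so it reflects the current contents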



Refreshing a persisted RDD

2017-05-03 Thread JayeshLalwani
We have a Structured Streaming application that gets accounts from Kafka into
a streaming data frame. We have a blacklist of accounts stored in S3, and we
want to filter out all the accounts that are blacklisted. So we are loading
the blacklisted accounts into a batch data frame and joining it with the
streaming data frame to filter out the bad accounts.

Now, the blacklist doesn't change very often, once a week at most. So we
wanted to cache the blacklist data frame to avoid going out to S3 every time.
Since the blacklist might change, we want to be able to refresh the cache at a
cadence, without restarting the whole app.

To begin with, we wrote a simple app that caches and refreshes a simple data
frame. The steps we followed are:

Create a CSV file
Load the CSV into a DF: df = spark.read.csv(filename)
Persist the data frame: df.persist
Now when we do df.show, we see the contents of the CSV.
Change the CSV and call df.show again; the old contents are still displayed,
proving that the df is cached.
df.unpersist
df.persist
df.show

What we see is that the rows that were modified in the CSV are reloaded, but
new rows aren't. Is this expected behavior? Is there a better way to refresh
cached data without restarting the Spark application?
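
For reference, the steps above written out as a spark-shell style sketch (the
filename is a hypothetical stand-in):

// Load the CSV and cache it.
val df = spark.read.csv("/data/test.csv")   // hypothetical filename
df.persist()
df.show()   // materializes the cache; shows the file's original contents

// Edit the CSV on disk, then:
df.show()   // answered from the cache, so the old contents still appear

// Attempted in-place refresh of the same DataFrame:
df.unpersist()
df.persist()
df.show()   // the step that behaved unexpectedly: modified rows showed up, new rows did not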



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Refreshing-a-persisted-RDD-tp28642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
