Idea #2 probably suits my needs better, because

- A streaming query does not have a database source connector yet.

- My source database table is big, so the in-memory table could be too large 
for the driver to handle.

Thanks for the cool ideas, TD!

Regards,
Hemanth

From: Tathagata Das <tathagata.das1...@gmail.com>
Date: Friday, 21 April 2017 at 0.03
To: Hemanth Gudela <hemanth.gud...@qvantel.com>
Cc: Georg Heiler <georg.kf.hei...@gmail.com>, "user@spark.apache.org" 
<user@spark.apache.org>
Subject: Re: Spark structured streaming: Is it possible to periodically refresh 
static data frame?

Here are a couple of ideas.
1. You can set up a Structured Streaming query to update an in-memory table.
Look at the memory sink in the programming guide - 
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
So you can query the latest table using the specified table name, and also 
join that table with another stream. However, note that this in-memory table 
is maintained in the driver, so you have to be careful about the size of the 
table.
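
For instance, a minimal sketch of this in Scala (assuming the slow-moving 
data also arrives on a Kafka topic, since JDBC is not a streaming source; the 
topic, server, and table names below are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MemorySinkSketch").getOrCreate()

// Stream the slow-moving data from a source that has a streaming connector
// (hypothetical Kafka topic and bootstrap server).
val refStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "ref-topic")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Memory sink: the result is kept as an in-memory table on the driver,
// queryable by the name given in queryName().
val query = refStream.writeStream
  .format("memory")
  .queryName("ref_table")
  .outputMode("append")
  .start()

// Query the latest contents by table name, or join them with another stream.
val refDF = spark.table("ref_table")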

2. If you cannot define a streaming query on the slow-moving data due to the 
unavailability of a connector for your data source, then you can always 
define a batch DataFrame and register it as a view, and then run a background 
thread that periodically creates a new DataFrame with updated data and 
re-registers it as a view with the same name. Any streaming query that joins 
a streaming DataFrame with the view will automatically start using the most 
updated data as soon as the view is updated.
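
A minimal sketch of this pattern (the JDBC options, view name, and the 
5-minute refresh interval below are hypothetical placeholders):

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RefreshViewSketch").getOrCreate()

// Reads the table as a batch DataFrame and (re-)registers it as a view.
// The JDBC url, table name, and credentials are placeholders.
def refreshView(): Unit = {
  spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "slow_moving_table")
    .option("user", "user")
    .option("password", "password")
    .load()
    .createOrReplaceTempView("slow_data")
}

refreshView() // initial registration before starting the streaming query

// Background refresh on a fixed schedule (interval is illustrative).
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = refreshView()
}, 5, 5, TimeUnit.MINUTES)

// A streaming query that joins against spark.table("slow_data") will pick
// up the new data as soon as the view is re-registered.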

Hope this helps.


On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela 
<hemanth.gud...@qvantel.com> wrote:
Thanks Georg for your reply.
But I’m not sure if I fully understood your answer.

If you meant joining two streams (one reading from Kafka, and another reading 
from a database table), then I think it's not possible, because

1. According to the documentation 
(http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources), 
Structured Streaming does not support a database as a streaming source.

2. Joining two streams is not yet possible.

Regards,
Hemanth

From: Georg Heiler <georg.kf.hei...@gmail.com>
Date: Thursday, 20 April 2017 at 23.11
To: Hemanth Gudela <hemanth.gud...@qvantel.com>, "user@spark.apache.org" 
<user@spark.apache.org>
Subject: Re: Spark structured streaming: Is it possible to periodically refresh 
static data frame?

What about treating the static data as a (slow) stream as well?

Hemanth Gudela <hemanth.gud...@qvantel.com> wrote on Thu, 20 Apr 2017 at 
22:09:
Hello,

I am working on a use case where there is a need to join a streaming data 
frame with a static data frame.
The streaming data frame continuously gets data from Kafka topics, whereas 
the static data frame fetches data from a database table.

However, as the underlying database table is updated often, I must somehow 
refresh my static data frame periodically to get the latest information from 
the underlying table. A sketch of the setup follows.
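
For reference, a minimal sketch of the current setup (the topic, server, 
JDBC options, and the join column "id" are hypothetical placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StreamStaticJoinSketch").getOrCreate()

// Streaming data frame reading from Kafka (hypothetical topic and server).
val streamingDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

// Static data frame reading from a database table (hypothetical JDBC
// options); this snapshot is taken once at startup and does not reflect
// later updates to the table.
val staticDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "lookup_table")
  .option("user", "user")
  .option("password", "password")
  .load()

// Stream-static join on an assumed common column "id".
val joined = streamingDF.join(staticDF, Seq("id"))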

My questions:

1. Is it possible to periodically refresh the static data frame?

2. If refreshing the static data frame is not possible, is there a mechanism 
to automatically stop and restart the Spark Structured Streaming job, so that 
every time the job restarts, the static data frame gets updated with the 
latest information from the underlying database table?

3. If 1) and 2) are not possible, please suggest alternatives to achieve the 
requirement described above.

Thanks,
Hemanth
