You could write your views to Hive or maybe Tachyon.
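
Rough idea (just a sketch; viewDf stands for whatever DataFrame backs the view,
and the table name is a placeholder):

    // assumes the SparkSession was built with .enableHiveSupport()
    // periodically overwrite a Hive table with the latest view contents
    viewDf.write.mode("overwrite").saveAsTable("refdata.latest_snapshot")

    // readers then pick it up again by name
    val latest = spark.table("refdata.latest_snapshot")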

Is the periodically updated data big?
Hemanth Gudela <hemanth.gud...@qvantel.com> wrote on Fri, 21 Apr 2017 at
16:55:

> Being new to spark, I think I need your suggestion again.
>
>
>
> #2 you can always define a batch DataFrame and register it as a view, and
> then run a background thread that periodically creates a new DataFrame with
> updated data and re-registers it as a view with the same name
>
>
>
> I seem to have misunderstood your statement: I tried registering a static
> DataFrame as a temp view (“myTempView”) using createOrReplaceTempView in one
> Spark session, and then tried re-registering another, refreshed DataFrame as
> a temp view with the same name (“myTempView”) in another session. However,
> with this approach I failed to achieve what I’m aiming for, because temp
> views are local to one Spark session.
>
> From Spark 2.1.0 onwards, the global temporary view is a nice feature, but it
> still would not solve my problem, because a global view cannot be updated.
>
>
>
> So after much thinking, I understood that you meant using a background
> process running in the same Spark job that would periodically create a new
> DataFrame and re-register the temp view with the same name, within the
> same Spark session.
>
> Could you please give me some pointers to documentation on how to create
> such an asynchronous background process in a Spark Structured Streaming job?
> Is Scala’s “Futures” API the way to achieve this?
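>
> Something along these lines is what I have in mind (only a rough sketch;
> the connection details, view name and 5-minute interval below are
> placeholders, and spark refers to the active SparkSession):
>
> import java.util.concurrent.{Executors, TimeUnit}
>
> // placeholder connection details for the slowly changing database table
> val jdbcOptions = Map("url" -> "jdbc:postgresql://dbhost:5432/mydb",
>                       "dbtable" -> "customer_info")
>
> // re-read the table and re-register the temp view under the same name
> def refreshStaticView(): Unit =
>   spark.read.format("jdbc").options(jdbcOptions).load()
>     .createOrReplaceTempView("myTempView")
>
> // run the refresh every 5 minutes on a background thread,
> // within the same SparkSession that runs the streaming query
> val scheduler = Executors.newSingleThreadScheduledExecutor()
> scheduler.scheduleAtFixedRate(new Runnable {
>   override def run(): Unit = refreshStaticView()
> }, 0, 5, TimeUnit.MINUTES)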
>
>
>
> Thanks,
>
> Hemanth
>
>
>
>
>
> *From: *Tathagata Das <tathagata.das1...@gmail.com>
>
>
> *Date: *Friday, 21 April 2017 at 0.03
> *To: *Hemanth Gudela <hemanth.gud...@qvantel.com>
>
> *Cc: *Georg Heiler <georg.kf.hei...@gmail.com>, "user@spark.apache.org" <
> user@spark.apache.org>
>
>
> *Subject: *Re: Spark structured streaming: Is it possible to periodically
> refresh static data frame?
>
>
>
> Here are a couple of ideas.
>
> 1. You can set up a Structured Streaming query to update an in-memory table.
>
> Look at the memory sink in the programming guide -
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
>
> So you can query the latest table using the specified table name, and also
> join that table with another stream. However, note that this in-memory
> table is maintained in the driver, so you have to be careful about the
> size of the table.
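>
> For example, something along these lines (a rough sketch; the source format,
> query name and output mode are placeholders / assumptions):
>
> // stream over the slow-moving data (requires a streaming connector for it)
> val slowStream = spark.readStream
>   .format("some-streaming-source")   // placeholder: any supported streaming source
>   .load()
>
> // keep the latest data as an in-memory table on the driver
> val query = slowStream.writeStream
>   .format("memory")
>   .queryName("latest_static")        // table name to query / join against
>   .outputMode("append")
>   .start()
>
> // the table can then be referenced by name and joined with another stream
> val latest = spark.table("latest_static")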
>
>
>
> 2. If you cannot define a streaming query over the slow-moving data due to
> the unavailability of a streaming connector for that data source, then you
> can always define a batch DataFrame and register it as a view, and then run
> a background thread that periodically creates a new DataFrame with updated
> data and re-registers it as a view with the same name. Any streaming query
> that joins a streaming DataFrame with the view will automatically start
> using the most updated data as soon as the view is updated.
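>
> For example, something along these lines (a rough sketch; the JDBC options,
> view name and join key are placeholders, and eventsDf stands for a streaming
> DataFrame that already has a customer_id column):
>
> // batch DataFrame over the slow-moving table, registered as a view;
> // a background thread re-runs these steps periodically to refresh the view
> spark.read
>   .format("jdbc")
>   .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // placeholder
>   .option("dbtable", "customer_info")                    // placeholder
>   .load()
>   .createOrReplaceTempView("static_view")
>
> // the streaming query references the view by name when joining
> val enriched = eventsDf.join(spark.table("static_view"), "customer_id")
>
> Note that the periodic refresh has to run inside the same SparkSession that
> runs the streaming query, since temp views are session-scoped.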
>
>
>
> Hope this helps.
>
>
>
>
>
> On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela <
> hemanth.gud...@qvantel.com> wrote:
>
> Thanks Georg for your reply.
>
> But I’m not sure if I fully understood your answer.
>
>
>
> If you meant joining two streams (one reading from Kafka, and another reading
> from a database table), then I think it’s not possible, because
>
> 1.       According to the documentation
> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>,
> Structured Streaming does not support a database as a streaming source
>
> 2.       Joining between two streams is not possible yet.
>
>
>
> Regards,
>
> Hemanth
>
>
>
> *From: *Georg Heiler <georg.kf.hei...@gmail.com>
> *Date: *Thursday, 20 April 2017 at 23.11
> *To: *Hemanth Gudela <hemanth.gud...@qvantel.com>, "user@spark.apache.org"
> <user@spark.apache.org>
> *Subject: *Re: Spark structured streaming: Is it possible to periodically
> refresh static data frame?
>
>
>
> What about treating the static data as a (slow) stream as well?
>
>
>
> Hemanth Gudela <hemanth.gud...@qvantel.com> wrote on Thu, 20 Apr 2017
> at 22:09:
>
> Hello,
>
>
>
> I am working on a use case where there is a need to join a streaming data
> frame with a static data frame.
>
> The streaming data frame continuously gets data from Kafka topics, whereas
> the static data frame fetches data from a database table.
>
>
>
> However, as the underlying database table is updated often, I must somehow
> refresh my static data frame periodically to get the latest information
> from the underlying database table.
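>
> My current setup looks roughly like this (simplified; broker, topic, table
> and join key names are placeholders, and the value parsing is only
> illustrative):
>
> // streaming data frame from Kafka
> val streamingDf = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "broker:9092")     // placeholder
>   .option("subscribe", "events")                        // placeholder topic
>   .load()
>
> // static data frame from the database table, read once at startup
> val staticDf = spark.read
>   .format("jdbc")
>   .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder
>   .option("dbtable", "customer_info")                   // placeholder
>   .load()
>
> // extract the join key from the Kafka value (real parsing will differ)
> val events = streamingDf.selectExpr("CAST(value AS STRING) AS customer_id")
>
> // stream-static join; staticDf never sees later changes to the table
> val joined = events.join(staticDf, "customer_id")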
>
>
>
> My questions:
>
> 1.       Is it possible to periodically refresh a static data frame?
>
> 2.       If refreshing the static data frame is not possible, is there a
> mechanism to automatically stop & restart the Spark Structured Streaming
> job, so that every time the job restarts, the static data frame gets
> updated with the latest information from the underlying database table?
>
> 3.       If 1) and 2) are not possible, please suggest alternatives to
> achieve my requirement described above.
>
>
>
> Thanks,
>
> Hemanth
>
>
>
