Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Mich Talebzadeh
Well, that versioned table is a set of CDC trail files landed on external storage as immutable data. What happens is that you read the table itself at time T0 and then keep reading committed transaction changes as trail files. Kafka can do that as well. Read the files (CDC changes from say
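
A minimal PySpark sketch of the pattern described above, treating CDC trail files landed on storage as a file-based stream; the landing path and trail-file schema are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StringType, TimestampType

    spark = SparkSession.builder.appName("cdc-trail-reader").getOrCreate()

    # Schema of the trail files is an assumption for illustration.
    cdc_schema = (StructType()
                  .add("op", StringType())            # I/U/D operation flag
                  .add("device_id", StringType())
                  .add("commit_ts", TimestampType()))

    # Each newly landed (immutable) trail file becomes part of a micro-batch.
    cdc_stream = (spark.readStream
                  .schema(cdc_schema)
                  .format("parquet")
                  .load("s3a://landing/cdc/devices/"))  # hypothetical landing path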

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Eric Beabes
I was looking for something like "Processing Time Temporal Join" in Flink, as described here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/joins.html#processing-time-temporal-join

On Mon, May 3, 2021 at 11:07 AM Mich Talebzadeh wrote:
> You are welcome Yuri.

[Spark Catalog API] Support for metadata Backup/Restore

2021-05-03 Thread Tianchen Zhang
Hi all, Currently the user-facing Catalog API doesn't support backup/restore metadata. Our customers are asking for such functionalities. Here is a usage example: 1. Read all metadata of one Spark cluster 2. Save them into a Parquet file on DFS 3. Read the Parquet file and restore all metadata in
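
A rough sketch of steps 1 and 2 of that workflow using the existing user-facing catalog API; the backup path and column layout are illustrative assumptions, not a proposed API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("catalog-backup").getOrCreate()
    backup_path = "hdfs:///backups/catalog_metadata"   # hypothetical location

    # Steps 1 and 2: collect metadata reachable via the catalog API,
    # then save it to a Parquet file on DFS.
    rows = [(db.name, t.name, t.tableType)
            for db in spark.catalog.listDatabases()
            for t in spark.catalog.listTables(db.name)]

    (spark.createDataFrame(rows, "database string, tableName string, tableType string")
          .write.mode("overwrite").parquet(backup_path))

    # Step 3 (restore) would replay CREATE statements rebuilt from the saved
    # rows; the full DDL is not captured by this sketch.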

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Mich Talebzadeh
You are welcome Yuri. However, I stand corrected :)

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Always nice to learn something new about JDBC. Thanks, Mich 👍

> On 3 May 2021, at 20:54, Mich Talebzadeh wrote:
> I would have assumed that reference data like device_id is pretty static, so a snapshot will do.
> The JDBC connection is lazy, so it will not materialise until

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Mich Talebzadeh
I would have assumed that reference data like device_id is pretty static, so a snapshot will do. The JDBC connection is lazy, so it will not materialise until the join uses it. Then data will be collected from the underlying RDBMS table for COMMITTED transactions. However, this is something that I
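
A minimal sketch of this stream-static JDBC pattern; the JDBC URL, credentials, broker, topic, and column names are all assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col

    spark = SparkSession.builder.appName("stream-static-jdbc").getOrCreate()

    # Static reference side: read the RDBMS table through JDBC.
    ref_df = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://dbhost:5432/iot")  # assumed URL
              .option("dbtable", "devices")
              .option("user", "spark")
              .option("password", "***")
              .load())

    # Streaming side: IOT events from Kafka, parsed from JSON.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")    # assumed broker
              .option("subscribe", "iot-events")                   # assumed topic
              .load()
              .select(from_json(col("value").cast("string"),
                                "device_id STRING, reading DOUBLE").alias("e"))
              .select("e.*"))

    # JDBC is lazy, so the reference table is fetched (committed rows only)
    # when each micro-batch evaluates the join.
    enriched = events.join(ref_df, "device_id", "left")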

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
You can do the enrichment with a stream (events)-static (device table) join. When the device table is a slowly changing dimension (say, changing once a day) and it is in Delta format, then for every micro-batch the stream-static join will rescan the device table and up-to-date device data will
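
A sketch of the Delta variant of that join; the Delta path, broker, topic, and columns are assumptions, and the delta-spark package is required:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col

    spark = SparkSession.builder.appName("stream-static-delta").getOrCreate()

    # Static side: the slowly changing device dimension in Delta format.
    devices = spark.read.format("delta").load("/mnt/dim/devices")  # assumed path

    # Streaming side: device events from Kafka.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "iot-events")
              .load()
              .select(from_json(col("value").cast("string"),
                                "device_id STRING, reading DOUBLE").alias("e"))
              .select("e.*"))

    # The Delta table is re-scanned on every micro-batch, so the join always
    # sees the latest committed device rows without restarting the query.
    enriched = events.join(devices, "device_id", "left")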

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Mich Talebzadeh
Let us look at your Kafka streams. Say we just read them like below. First, read data from the topic:

    schema = (StructType()
              .add("rowkey", StringType())
              .add("ticker", StringType())
              .add("timeissued", TimestampType())
              .add("price", FloatType()))

    try:
        # construct a streaming
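
Filling in the truncated snippet above as a runnable sketch; the broker address, topic name, and JSON field mapping are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType, TimestampType, FloatType

    spark = SparkSession.builder.appName("ticker-stream").getOrCreate()

    schema = (StructType()
              .add("rowkey", StringType())
              .add("ticker", StringType())
              .add("timeissued", TimestampType())
              .add("price", FloatType()))

    try:
        # construct a streaming dataframe on top of the Kafka topic
        streamingDataFrame = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
              .option("subscribe", "tickers")                    # assumed topic
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("t"))
              .select("t.*"))
    except Exception as e:
        print(f"Could not construct the streaming dataframe: {e}")
        raise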

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Eric Beabes
1) The device_id might be different for messages in a batch.
2) It's a streaming application. The IOT messages are getting read in a Structured Streaming job as a "Stream". The Dataframe would need to be updated every hour.

Have you done something similar in the past? Do you have an example to share?

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Mich Talebzadeh
Can you please clarify:

1. Do the IOT messages in one batch have the same device_id, or does every row have a different device_id?
2. The RDBMS table can be read through JDBC in Spark and a dataframe can be created on it. Does that work for you? You do not really need to stream the reference

Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Eric Beabes
I would like to develop a Spark Structured Streaming job that reads messages in a Stream which needs to be “joined” with another Stream of “Reference” data. For example, let’s say I’m reading messages from Kafka coming in from (lots of) IOT devices. Each message has a ‘device_id’. We have a

Re: Broadcast Variable

2021-05-03 Thread Sean Owen
There is just one copy in memory. No different than if you have two variables pointing to the same dict.

On Mon, May 3, 2021 at 7:54 AM Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de> wrote:
> Hi all,
> when broadcasting a large dict containing several million entries to executors

Broadcast Variable

2021-05-03 Thread Bode, Meikel, NMA-CFD
Hi all,

When broadcasting a large dict containing several million entries to executors, what exactly happens when calling bc_var.value within a UDF like:

    ..
    d = bc_var.value
    ..

Does d receive a copy of the dict inside value, or is this handled like a pointer?

Thanks, Meikel
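
A small sketch illustrating Sean's answer: within a worker, d is just another reference to the already-deserialised dict, not a per-call copy. The names and dict contents here are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("broadcast-dict").getOrCreate()

    lookup = {i: f"name-{i}" for i in range(1000)}  # stand-in for millions of entries
    bc_var = spark.sparkContext.broadcast(lookup)

    @udf(StringType())
    def resolve(key):
        d = bc_var.value      # a reference to the worker-local dict, not a copy
        return d.get(key, "unknown")

    spark.range(10).withColumn("name", resolve("id")).show()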