Hi Karthick,

Unfortunately, materialized views are not yet available in Spark. I raised Jira [SPARK-48117] "Spark Materialized Views: Improve Query Performance and Data Management" <https://issues.apache.org/jira/browse/SPARK-48117> as a feature request.

Let me think about another way and revert.

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Mon, 6 May 2024 at 07:54, Karthick Nk <kcekarth...@gmail.com> wrote:

> Thanks Mich,
>
> Can you please confirm whether my understanding is correct?
>
> First, we have to create the materialized view from the mapping
> details we have, using multiple tables as sources (since we have
> multiple join conditions across different tables). From the
> materialized view, we can then stream the view data into the Elastic
> index by using CDC?
>
> Thanks in advance.
>
> On Fri, May 3, 2024 at 3:39 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> My recommendation is to use materialized views (MVs) created in Hive
>> together with Spark Structured Streaming and Change Data Capture
>> (CDC). This is a good combination for efficiently streaming view data
>> updates in your scenario.
>>
>> HTH
>>
>> On Thu, 2 May 2024 at 21:25, Karthick Nk <kcekarth...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Requirements:
>>> I am working on a data flow that uses a view definition (already
>>> defined in the schema). Multiple tables are used in the view
>>> definition. We want to stream the view data into the Elastic index
>>> whenever the data changes in any of the tables used in the view
>>> definition.
>>>
>>> Current flow:
>>> 1. We insert the ids from each table used in the view definition
>>> into a common table.
>>> 2. From the common table, using the ids, we stream the view data
>>> with Spark Structured Streaming (by checking whether any incoming id
>>> is present in the collective ids of all tables used in the view
>>> definition).
>>>
>>> Issue:
>>> 1. For each incoming id we run the view definition (so it reads all
>>> the data from all the tables) and check whether the incoming id is
>>> present in the collective ids of the view result. As a result, it
>>> uses more memory on the cluster driver and takes more time to
>>> process.
>>>
>>> I am expecting an alternate solution that avoids a full scan of the
>>> view definition every time. If you have an alternate design flow
>>> that achieves the same result, please suggest it.
>>>
>>> Note: It would also be helpful if you could share details of a
>>> community forum or platform for discussing this kind of
>>> design-related topic.
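
For what it is worth, the issue described in the original message (running the whole view and only then checking the incoming ids) can be illustrated with a small plain-Python sketch. It is not Spark code and not a design anyone in this thread proposed. The tables `orders` and `customers`, their columns, and the helper names are hypothetical stand-ins for the tables behind the view. The sketch only shows that filtering the base rows by the changed ids before the join gives the same rows as filtering the full view afterwards.

```python
# Hypothetical base tables; names and columns are illustrative only.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": 40.0},
    {"order_id": 3, "customer_id": 10, "amount": 15.0},
]
customers = {10: {"name": "Asha"}, 11: {"name": "Ben"}}


def join_orders(order_rows):
    """Stand-in for the view definition: join each order to its customer."""
    return [
        {
            "order_id": o["order_id"],
            "customer_id": o["customer_id"],
            "name": customers[o["customer_id"]]["name"],
            "amount": o["amount"],
        }
        for o in order_rows
        if o["customer_id"] in customers
    ]


def full_view():
    """Current flow: evaluate the view over every row (full scan)."""
    return join_orders(orders)


def view_rows_for_changes(changed_order_ids, changed_customer_ids):
    """Alternative: keep only rows touched by a changed id, then join."""
    affected = [
        o
        for o in orders
        if o["order_id"] in changed_order_ids
        or o["customer_id"] in changed_customer_ids
    ]
    return join_orders(affected)


if __name__ == "__main__":
    # Only the rows affected by a change to customer 10 are rebuilt.
    for row in view_rows_for_changes(set(), {10}):
        print(row)
```

In Spark Structured Streaming the same idea would usually live inside a `foreachBatch` function, joining the micro-batch of ids against each base table before applying the view's joins. Whether that filter is actually pushed down to the source depends on the storage format and connector, so it should be checked against the query plan rather than assumed.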