[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370768#comment-17370768 ]

Zheng Shao commented on SPARK-29038:

[~cltlfcjin] and [~AidenZhang], I also recently started looking at materialized views. This is a huge opportunity for us to improve query performance. It has been almost a year since the last update. Are there any new updates from your side?

> SPIP: Support Spark Materialized View
> -------------------------------------
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Lantao Jin
> Priority: Major
>
> Materialized views are an important approach in DBMSs to cache data and
> accelerate queries. Because a materialized view is created through SQL, the
> data that can be cached is very flexible and can be configured freely
> according to specific usage scenarios. The Materialization Manager
> automatically updates the cached data according to changes in the detail
> source tables, simplifying the user's work. When a user submits a query, the
> Spark optimizer rewrites the execution plan based on the available
> materialized views to determine the optimal execution plan.
> Details in [design doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153170#comment-17153170 ]

AidenZhang commented on SPARK-29038:

Hi [~cltlfcjin], thanks for your reply. Our company is about to implement materialized views in Spark SQL. We are going to extend Catalyst to support query rewrite and to replace tables with materialized views where applicable. The data of a materialized view is stored on HDFS, and its structure information is stored in the Hive metastore. Our plan is to implement materialized view management for Spark SQL based on Hive. There are two people on our team now. Could you please estimate how long it will take to implement this?
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152738#comment-17152738 ]

Lantao Jin commented on SPARK-29038:

Hi [~AidenZhang], my MV work in recent months has focused on two areas. One is optimizing the rewrite algorithm, such as forbidding count distinct post-aggregation and avoiding unnecessary rewrites during relation replacement. The other is bug fixes in MV refresh. We use a Spark listener to deliver the metastore events for refresh. Some parts depend on third-party systems, so perhaps only the interfaces would be available in community Spark. I have not implemented partial/incremental refresh since it is not a blocker for us. I am not sure whether the community is still interested in the feature, but we are moving our existing implementation to Spark 3.0 now.
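One possible reading of "forbidding count distinct post-aggregation" is that a distinct count stored at a fine grouping level cannot simply be summed to answer a coarser query. The following is a minimal sketch in plain Python (not Spark or Catalyst code); the `visits` rows and the helper `count_distinct` are hypothetical and exist only for illustration.

```python
# Hypothetical base rows: (region, day, user_id).
visits = [
    ("east", "mon", "u1"),
    ("east", "mon", "u2"),
    ("east", "tue", "u1"),
    ("west", "mon", "u3"),
]

def count_distinct(rows, key_len):
    """COUNT(DISTINCT user_id) grouped by the first key_len columns."""
    groups = {}
    for row in rows:
        groups.setdefault(row[:key_len], set()).add(row[-1])
    return {key: len(users) for key, users in groups.items()}

# Hypothetical materialized view: distinct users per (region, day).
mv = count_distinct(visits, 2)

# Naive post-aggregation: sum the view's counts per region.
naive = {}
for (region, _day), n in mv.items():
    naive[(region,)] = naive.get((region,), 0) + n

# Correct answer computed from the base rows.
correct = count_distinct(visits, 1)

# User u1 visited "east" on two days, so the naive sum counts u1 twice.
print(naive[("east",)], correct[("east",)])
```

Because the per-day counts overlap on the same user, a rewrite that post-aggregates over this view would return a wrong result, which is a plausible reason to reject such rewrites.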
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152592#comment-17152592 ]

AidenZhang commented on SPARK-29038:

Hi [~cltlfcjin], I have also recently been working on how Spark SQL can support materialized views. Has your plan been completed? If so, could you please share your latest design documents? Looking forward to your reply, thank you!
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938374#comment-16938374 ]

Lantao Jin commented on SPARK-29038:

[~smilegator], yes, Parquet is not good for incremental refresh. Currently our implementation is based on Parquet with full refresh, but it is not strongly coupled to Parquet. Delta would be an option if we enable incremental refresh. I will investigate it.

[~amargoor] I didn't see any SPIP ticket for it, but I am personally +1 on letting Catalyst recognize Hive materialized views. I can help review when it's done. In the meantime, I think it doesn't conflict with building a Spark-native materialized view. We have enabled this feature in production and hope to get more input from the community.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938134#comment-16938134 ]

Amogh Margoor commented on SPARK-29038:

Hi [~cltlfcjin], we have written code for Catalyst to recognize Hive materialized views and substitute them here: [https://github.com/apache/spark/pull/25773/files]. We would love to collaborate if you are planning to create Spark-native materialized views, as we believe a lot of the code we wrote can be reused, especially the optimizer part. Do let us know your thoughts. cc [~karup1990]
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935119#comment-16935119 ]

Xiao Li commented on SPARK-29038:

Building it on Parquet does not perform well for incremental refresh, since Parquet does not support update/delete/merge. Also, Parquet does not guarantee ACID. Thus, I would suggest using a Delta-like data source to implement it.
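The difference between full and incremental refresh can be sketched in plain Python as follows. This is only an illustration under simple assumptions (an append-only source and a SUM aggregate); the function names `full_refresh` and `incremental_refresh` are hypothetical and are not part of Spark, Parquet or Delta.

```python
def full_refresh(base_rows):
    """Recompute the whole view: SUM(amount) GROUP BY key."""
    view = {}
    for key, amount in base_rows:
        view[key] = view.get(key, 0) + amount
    return view

def incremental_refresh(view, new_rows):
    """Merge only the newly appended rows into the existing view.

    Each step is an upsert: update the row if the key exists, insert it
    otherwise. This is the update/merge capability the storage layer
    would need to provide.
    """
    merged = dict(view)
    for key, amount in new_rows:
        merged[key] = merged.get(key, 0) + amount
    return merged

base = [("a", 1), ("b", 2), ("a", 3)]   # rows present at the last refresh
delta = [("a", 10), ("c", 5)]           # rows appended since then

view = full_refresh(base)
refreshed = incremental_refresh(view, delta)

# Both strategies agree; the incremental one touches only changed keys.
print(refreshed == full_refresh(base + delta))
```

With a storage format that cannot update rows in place, the upsert step would have to rewrite whole files, which is where a merge-capable source becomes attractive.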
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928144#comment-16928144 ]

Lantao Jin commented on SPARK-29038:

[~smilegator] Yes, it is physically stored. I will write a detailed document that illustrates the implementation.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928100#comment-16928100 ]

Adrian Wang commented on SPARK-29038:

This seems to duplicate our proposal in SPARK-26764. We have implemented similar features and already have them running in our customer's production environment.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927784#comment-16927784 ]

Xiao Li commented on SPARK-29038:

So far, the doc does not contain enough details. It requires a comprehensive comparison with the corresponding features in other commercial databases. We also need to document how to implement them one by one.

Also, based on my understanding, the materialized view should not be memory-based; it has to be physically stored. Usage of the Spark cache could affect other memory-intensive queries. Any major feature in cache usage requires a memory manager.

I am not against this, but the effort required to support this feature is pretty big.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927374#comment-16927374 ]

Lantao Jin commented on SPARK-29038:

[~smilegator] Sure, we will fully follow ANSI SQL when we commit, although our internal version contains some non-standard parts.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927309#comment-16927309 ]

Lantao Jin commented on SPARK-29038:

Thank you [~jerryshao]. What should I do next? Wait for the end of the review, or send out a vote email to the dev list now?
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927274#comment-16927274 ]

Xiao Li commented on SPARK-29038:

https://www.bwdb2ug.org/Presentations/BWDUG_%20MQT.pps is a reference. It only shows the basic ideas of how it works, but the implementation details are complex.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927270#comment-16927270 ]

Dilip Biswal commented on SPARK-29038:

[~jerryshao] [~smilegator] Thanks.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927268#comment-16927268 ]

Xiao Li commented on SPARK-29038:

We need to follow ANSI SQL if we plan to support materialized views. Materialized views are a well-defined concept in DBMSs.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927265#comment-16927265 ]

Saisai Shao commented on SPARK-29038:

[~cltlfcjin] I think we need a SPIP review and a vote on the dev mailing list before starting the work.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927263#comment-16927263 ]

Saisai Shao commented on SPARK-29038:

IIUC, the key differences between MV and Spark's built-in {{CACHE}} support are:
1. An MV needs to be updated when the source table is updated, which I think Spark's current {{CACHE}} cannot support;
2. A classical MV requires rewriting of the source query based on the existing MV, which I think current Spark doesn't have.

Please correct me if I'm wrong.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927258#comment-16927258 ]

Dilip Biswal commented on SPARK-29038:

[~cltlfcjin] Actually I had a similar question to [~mgaido]. We have been writing the SQL reference for 3.0 and have recently documented {code} CACHE TABLE {code} in [https://github.com/apache/spark/pull/25532]. So in Spark, it is possible to cache the result of a complex query involving joins, aggregates, etc.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927256#comment-16927256 ]

Lantao Jin commented on SPARK-29038:

[~angerszhuuu] Of course, I will contact you offline.
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927243#comment-16927243 ]

angerszhu commented on SPARK-29038:

I am interested in the matching: you create an MV table q1_mv grouped by `l_returnflag, l_linestatus, l_shipdate`, and your query groups by `l_returnflag, l_linestatus`. This may be the most complex part to implement. I wanted to do this in my cache framework, but I couldn't find a good way to do it. Can I contact you on WeChat?
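The rollup case described in this comment can be sketched in plain Python (not Catalyst code). The sample `lineitem` rows and the helpers `aggregate` and `rollup` are assumptions made for illustration; only the column names and `q1_mv` come from the comment. The sketch relies on SUM and COUNT being re-aggregatable: a sum of partial sums equals the total sum.

```python
from collections import defaultdict

# Hypothetical base rows: (l_returnflag, l_linestatus, l_shipdate, l_quantity).
lineitem = [
    ("A", "F", "1995-01-01", 10),
    ("A", "F", "1995-01-02", 5),
    ("N", "O", "1995-01-01", 7),
    ("N", "O", "1995-01-01", 3),
    ("N", "F", "1995-01-03", 4),
]

def aggregate(rows, key_len):
    """SUM(l_quantity), COUNT(*) grouped by the first key_len columns."""
    acc = defaultdict(lambda: (0, 0))
    for row in rows:
        total, count = acc[row[:key_len]]
        acc[row[:key_len]] = (total + row[-1], count + 1)
    return dict(acc)

def rollup(view, key_len):
    """Re-aggregate a finer-grained view to a coarser grouping key."""
    acc = defaultdict(lambda: (0, 0))
    for key, (total, count) in view.items():
        prev_total, prev_count = acc[key[:key_len]]
        acc[key[:key_len]] = (prev_total + total, prev_count + count)
    return dict(acc)

# View grouped by (l_returnflag, l_linestatus, l_shipdate).
q1_mv = aggregate(lineitem, 3)

# Query grouped by (l_returnflag, l_linestatus), answered two ways.
direct = aggregate(lineitem, 2)
rewritten = rollup(q1_mv, 2)

print(rewritten == direct)
```

The rewrite is valid here because the query's grouping columns are a subset of the view's and every aggregate can be combined from partial results. Averages would need to be derived from the stored sum and count rather than averaged again.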
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927229#comment-16927229 ] Lantao Jin commented on SPARK-29038: [~angerszhuuu] By default, we use Parquet to store the data of a materialized view, but it supports all storage formats that Spark supports. We have implemented most of the matching logic for filters, joins, and aggregates. But it cannot cover all scenarios, like JoinBack, since Spark currently doesn't support PKs or dimensions like other DBMSs (e.g. Oracle). > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]
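The comment above does not describe how the filter matching works, so the following is only a guess at the general idea, written as a minimal pure-Python sketch: a view can serve a query when the view's filter contains the query's filter, and the remaining difference is applied as a compensating filter on the view rows. The `Range` predicate model, the `l_quantity` column, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    """Hypothetical predicate model: column value in the half-open interval [lo, hi)."""
    column: str
    lo: int
    hi: int


def contains(view: Range, query: Range) -> bool:
    """True if every row accepted by the query filter is also accepted by the view filter."""
    return view.column == query.column and view.lo <= query.lo and query.hi <= view.hi


def rewrite(view_rows: list, view_filter: Range, query_filter: Range) -> Optional[list]:
    """Answer the query from the view rows, or return None if the view cannot be used."""
    if not contains(view_filter, query_filter):
        return None  # fall back to the detail table
    col = query_filter.column
    # Compensating filter: narrow the view rows down to the query's range.
    return [r for r in view_rows if query_filter.lo <= r[col] < query_filter.hi]


detail = [{"l_quantity": q} for q in (1, 5, 10, 20, 30)]
view_filter = Range("l_quantity", 0, 25)
view_rows = [r for r in detail if view_filter.lo <= r["l_quantity"] < view_filter.hi]

usable = rewrite(view_rows, view_filter, Range("l_quantity", 5, 15))
not_usable = rewrite(view_rows, view_filter, Range("l_quantity", 10, 40))
```

Here `usable` holds the rows with quantities 5 and 10, while `not_usable` is `None` because the second query asks for rows the view never stored. Real predicates are arbitrary expressions, so an actual implementation would need a far more general containment check than this single-column range test.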
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927217#comment-16927217 ] angerszhu commented on SPARK-29038: --- [~cltlfcjin] *Precalculating is a little like CarbonData's DataMap.* *Have you implemented the whole matching logic?* > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927214#comment-16927214 ] Lantao Jin commented on SPARK-29038: [~mgaido] IIUC, there is no "query caching" in Spark, not even a result cache. But Spark natively supports an RDD-level cache. Multiple jobs can share a cached RDD. The cached RDD is closer to the calculation result and requires less computation. In addition, a file-system-level cache such as HDFS cache or Alluxio can also load data into memory in advance, improving data processing efficiency. But a materialized view is actually a technology for *precalculating* summaries. Summaries are special types of aggregate views that improve query execution times by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. The query optimizer transparently rewrites the request to use the materialized view. Queries go directly to the materialized view, which has been materialized to storage such as HDFS, and not to the underlying detail tables. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan.
> Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926658#comment-16926658 ] angerszhu commented on SPARK-29038: --- I am working on a similar framework. It can trigger caching of a SQL query's sub-query data when certain conditions are satisfied, and when a new SQL query arrives, it checks the LogicalPlan; if part of it matches, it rewrites the LogicalPlan to use the cached data. It currently supports caching data in memory and in Alluxio. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]
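The plan check described in the comment above (find a part of the new query's LogicalPlan that matches cached data, then rewrite the plan to read from the cache) might look roughly like the sketch below. It is a toy model in pure Python, not Spark's LogicalPlan API and not the commenter's framework: plans are nested tuples, matching is exact structural equality, and the `CachedScan` node and `cached_0` name are invented for illustration.

```python
def rewrite_plan(plan, cache):
    """Replace any subtree that equals a cached plan with a scan of the cached data.

    `plan` is a nested tuple (operator, arg_or_child, ...); `cache` maps a cached
    subtree to the (hypothetical) name under which its result is stored.
    """
    if plan in cache:
        return ("CachedScan", cache[plan])
    if isinstance(plan, tuple):
        op, *rest = plan
        # Recurse into arguments and children; plain strings pass through unchanged.
        return (op, *(rewrite_plan(node, cache) for node in rest))
    return plan


# A sub-query whose result is assumed to have been cached earlier.
cached_subplan = ("Filter", "l_shipdate < '1998-09-01'", ("Scan", "lineitem"))
cache = {cached_subplan: "cached_0"}

# A new query that contains the cached sub-query as a subtree.
new_query = ("Aggregate", "sum(l_quantity)", cached_subplan)
rewritten = rewrite_plan(new_query, cache)

# A query with a different filter shares no subtree with the cache and is left unchanged.
other_query = ("Aggregate", "sum(l_quantity)",
               ("Filter", "l_shipdate < '1997-01-01'", ("Scan", "lineitem")))
unchanged = rewrite_plan(other_query, cache)
```

Exact equality stands in here for whatever equivalence test a real system would use. Two plans can be written differently and still produce the same result, so a practical implementation would need to normalize plans before comparing them.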
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926650#comment-16926650 ] Marco Gaido commented on SPARK-29038: - [~cltlfcjin] currently Spark has something similar, which is query caching, where the user can also select the level of caching performed. My understanding is that your proposal is to do something very similar, just with a different, more DB-oriented syntax. Is my understanding correct? > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]