[ 
https://issues.apache.org/jira/browse/SPARK-56451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56451:
-----------------------------------
    Labels: pull-request-available  (was: )

> Document how SDP datasets are stored and refreshed
> --------------------------------------------------
>
>                 Key: SPARK-56451
>                 URL: https://issues.apache.org/jira/browse/SPARK-56451
>             Project: Spark
>          Issue Type: Documentation
>          Components: Declarative Pipelines
>    Affects Versions: 4.1.1
>            Reporter: Noritaka Sekiyama
>            Priority: Major
>              Labels: pull-request-available
>
>  The Spark Declarative Pipelines programming guide does not explain how 
> datasets are stored and refreshed internally. Key information missing 
> includes:
>  * *Default table format*: SDP creates tables using Spark's default
> format (parquet via spark.sql.sources.default), but this is not documented.
>  * *Materialized view refresh behavior*: Materialized views perform a
> full recomputation (TRUNCATE + append) on every pipeline run, which is
> fundamentally different from database-native materialized views.
>  * *Streaming table checkpoint requirement*: Streaming tables require a
> checkpoint directory on a Hadoop-compatible file system, but this is not
> explained.
>  * *Full refresh semantics*: The --full-refresh / --full-refresh-all CLI
> options are documented, but the actual effect on each dataset type is not
> described.
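
The refresh behaviors listed in the issue can be illustrated with a toy
in-memory model. This is only a sketch and does not call Spark: the class
names, the integer offset standing in for a checkpoint directory, and the
full_refresh parameter are all hypothetical. In particular, the issue says
the effect of a full refresh is undocumented, so clearing both the stored
rows and the checkpoint of a streaming table is an assumption made here for
illustration.

```python
# Toy model of the refresh semantics named in the issue. No Spark involved;
# every name below is hypothetical and exists only for illustration.


class MaterializedView:
    """Recomputed in full on every run: TRUNCATE, then append."""

    def __init__(self, query):
        self.query = query  # function: source rows -> result rows
        self.rows = []

    def refresh(self, source):
        self.rows.clear()                     # TRUNCATE
        self.rows.extend(self.query(source))  # append the full result


class StreamingTable:
    """Processes only records it has not seen, tracked by a checkpoint."""

    def __init__(self):
        self.rows = []
        # Stand-in for the checkpoint directory: index of the next
        # unread record in the source.
        self.checkpoint = 0

    def refresh(self, source, full_refresh=False):
        if full_refresh:
            # ASSUMPTION: a full refresh discards data and checkpoint.
            self.rows.clear()
            self.checkpoint = 0
        new_records = source[self.checkpoint:]
        self.rows.extend(new_records)
        self.checkpoint = len(source)
        return len(new_records)  # number of records processed this run


source = [1, 2, 3]
mv = MaterializedView(lambda rows: [r * 2 for r in rows])
st = StreamingTable()

mv.refresh(source)
first_run = st.refresh(source)       # reads all three records

source.append(4)
mv.refresh(source)                   # recomputes all four rows
second_run = st.refresh(source)      # reads only the new record
full_run = st.refresh(source, full_refresh=True)  # rereads everything
```

In this model the materialized view does work proportional to the whole
source on every run, while the streaming table does work proportional to the
new records only, which is why it needs a durable checkpoint between runs.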



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
