[ https://issues.apache.org/jira/browse/SPARK-56451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-56451:
-----------------------------------
Labels: pull-request-available (was: )
> Document how SDP datasets are stored and refreshed
> --------------------------------------------------
>
> Key: SPARK-56451
> URL: https://issues.apache.org/jira/browse/SPARK-56451
> Project: Spark
> Issue Type: Documentation
> Components: Declarative Pipelines
> Affects Versions: 4.1.1
> Reporter: Noritaka Sekiyama
> Priority: Major
> Labels: pull-request-available
>
> The Spark Declarative Pipelines (SDP) programming guide does not explain how
> datasets are stored and refreshed internally. The following information is
> missing:
> * *Default table format*: SDP creates tables in Spark's default data source
> format (the value of spark.sql.sources.default, which is parquet unless
> overridden), but this is not documented.
> * *Materialized view refresh behavior*: Materialized views are fully
> recomputed (TRUNCATE + append) on every pipeline run, which is
> fundamentally different from database-native materialized views. The guide
> does not mention this difference.
> * *Streaming table checkpoint requirement*: Streaming tables require a
> checkpoint directory on a Hadoop-compatible file system, but this is not
> explained.
> * *Full refresh semantics*: The --full-refresh / --full-refresh-all CLI
> options are documented, but their actual effect on each dataset type is not
> described.
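
The refresh behavior listed above can be illustrated with a small toy model. This is plain Python, not Spark code: the MaterializedView and StreamingTable classes, their refresh() method, and the integer checkpoint are hypothetical stand-ins invented for this sketch. In particular, the issue does not say what a full refresh does to each dataset type, so the behavior modeled here (a streaming table drops its data and its checkpoint, a materialized view is unaffected because it is recomputed anyway) is an assumption, not documented SDP behavior.

```python
from dataclasses import dataclass, field


@dataclass
class MaterializedView:
    """Toy materialized view: fully recomputed on every run."""
    rows: list = field(default_factory=list)

    def refresh(self, source, full_refresh=False):
        # full_refresh makes no difference in this model, because every
        # run is already a full recomputation (TRUNCATE + append).
        self.rows.clear()          # TRUNCATE
        self.rows.extend(source)   # append the recomputed result


@dataclass
class StreamingTable:
    """Toy streaming table: only reads input it has not seen yet."""
    rows: list = field(default_factory=list)
    # Stand-in for the checkpoint directory: offset of the next unread record.
    checkpoint: int = 0

    def refresh(self, source, full_refresh=False):
        if full_refresh:
            # ASSUMPTION: a full refresh discards both data and checkpoint.
            self.rows.clear()
            self.checkpoint = 0
        self.rows.extend(source[self.checkpoint:])  # new records only
        self.checkpoint = len(source)


if __name__ == "__main__":
    events = ["a", "b"]
    mv, st = MaterializedView(), StreamingTable()
    mv.refresh(events)
    st.refresh(events)

    events.append("c")           # one new upstream record
    mv.refresh(events)           # rewrites all three rows
    st.refresh(events)           # appends only "c"
    print(mv.rows, st.rows)      # ['a', 'b', 'c'] ['a', 'b', 'c']

    events = ["A", "B", "C"]     # upstream history is rewritten
    mv.refresh(events)
    st.refresh(events)           # checkpoint says nothing is new
    print(mv.rows, st.rows)      # ['A', 'B', 'C'] ['a', 'b', 'c']

    st.refresh(events, full_refresh=True)
    print(st.rows)               # ['A', 'B', 'C']
```

The last two steps show why the distinction matters in this model: a materialized view picks up rewritten upstream data on the next ordinary run, while a streaming table keeps its old rows until its checkpoint is reset.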