Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22627#discussion_r222818615

--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2709,6 +2935,78 @@ write.stream(aggDF, "memory", outputMode = "complete", checkpointLocation = "pat
 </div>
 </div>
 
+
+## Recovery Semantics after Changes in a Streaming Query
+There are limitations on what changes in a streaming query are allowed between restarts from the
+same checkpoint location. Here are a few kinds of changes that are either not allowed, or
+whose effect is not well-defined. For all of them:
+
+- The term *allowed* means you can do the specified change, but whether the semantics of its effect
+  are well-defined depends on the query and the change.
+
+- The term *not allowed* means you should not do the specified change, as the restarted query is likely
+  to fail with unpredictable errors. `sdf` represents a streaming DataFrame/Dataset
+  generated with `sparkSession.readStream`.
+
+**Types of changes**
+
+- *Changes in the number or type (i.e. different source) of input sources*: This is not allowed.
+
+- *Changes in the parameters of input sources*: Whether this is allowed and whether the semantics
+  of the change are well-defined depends on the source and the query. Here are a few examples.
+
+  - Addition/deletion/modification of rate limits is allowed: `spark.readStream.format("kafka").option("subscribe", "topic")` to `spark.readStream.format("kafka").option("subscribe", "topic").option("maxOffsetsPerTrigger", ...)`
+
+  - Changes to subscribed topics/files are generally not allowed as the results are unpredictable: `spark.readStream.format("kafka").option("subscribe", "topic")` to `spark.readStream.format("kafka").option("subscribe", "newTopic")`
+
+- *Changes in the type of output sink*: Changes between a few specific combinations of sinks
+  are allowed. This needs to be verified on a case-by-case basis. Here are a few examples.
+
+  - File sink to Kafka sink is allowed. Kafka will see only the new data.
+
+  - Kafka sink to file sink is not allowed.
+
+  - Kafka sink changed to foreach, or vice versa, is allowed.
+
+- *Changes in the parameters of the output sink*: Whether this is allowed and whether the semantics of
+  the change are well-defined depends on the sink and the query. Here are a few examples.
+
+  - Changes to the output directory of a file sink are not allowed: `sdf.writeStream.format("parquet").option("path", "/somePath")` to `sdf.writeStream.format("parquet").option("path", "/anotherPath")`
+
+  - Changes to the output topic are allowed: `sdf.writeStream.format("kafka").option("topic", "someTopic")` to `sdf.writeStream.format("kafka").option("topic", "anotherTopic")`
+
+  - Changes to the user-defined foreach sink (that is, the `ForeachWriter` code) are allowed, but the semantics of the change depend on the code.
+
+- *Changes in projection / filter / map-like operations*: Some cases are allowed. For example:
+
+  - Addition / deletion of filters is allowed: `sdf.selectExpr("a")` to `sdf.where(...).selectExpr("a").filter(...)`.
+
+  - Changes in projections with the same output schema are allowed: `sdf.selectExpr("stringColumn AS json").writeStream` to `sdf.select(to_json(...).as("json")).writeStream`.
+
+  - Changes in projections with a different output schema are conditionally allowed: `sdf.selectExpr("a").writeStream` to `sdf.selectExpr("b").writeStream` is allowed only if the output sink allows the schema change from `"a"` to `"b"`.
+
+- *Changes in stateful operations*: Some operations in streaming queries need to maintain
+  state data in order to continuously update the result. Structured Streaming automatically checkpoints
+  the state data to fault-tolerant storage (for example, DBFS, AWS S3, Azure Blob storage) and restores it after restart.

--- End diff --

remove `DBFS`?
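For context on what an *allowed* change looks like in practice, here is a minimal sketch of restarting a query from the same checkpoint location after adding a rate limit to the source. The broker address, topic name, and paths are hypothetical, and it assumes an existing `SparkSession` named `spark`:

```scala
// Run 1: start the query, checkpointing progress to a fixed location.
// Broker address, topic, and paths are hypothetical.
val sdf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topic")
  .load()

sdf.writeStream
  .format("parquet")
  .option("path", "/somePath")
  .option("checkpointLocation", "/checkpointPath")
  .start()

// Run 2, after stopping run 1: adding a rate limit to the Kafka source
// is an allowed change, so the query can restart from the same checkpoint.
val sdf2 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topic")                    // same topic: changing it is not allowed
  .option("maxOffsetsPerTrigger", "10000")         // new rate limit: allowed
  .load()

sdf2.writeStream
  .format("parquet")
  .option("path", "/somePath")                     // same output directory: changing it is not allowed
  .option("checkpointLocation", "/checkpointPath") // same checkpoint location
  .start()
```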