[ https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shixiong Zhu resolved SPARK-16963. ---------------------------------- Resolution: Fixed Assignee: Frederick Reiss Fix Version/s: 2.1.0 2.0.2 > Change Source API so that sources do not need to keep unbounded state > --------------------------------------------------------------------- > > Key: SPARK-16963 > URL: https://issues.apache.org/jira/browse/SPARK-16963 > Project: Spark > Issue Type: Sub-task > Components: Streaming > Affects Versions: 2.0.0, 2.0.1 > Reporter: Frederick Reiss > Assignee: Frederick Reiss > Fix For: 2.0.2, 2.1.0 > > > The version of the Source API in Spark 2.0.0 defines a single getBatch() > method for fetching records from the source, with the following Scaladoc > comments defining the semantics: > {noformat} > /** > * Returns the data that is between the offsets (`start`, `end`]. When > `start` is `None` then > * the batch should begin with the first available record. This method must > always return the > * same data for a particular `start` and `end` pair. > */ > def getBatch(start: Option[Offset], end: Offset): DataFrame > {noformat} > These semantics mean that a Source must retain all past history for the > stream that it backs. Further, a Source is also required to retain this data > across restarts of the process where the Source is instantiated, even when > the Source is restarted on a different machine. > These restrictions make it difficult to implement the Source API, as any > implementation requires potentially unbounded amounts of distributed storage. > See the mailing list thread at > [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html] > for more information. > This JIRA will cover augmenting the Source API with an additional callback > that will allow Structured Streaming scheduler to notify the source when it > is safe to discard buffered data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org