[jira] [Commented] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

Frederick Reiss (JIRA) Mon, 22 Aug 2016 16:21:40 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431779#comment-15431779
 ]


Frederick Reiss commented on SPARK-16963:
-----------------------------------------

The proposed changes in the attached PR are now ready for review. [~marmbrus] 
can you please have a look at your convenience? [~prashant_] can you also 
please have a look with a particular focus on whether the changes fit with the 
MQTT connector?

> Change Source API so that sources do not need to keep unbounded state
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16963
>                 URL: https://issues.apache.org/jira/browse/SPARK-16963
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 2.0.0
>            Reporter: Frederick Reiss
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() 
> method for fetching records from the source, with the following Scaladoc 
> comments defining the semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When 
> `start` is `None` then
>  * the batch should begin with the first available record. This method must 
> always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain all past history for the 
> stream that it backs. Further, a Source is also required to retain this data 
> across restarts of the process where the Source is instantiated, even when 
> the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any 
> implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
>  for more information.
> This JIRA will cover augmenting the Source API with an additional callback 
> that will allow Structured Streaming scheduler to notify the source when it 
> is safe to discard buffered data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

Reply via email to