Why don't we just have the source return a Serializable of state when it
reports offsets? Then Spark could handle storing the source's state and the
source wouldn't need to worry about file system paths. I think that would
be easier for implementations and better for recovery because it wouldn't
leave unknown state on a single machine's file system.

rb

On Fri, Apr 27, 2018 at 9:23 AM, Joseph Torres <joseph.tor...@databricks.com
> wrote:

> The precise interactions with the DataSourceV2 API haven't yet been
> hammered out in design. But much of this comes down to the core of
> Structured Streaming rather than the API details.
>
> The execution engine handles checkpointing and recovery. It asks the
> streaming data source for offsets, and then determines that batch N
> contains the data between offset A and offset B. On recovery, if batch N
> needs to be re-run, the execution engine just asks the source for the same
> offset range again. Sources also get a handle to their own subfolder of the
> checkpoint, which they can use as scratch space if they need. For example,
> Spark's FileStreamReader keeps a log of all the files it's seen, so its
> offsets can be simply indices into the log rather than huge strings
> containing all the paths.
>
> SPARK-23323 is orthogonal. That commit coordinator is responsible for
> ensuring that, within a single Spark job, two different tasks can't commit
> the same partition.
>
> On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh <
> jthak...@conversantmedia.com> wrote:
>
>> Wondering if this issue is related to SPARK-23323?
>>
>>
>>
>> Any pointers will be greatly appreciated….
>>
>>
>>
>> Thanks,
>>
>> Jayesh
>>
>>
>>
>> *From: *"Thakrar, Jayesh" <jthak...@conversantmedia.com>
>> *Date: *Monday, April 23, 2018 at 9:49 PM
>> *To: *"dev@spark.apache.org" <dev@spark.apache.org>
>> *Subject: *Datasource API V2 and checkpointing
>>
>>
>>
>> I was wondering when checkpointing is enabled, who does the actual work?
>>
>> The streaming datasource or the execution engine/driver?
>>
>>
>>
>> I have written a small/trivial datasource that just generates strings.
>>
>> After enabling checkpointing, I do see a folder being created under the
>> checkpoint folder, but there's nothing else in there.
>>
>>
>>
>> Same question for write-ahead and recovery?
>>
>> And on a restart from a failed streaming session - who should set the
>> offsets?
>>
>> The driver/Spark or the datasource?
>>
>>
>>
>> Any pointers to design docs would also be greatly appreciated.
>>
>>
>>
>> Thanks,
>>
>> Jayesh
>>
>>
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to