Thanks Joseph!

From: Joseph Torres <joseph.tor...@databricks.com>
Date: Friday, April 27, 2018 at 11:23 AM
To: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Datasource API V2 and checkpointing

The precise interactions with the DataSourceV2 API haven't been fully hammered 
out in the design yet. But much of this comes down to the core of Structured 
Streaming rather than to the API details.

The execution engine handles checkpointing and recovery. It asks the streaming 
data source for offsets, and then determines that batch N contains the data 
between offset A and offset B. On recovery, if batch N needs to be re-run, the 
execution engine just asks the source for the same offset range again. Sources 
also get a handle to their own subfolder of the checkpoint, which they can use 
as scratch space if they need. For example, Spark's FileStreamReader keeps a 
log of all the files it's seen, so its offsets can be simply indices into the 
log rather than huge strings containing all the paths.
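
To make the contract concrete, here is a rough sketch of a trivial 
string-generating source against the Spark 2.3-era interfaces (the engine 
obtains the reader through MicroBatchReadSupport.createMicroBatchReader(schema, 
checkpointLocation, options), which is where the checkpoint-subfolder handle 
comes in). These interfaces are still evolving, and all class names below 
(IndexOffset, StringsMicroBatchReader, StringsReaderFactory) are made up for 
illustration:

import java.util.Optional
import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Offsets are opaque to the engine: it journals their JSON form in its own
// checkpoint log and hands it back through deserializeOffset() on recovery.
// For a generator source, a monotonically increasing index is enough.
case class IndexOffset(index: Long) extends Offset {
  override def json(): String = index.toString
}

class StringsMicroBatchReader extends MicroBatchReader {
  private var start = IndexOffset(0)
  private var end = IndexOffset(0)

  // The engine picks each batch's range. When batch N is re-run after a
  // failure, it passes back exactly the start/end it logged for batch N.
  override def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit = {
    this.start = start.orElse(IndexOffset(0)).asInstanceOf[IndexOffset]
    // If no end is supplied, advertise how far we could read right now
    // (here: pretend ten more strings are always available).
    this.end = end.orElse(IndexOffset(this.start.index + 10)).asInstanceOf[IndexOffset]
  }

  override def getStartOffset(): Offset = start
  override def getEndOffset(): Offset = end
  override def deserializeOffset(json: String): Offset = IndexOffset(json.toLong)

  // Invoked once a batch is durably complete; a real source could
  // garbage-collect data at or below `end` here.
  override def commit(end: Offset): Unit = {}

  override def readSchema(): StructType =
    StructType(StructField("value", StringType) :: Nil)

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    List[DataReaderFactory[Row]](new StringsReaderFactory(start.index, end.index)).asJava

  override def stop(): Unit = {}
}

// One partition that deterministically regenerates the rows for [start, end),
// so re-running a batch over the same offset range yields the same data.
class StringsReaderFactory(start: Long, end: Long) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new DataReader[Row] {
    private var current = start - 1
    override def next(): Boolean = { current += 1; current < end }
    override def get(): Row = Row(s"value-$current")
    override def close(): Unit = {}
  }
}

The key point is that the data for a given offset range must be deterministic: 
the engine's checkpoint only stores offsets, so exactly-once recovery relies on 
the source reproducing the same rows for the same range.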

SPARK-23323 is orthogonal. That commit coordinator is responsible for ensuring 
that, within a single Spark job, two different tasks can't commit the same 
partition.
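
The invariant is easy to picture: when two attempts for the same partition (a 
retry plus a speculative copy, say) both reach their commit point, only the 
first one to ask the driver is allowed to commit, and the other's output is 
discarded. A toy model of that rule, purely illustrative and not Spark's actual 
coordinator code:

// Driver-side bookkeeping: for each partition, the first attempt to ask
// wins the exclusive right to commit; later attempts are refused.
class ToyCommitCoordinator {
  // partition -> attempt number that was authorized to commit
  private val authorized = scala.collection.mutable.Map.empty[Int, Int]

  def canCommit(partition: Int, attemptNumber: Int): Boolean = synchronized {
    authorized.get(partition) match {
      case Some(winner) => winner == attemptNumber
      case None =>
        authorized(partition) = attemptNumber
        true
    }
  }
}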

On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh 
<jthak...@conversantmedia.com> wrote:
Wondering if this issue is related to SPARK-23323?

Any pointers will be greatly appreciated.

Thanks,
Jayesh

From: "Thakrar, Jayesh" 
<jthak...@conversantmedia.com<mailto:jthak...@conversantmedia.com>>
Date: Monday, April 23, 2018 at 9:49 PM
To: "dev@spark.apache.org<mailto:dev@spark.apache.org>" 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>
Subject: Datasource API V2 and checkpointing

I was wondering: when checkpointing is enabled, who does the actual work?
The streaming datasource, or the execution engine/driver?

I have written a small/trivial datasource that just generates strings.
After enabling checkpointing, I do see a folder being created under the 
checkpoint folder, but there's nothing else in there.
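
For context: that subfolder is meant as the source's own scratch space, so it 
stays empty unless the source itself writes to it. A minimal sketch of 
persisting a bit of source state there with the Hadoop FileSystem API (the file 
name and the idea of storing a highest-index marker are made up for 
illustration):

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// checkpointLocation is the per-source subfolder Spark hands to the reader;
// a real source should use the session's hadoopConfiguration rather than a
// fresh Configuration().
def saveSourceState(checkpointLocation: String, highestIndex: Long): Unit = {
  val dir = new Path(checkpointLocation)
  val fs = dir.getFileSystem(new Configuration())
  val out = fs.create(new Path(dir, "sourceState"), true) // true = overwrite
  try out.write(highestIndex.toString.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}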

The same question applies to the write-ahead log and recovery.
And on a restart from a failed streaming session, who should set the offsets:
the driver/Spark, or the datasource?

Any pointers to design docs would also be greatly appreciated.

Thanks,
Jayesh

