[
https://issues.apache.org/jira/browse/SPARK-53784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Boyang Jerry Peng updated SPARK-53784:
--------------------------------------
Description:
Currently in Structured Streaming, start and end offsets are determined at the
driver prior to running the micro-batch. In real-time mode, end offsets are
not known apriori. They
are communicated to the driver later by the executors at the end of microbatch
that runs for a fixed amount of time. Thus, we need to add additional APIs in
the source to support this kind of behavior.
The lifecycle of the new API is the following
# prepareForRealTimeMode
** Called during logical planning to inform the source if it's in real time
mode
# planInputPartitions
** The driver plans partitions via planPartitions but only a starting offset
is provided (Compared to existing execution modes that require planPartitions
to provide both a starting and end offset)
# mergeOffsets
** Merge partitioned offsets coming from partitions/tasks to a single global
offset.
was:
Currently in Structured Streaming, start and end offsets are determined at the
driver prior to running the micro-batch. In real-time mode, end offsets are
not known apriori. They
are communicated to the driver later by the executors at the end of microbatch
that runs for a fixed amount of time. Thus, we need to add additional APIs in
the source to support this kind of behavior.
The lifecycle of the new API is the following
# prepareForRealTimeMode
**
Called during logical planning to inform the source if it's in real time mode
# planInputPartitions
** The driver plans partitions via planPartitions but only a starting offset
is provided (Compared to existing execution modes that require planPartitions
to provide both a starting and end offset)
#
mergeOffsets
##
Merge partitioned offsets coming from partitions/tasks to a single global
offset.
> Additional Source APIs needed to support RTM execution
> ------------------------------------------------------
>
> Key: SPARK-53784
> URL: https://issues.apache.org/jira/browse/SPARK-53784
> Project: Spark
> Issue Type: Story
> Components: Structured Streaming
> Affects Versions: 4.1.0
> Reporter: Boyang Jerry Peng
> Priority: Major
>
> Currently in Structured Streaming, start and end offsets are determined at
> the driver prior to running the micro-batch. In real-time mode, end offsets
> are not known apriori. They
> are communicated to the driver later by the executors at the end of
> microbatch that runs for a fixed amount of time. Thus, we need to add
> additional APIs in the source to support this kind of behavior.
> The lifecycle of the new API is the following
> # prepareForRealTimeMode
> ** Called during logical planning to inform the source if it's in real time
> mode
> # planInputPartitions
> ** The driver plans partitions via planPartitions but only a starting offset
> is provided (Compared to existing execution modes that require planPartitions
> to provide both a starting and end offset)
> # mergeOffsets
> ** Merge partitioned offsets coming from partitions/tasks to a single global
> offset.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]