[ https://issues.apache.org/jira/browse/SAMZA-459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218365#comment-14218365 ]
Yi Pan (Data Infrastructure) commented on SAMZA-459:
----------------------------------------------------
Had an offline discussion with Chris on the semantics of flush and reached the
following conclusions:
1) Since tasks should be independent of each other, making the flush operation
per task seems natural, i.e. TaskCoordinator.flush() should probably be scoped
to the current task. The main use case for flushing all tasks is container
shutdown, which is already taken care of.
2) There seem to be use cases for a per-SystemStream flush, as illustrated by
the original description. Hence, TaskCoordinator.flush(sysStream: SystemStream)
would flush all messages buffered for the specified system stream in the
current task.
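To make the two proposed scopes concrete, here is a minimal sketch of a per-task
buffer with a flush-everything call and a flush-one-stream call. It is an
illustration only, not Samza code: StreamId and TaskOutputBuffer are
hypothetical names invented for this example, and the "sent" list merely stands
in for the downstream system.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a (system, stream) pair; not Samza's SystemStream class.
final class StreamId {
    final String system;
    final String stream;

    StreamId(String system, String stream) {
        this.system = system;
        this.stream = stream;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof StreamId)) return false;
        StreamId other = (StreamId) o;
        return system.equals(other.system) && stream.equals(other.stream);
    }

    @Override
    public int hashCode() {
        return 31 * system.hashCode() + stream.hashCode();
    }
}

// Hypothetical per-task output buffer. Each task owns its own instance, so a
// flush never touches messages buffered by another task.
final class TaskOutputBuffer {
    private static final class Pending {
        final StreamId target;
        final String message;

        Pending(StreamId target, String message) {
            this.target = target;
            this.message = message;
        }
    }

    private final List<Pending> pending = new ArrayList<>();
    private final List<String> sent = new ArrayList<>(); // stands in for the downstream system

    void send(StreamId target, String message) {
        pending.add(new Pending(target, message));
    }

    // Scope 1: flush everything buffered by this task.
    void flush() {
        for (Pending p : pending) {
            sent.add(p.message);
        }
        pending.clear();
    }

    // Scope 2: flush only this task's messages for one stream.
    void flush(StreamId stream) {
        Iterator<Pending> it = pending.iterator();
        while (it.hasNext()) {
            Pending p = it.next();
            if (p.target.equals(stream)) {
                sent.add(p.message);
                it.remove();
            }
        }
    }

    int pendingCount() {
        return pending.size();
    }

    List<String> sent() {
        return sent;
    }
}
```

In this sketch, calling flush(streamA) on one task's buffer leaves that task's
messages for other streams, and every other task's buffer, untouched.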
However, message ordering is only preserved within a single Kafka partition, so
tasks should not rely on message ordering within a single flush call or across
two consecutive flush calls. E.g. messages sent in one
TaskCoordinator.flush()/TaskCoordinator.flush(sysStream) call that go to
different Kafka partitions may appear in the system in a different order than
the order in which they were put into the buffer; likewise, messages sent in two
consecutive TaskCoordinator.flush()/TaskCoordinator.flush(sysStream) calls may
go to different Kafka partitions and hence may appear in a different order in
the system.
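The ordering caveat can be illustrated with a small model in which each
partition is an independent append-only log. This is a sketch under stated
assumptions, not Kafka or Samza code: PartitionedLog and readInPartitionOrder
are hypothetical names, and the read order passed in is just one of the
interleavings a consumer might observe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a partitioned topic: each partition is an independent
// append-only log, so order is only defined within one partition.
final class PartitionedLog {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedLog(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<String>());
        }
    }

    // Append a message to the end of one partition.
    void append(int partition, String message) {
        partitions.get(partition).add(message);
    }

    // The messages of a single partition, in append order.
    List<String> partition(int p) {
        return partitions.get(p);
    }

    // One possible consumer view: drain whole partitions in the given order.
    List<String> readInPartitionOrder(int... order) {
        List<String> out = new ArrayList<>();
        for (int p : order) {
            out.addAll(partitions.get(p));
        }
        return out;
    }
}
```

For example, suppose a first flush appends "m1" to partition 0 and "m2" to
partition 1, and a second flush appends "m3" to partition 0. Partition 0 still
reads "m1" then "m3", but a consumer that happens to read partition 1 first
sees "m2" before "m1", even though "m1" was buffered earlier.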
> Explicit flush for individual output streams
> --------------------------------------------
>
> Key: SAMZA-459
> URL: https://issues.apache.org/jira/browse/SAMZA-459
> Project: Samza
> Issue Type: Improvement
> Components: container
> Affects Versions: 0.9.0
> Reporter: Ben Kirwin
> Priority: Minor
> Fix For: 0.9.0
>
>
> From the mailing list:
> http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201411.mbox/%3CCACuX-D8-CS7867ob47fqytCAdvGURc4owv82Rhg2oEJYmr8hpg%40mail.gmail.com%3E
> At the moment, the only way to trigger a flush of the output streams is to
> call TaskCoordinator.commit, which also flushes the state and saves the
> checkpoints. There are a few cases where more granularity would be useful:
> writing out a single stream can be much faster than doing a full commit, and
> if a user cares about the order in which messages are published, they can
> disable the autocommit and trigger flushes manually.
> I'd anticipate this to look something like
> TaskCoordinator.flush(systemStream). It looks like the TaskCoordinator
> normally only queues up work, instead of doing it synchronously -- if that's
> the case, it should be enough to buffer up all the requested flushes, then
> perform them in order when the moment comes.
> Note: you could get *almost* the same effect by switching to a synchronous
> system and letting the user send a batch of messages all at once, much as the
> underlying Kafka client does. This wouldn't let you flush a changelog stream,
> though.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)