[jira] [Commented] (SAMZA-459) Explicit flush for individual output streams

Yi Pan (Data Infrastructure) (JIRA) Wed, 19 Nov 2014 11:39:13 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218365#comment-14218365
 ]


Yi Pan (Data Infrastructure) commented on SAMZA-459:
----------------------------------------------------

Had a offline discussion with Chris on the semantics of the flush and made the 
following conclusion:
1) since tasks should be independent w/ each other, making the flush operation 
per task seems natural. I.e. the following TaskCoordinator.flush() probably 
should take the scope of the current task. The main use case of calling flush 
on all tasks is in container shutdown and is already taken care of.
2) there seems to be some use case on per systemStream flush, as illustrated by 
the use case in the original description. Hence, the 
TaskCoordinator.flush(sysStream: SystemStream) would flush all messages 
buffered for the specific system stream in the current task. 

However, the message in-order semantics is really just kept within a Kafka 
partition, which the tasks should not be relying on the message ordering within 
the same flush call or two consecutive flush calls. e.g. the messages sent in a 
TaskCoordinator.flush()/TaskCoordinator.flush(sysStream) going to different 
Kafka partitions may appear in the system in a different order than the order 
that they are put into the buffer; the messages sent in two consecutive 
TaskCoordinator.flush()/TaskCoordinator.flush(sysStream) calls may go to 
different Kafka partitions and hence may appear in a different order in the 
system.

> Explicit flush for individual output streams
> --------------------------------------------
>
>                 Key: SAMZA-459
>                 URL: https://issues.apache.org/jira/browse/SAMZA-459
>             Project: Samza
>          Issue Type: Improvement
>          Components: container
>    Affects Versions: 0.9.0
>            Reporter: Ben Kirwin
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> From the mailing list:
> http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201411.mbox/%3CCACuX-D8-CS7867ob47fqytCAdvGURc4owv82Rhg2oEJYmr8hpg%40mail.gmail.com%3E
> At the moment, the only way to trigger a flush of the output streams is to 
> call TaskCoordinator.commit, which also flushes the state and saves the 
> checkpoints. There are a few cases where more granularity would be useful: 
> writing out a single stream can be much faster than doing a full commit, and 
> if a user cares about the order in which messages are published, they can 
> disable the autocommit and trigger flushes manually.
>  I'd anticipate this to look something like 
> TaskCoordinator.flush(systemStream). It looks like the TaskCoordinator 
> normally only queues up work, instead of doing it synchronously -- if that's 
> the case, it should be enough to buffer up all the requested flushes, then 
> perform them in order when the moment comes.
> Note: you could get *almost* the same effect by switching to a synchronous 
> system and letting the user send a batch of messages all at once, much as the 
> underlying Kafka client does. This woudn't let you flush a changelog stream, 
> though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SAMZA-459) Explicit flush for individual output streams

Reply via email to