Thanks a lot for the detailed reply. Responses inline:

> Agree -- it's easiest if there's only one message at a time being sent to
> the child process. Though we should benchmark that to make sure that
> performance is still good.
>

Sounds good to me. The latency of the system calls + marshaling might turn
out to be significant.


> With one message at a time, a task always has to write something to stdout
> for every message it consumes, even if it doesn't want to emit an output
> for a particular input message -- otherwise Samza wouldn't know when to
> send the next message to the task.
>

That's true, I didn't think of that. It shouldn't be hard though.
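
To convince myself, here is a rough sketch of what the task side might look like. Everything in it is an assumption on my part rather than something we've agreed on: one JSON object per line, a made-up {"cmd": "send"} message for output, a made-up {"cmd": "ack"} message meaning "done with this input", and a placeholder handle() function standing in for the user's logic.

```python
import io
import json
import sys


def handle(message):
    """Hypothetical user logic: return an output dict, or None to emit nothing."""
    if message.get("skip"):
        return None
    return {"echo": message}


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Consume one JSON message per line and always answer with an ack."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        result = handle(json.loads(line))
        if result is not None:
            # "send" is an invented command name for emitting an output message.
            stdout.write(json.dumps({"cmd": "send", "message": result}) + "\n")
        # Written for every input, even when there is no output, so that the
        # parent knows it can send the next message.
        stdout.write(json.dumps({"cmd": "ack"}) + "\n")
        stdout.flush()
```

A real task script would just call run() at the bottom. The explicit flush matters: without it the ack could sit in a buffer and the parent would wait forever.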


> Another thing to think about: do we want one child process per task, or
> one per container? One per task is a simpler processing model (matches the
> Java API), but one per container perhaps makes more sense from a resource
> allocation point of view.
>

I think one per task. If the external process is stateful, that state
should be limited to a single partition. But I don't understand Samza well
enough to say whether this is the overriding concern. Also, like you say,
the programming model is much nicer.


> Yeah, I think allowing access to the KV store via stdin/stdout protocol
> makes the most sense. For example, to make a "get" request to the store,
> the task could write to stdout:
>
> {"cmd": "kv_get", "store": "my-store", "key": "foo"}
>
> to which Samza would respond by sending to stdin:
>
> {"cmd": "kv_get_response", "store": "my-store", "key": "foo", "value": "bar"}
>

Sounds good to me.
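
From the task's point of view, I imagine a small helper along these lines. It is only a sketch using the message shapes from your example; the one-JSON-object-per-line framing and the assumption that the response is the very next line on stdin (which should hold if only one message is in flight at a time) are mine.

```python
import io
import json
import sys


def kv_get(store, key, stdin=sys.stdin, stdout=sys.stdout):
    """Ask the parent for a key and block until the response arrives."""
    request = {"cmd": "kv_get", "store": store, "key": key}
    stdout.write(json.dumps(request) + "\n")
    stdout.flush()

    # Assumption: nothing else is sent to stdin before the response.
    response = json.loads(stdin.readline())
    if (
        response.get("cmd") != "kv_get_response"
        or response.get("store") != store
        or response.get("key") != key
    ):
        raise ValueError("unexpected response: %r" % (response,))
    return response.get("value")
```

With the messages from your example, kv_get("my-store", "foo") would return "bar".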

Thanks again,
Dave
