[ 
https://issues.apache.org/jira/browse/SAMZA-863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180577#comment-15180577
 ] 

Yi Pan (Data Infrastructure) commented on SAMZA-863:
----------------------------------------------------

[~xinyu], thanks for the design doc and sorry for the late review. It looks 
great to me. I have just a few additional points, given that 
[~cpettitt-linkedin] and you have already discussed a lot:

# What’s the sequence of wrappedTask.init() and ThreadedStreamTask.init()? I 
assume that ThreadedStreamTask is not public API which we will control the 
implementation of its init()?
# What’s the user code vs what’s Samza framework code? Is the implementation of 
ThreadedStreamTask and ParSeqStreamTask being “reference implementation” of 
user code? Or the interface AsyncStreamTask is more of SPI instead of API and 
framework developer should implement it as a service/lib to be used by 
application developers?
# As for the callbacks not triggered issue, I think that we should implement 
the timeout mechanism in the TaskCallback. Depending on whether the 
implementation enforces in-order execution of the callbacks, the failure of the 
first callback may or may not trigger the failure of all pending callbacks. 
This should be implemented in SamzaContainer or TaskInstance class as well.
# For the race condition between window and process as [~cpettitt-linkedin] 
pointed out, I think to retain the current semantic may be critical for 
existing users. If user choose to implement WindowableTask interface, we should 
make sure that all pending process() are done before we invoke user implemented 
window() method. With the assumption that a) the window() method is not invoked 
often; b) timeout on callbacks will not block window() invocation forever, I 
think that would be a reasonable solution. Also, do we see any need for 
AsyncWindowableTask? We probably can mentioned it as "extra features" not in 
the first design.

Thanks!

> Support multi-threading in samza tasks
> --------------------------------------
>
>                 Key: SAMZA-863
>                 URL: https://issues.apache.org/jira/browse/SAMZA-863
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Xinyu Liu
>            Assignee: Xinyu Liu
>         Attachments: DESIGN-SAMZA-863-0.pdf, DESIGN-SAMZA-863-1.pdf
>
>
> Currently a samza container executes the tasks sequentially in a single 
> thread. For example, we have message 1 and 2 in the pending queue for task 1 
> and task 2. Task 1 will process message 1, and until its completion task 2 
> can process message 2. If we want to handle more messages in parallel, we 
> have to increase the container count, e.g. from 1 to 2 in the example.
> While this solution has been working for many CPU-bound job scenarios, we do 
> see its drawback for IO-bound jobs.In this kind of jobs, the task makes 
> IO/Network requests, i.e, db calls, rest calls or external service RPC calls. 
> These IO calls significantly slow down the task processing. We can increase 
> container number in order to parallelize the IO calls, but it results in low 
> CPU utilization. If we can improve CPU utilization by allocating multiple 
> contains in the same CPU core, it will still cause dramatic memory growth due 
> to the memory being allocated for each container.
> To better scale the performance of IO-bound jobs, we are proposing to support 
> multi-threaded processing in samza. The design proposal will come soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to