Hi Michael,

Samza is designed for high-throughput, real-time processing. If each message triggers an HTTP request or a call to some other external service, you will not get the same performance as a job that avoids such calls. Technically speaking, though, nothing prevents you from doing this (well, it is discouraged anyway :). Samza does not provide this feature by default, so you may want to be a little cautious when implementing it.
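One cautious way to do this can be sketched as follows. This is a minimal sketch under stated assumptions, not a Samza API: `BlockingBatch`, `lookupAll`, and the `blockingCall` parameter are hypothetical names, and the code uses only `java.util.concurrent`. The idea is to buffer a small batch of messages, run the blocking calls on a bounded thread pool, and wait for every result before returning, so the calling thread never moves on while work is still in flight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration only; it is not part of Samza.
final class BlockingBatch {

    // Runs blockingCall for every key on a bounded pool and returns the
    // results in the same order as the keys. The method returns only after
    // every call has finished, so the caller still behaves synchronously.
    static <K, V> List<V> lookupAll(List<K> keys,
                                    Function<K, V> blockingCall,
                                    int parallelism)
            throws InterruptedException, ExecutionException {
        // Assumption: a pool per batch keeps the sketch short. A long-running
        // job would more likely create the pool once and reuse it.
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<V>> pending = new ArrayList<>();
            for (K key : keys) {
                Callable<V> task = () -> blockingCall.apply(key);
                pending.add(pool.submit(task));
            }
            List<V> results = new ArrayList<>();
            for (Future<V> future : pending) {
                // get() blocks until this call completes and rethrows its
                // failure wrapped in an ExecutionException.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A task could collect, say, 50 URLs and then call `BlockingBatch.lookupAll(urls, url -> fetchTitle(url), 8)`, where `fetchTitle` stands in for whatever HTTP client call you use. Because nothing is left running when the method returns, a batch that takes hundreds of milliseconds per request costs roughly one request time per 8 messages instead of one per message, at the price of holding the batch in memory until it completes.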
Thanks,

Fang, Yan
yanfang...@gmail.com

On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <mikesk...@gmail.com> wrote:

> Hi,
>
> What would be the best approach for doing "blocking" operations in Samza?
>
> For example, we have a Kafka stream of URLs for which we need to gather
> external data via HTTP (such as Alexa rank, the page title and
> headers). Other scenarios include database access and decision making via
> a rule engine.
>
> Samza processes messages in a single thread, and HTTP requests might take
> hundreds of milliseconds. With the single-threaded design the throughput
> would be very limited, which can be solved with an asynchronous approach.
> However, the Samza documentation explicitly states
> "*You are strongly discouraged from using threads in your job’s code*".
>
> It seems that Samza's design suits "data transformation" scenarios very well;
> what is not clear is how well it can support external services.
>
> Thanks,
> Michael Sklyar
>