Re: Asynchronous approach and samza

2015-09-21 Thread Michael Sklyar
Thank you for your replies. I understand that making an external blocking request in a single event thread will result in extremely low throughput. However, this can be solved by multithreading and/or an asynchronous approach. It is clear that in any case using external services can never achieve the

Re: Review Request 33453: SAMZA-557 Reuse local state in SamzaContainer on clean shutdown

2015-09-21 Thread Navina Ramesh
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33453/ --- (Updated Sept. 21, 2015, 9:12 a.m.) Review request for samza, Yan Fang, Chris R

Re: Asynchronous approach and samza

2015-09-21 Thread Navina Ramesh
Hi Michael, {quote} Do you mean that in such a case Samza should be combined with another Stream processing framework (such as Storm)? {quote} No. I didn't mean combining it with any other framework. {quote} "the job bootstraps the data from the source" - do you mean that you have a background pro

Re: Asynchronous approach and samza

2015-09-21 Thread Michael Sklyar
Thanks Navina, it is much clearer now. Unfortunately, in our case, we cannot bootstrap the data in advance (we can't pre-fetch all existing URLs' titles and headers). It sounds to me that, if we want to use Samza, we will need a background process that will be synchronized with the main

RE: Asynchronous approach and samza

2015-09-21 Thread Ken Krugler
Hi Michael (& Navina), I don't think you need to create a separate background process, at least for the case of web crawling. The challenge is to efficiently use one Samza process to simultaneously fetch many URLs. Which does increase the complexity of that process's code, as you wind up havi

Re: Asynchronous approach and samza

2015-09-21 Thread Jordan Shaw
Michael, why not just have a pool of workers outside of Samza that push the raw crawler input, or a subset of it, into a Kafka topic, and then have Samza do the compute/stream work? Basically, Samza is not the right tool for what you're suggesting but could be used for downstream work, in my o

Re: Asynchronous approach and samza

2015-09-21 Thread Navina Ramesh
@Ken: I was going to suggest batch processing with Samza, which is pretty much what you just said. Thanks for your valuable input. :) @Michael: I think the pattern I suggested will not work out for your data scale. Following a batch processing model with Samza can fulfill the requirements of your

Re: Review Request 37817: SAMZA-619 - Modify SamzaAppMaster to enable host-affinity

2015-09-21 Thread Yan Fang
> On Sept. 11, 2015, 1:47 a.m., Yan Fang wrote: > > samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java, > > line 80 > > > > > > this getId is for the global container Id, right? > > Navina Ra