Hi Albert,

Have a couple of questions:
- You mentioned near real-time. What exactly is your SLA for processing each document?
- Which crawler are you using, and are you looking to bring Hadoop into your overall workflow? You might want to read up on how network traffic is minimized/managed on a Hadoop cluster, since you had run into network issues with your current architecture.

Thanks!

On Thu, Oct 23, 2014 at 12:07 AM, Albert Vila <albert.v...@augure.com> wrote:
> Hi
>
> I'm evaluating Spark Streaming to see if it fits to scale our current
> architecture.
>
> We are currently downloading and processing 6M documents per day from
> online and social media. We have a different workflow for each type of
> document, but some of the steps are keyword extraction, language
> detection, clustering, classification, indexation, ... We are using
> Gearman to dispatch the jobs to workers, and we have some queues on a
> database. Everything is in near real time.
>
> I'm wondering if we could integrate Spark Streaming into the current
> workflow, and whether that's feasible. One of our main discussions is
> whether we should go to a fully distributed architecture or a
> semi-distributed one. I mean, distribute everything, or process some
> steps on the same machine (crawling, keyword extraction, language
> detection, indexation). We don't know which one scales better; each
> one has pros and cons.
>
> Right now we have a semi-distributed one, because we had network
> problems given the amount of data we were moving around. So now, all
> documents crawled on server X are later dispatched through Gearman to
> that same server. What we dispatch through Gearman is only the
> document id, and the document data remains on the crawling server in
> a Memcached, so network traffic is kept to a minimum.
>
> Is it feasible to remove all the database queues and Gearman and move
> to Spark Streaming? We are evaluating adding Kafka to the system too.
> Is anyone using Spark Streaming for a system like ours?
> Should we worry about the network traffic, or is it something Spark
> can manage without problems? Every document is around 50 KB (300 GB a
> day, +/-).
> If we wanted to isolate some steps so they are processed on the same
> machine(s) (or give them priority), is that something we could do
> with Spark?
>
> Any help or comments will be appreciated. And if someone has had a
> similar problem and has knowledge about the architecture approach, it
> would be more than welcome.
>
> Thanks
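
As a very rough sketch of the Kafka + Spark Streaming variant you describe, keeping your id-only dispatch with document bodies staying in Memcached on the crawling servers, something like the code below. The topic name, ZooKeeper quorum, consumer group, and the fetchDoc/extractKeywords helpers are all placeholders for your own pieces, not a definitive implementation:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DocPipelineSketch {

  // Stand-ins for your existing steps -- fetchDoc would be a Memcached
  // get against whichever crawling server holds the body for this id.
  def fetchDoc(id: String): String = s"body-of-$id"
  def extractKeywords(doc: String): Seq[String] = doc.split("\\s+").toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("doc-pipeline-sketch")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5s micro-batches

    // Consume only document ids from Kafka, mirroring your current
    // design where Gearman carries the id, not the document body.
    val ids = KafkaUtils.createStream(
      ssc,
      "zk1:2181",          // ZooKeeper quorum (placeholder)
      "doc-pipeline",      // consumer group (placeholder)
      Map("doc-ids" -> 4)  // topic -> receiver threads (placeholder)
    ).map { case (_, id) => id }

    // Fetch the body and run one workflow step per micro-batch.
    val keywords = ids.map { id => (id, extractKeywords(fetchDoc(id))) }

    keywords.print() // stand-in for indexation / downstream steps

    ssc.start()
    ssc.awaitTermination()
  }
}

One thing to keep in mind with this shape: Spark schedules the map tasks with no notion of which crawling server holds a given body, so unlike your current Gearman routing, the fetchDoc call will often cross the network. At ~50 KB per document that is your ~300 GB/day moving through the cluster, which is why the SLA and crawler questions above matter.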