Storm would be feasible for your business problem. You could design the topology so that a few bolts do the keyword extraction, another set of bolts does the language detection, and so on. You can apply your Mahout clustering and classification algorithms to the streams of data processed by the bolts.
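
To make the idea of one bolt per step concrete, here is a minimal sketch in plain Python. It does not use the Storm API: the function names (`keyword_bolt`, `language_bolt`, `run_topology`) and the toy extraction and detection rules are assumptions made up for illustration. In Storm itself each stage would be a bolt class wired together with `TopologyBuilder` and run in parallel across workers.

```python
# Plain-Python stand-in for a Storm topology: no Storm API is used here.
from typing import Callable, Dict, Iterable, Iterator

Doc = Dict[str, object]
Bolt = Callable[[Doc], Doc]


def keyword_bolt(doc: Doc) -> Doc:
    # Hypothetical extractor: keep distinct words longer than five characters.
    words = str(doc["text"]).lower().split()
    return {**doc, "keywords": sorted({w for w in words if len(w) > 5})}


def language_bolt(doc: Doc) -> Doc:
    # Hypothetical detector: a real one would call a language-ID library.
    text = f" {str(doc['text']).lower()} "
    return {**doc, "lang": "en" if " the " in text else "unknown"}


def run_topology(spout: Iterable[Doc], bolts: Iterable[Bolt]) -> Iterator[Doc]:
    # Each document flows through every bolt in order, one tuple at a time.
    stages = list(bolts)
    for doc in spout:
        for bolt in stages:
            doc = bolt(doc)
        yield doc


# The list plays the role of a spout; in production it might be a Kafka topic.
docs = [{"id": 1, "text": "Storm processes the stream continuously"}]
results = list(run_topology(docs, [keyword_bolt, language_bolt]))
```

Each stage only adds fields to the document it receives, so steps such as clustering or classification could be appended to the list without touching the earlier ones.
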

My only concern is the data source. If your data comes from a source like Kafka, that would be great. I don't think spouts reading data from files would be the best fit.

Regards,
Padma Ch

On Tue, Oct 7, 2014 at 3:56 PM, Albert Vila <[email protected]> wrote:

> Hi
>
> I just came across Storm when I was trying to find solutions to scale our
> current architecture.
>
> We are currently downloading and processing 6M documents per day from
> online and social media. We have a different workflow for each type of
> document, but some of the steps are keyword extraction, language detection,
> clustering, classification, indexation, .... We are using Gearman to
> dispatch the jobs to workers.
>
> I'm wondering if we could integrate Storm into the current workflow and if
> it's feasible. One of our main discussions is whether we have to go to a
> fully distributed architecture or to a semi-distributed one. I mean,
> distribute everything or process some steps on the same machine (crawling,
> keyword extraction, language detection, indexation). We don't know which
> one scales better; each one has pros and cons.
>
> Now we have a semi-distributed one, as we had network problems given the
> amount of data we were moving around. So now, all documents crawled on
> server X are later dispatched through Gearman to the same server, with all
> data held in a local Memcached.
>
> What do you think?
> Is it feasible to migrate to a Storm cluster?
> Should we take into account the traffic among the Storm cluster?
> Is there a way to isolate some bolts to be processed on the same machine,
> grouped by some field?
>
> Any help or comment will be appreciated. And if someone has had a similar
> problem and has knowledge about the architecture approach, that will be
> more than welcome.
>
> Thanks
>
> Albert
