Hi Albert,

I have a couple of questions:

   - You mentioned near real-time. What exactly is your SLA for processing
   each document?
   - Which crawler are you using, and are you looking to bring Hadoop into
   your overall workflow? You might want to read up on how network traffic
   is minimized/managed on a Hadoop cluster, since you ran into network
   issues with your current architecture (see the sketch below).
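
A minimal sketch of the knob behind that, for reference: Spark's scheduler
already prefers running each task where its data lives, and
spark.locality.wait (a standard Spark config) controls how long it waits
for a data-local slot before falling back to shipping data:

    import org.apache.spark.SparkConf

    // "doc-pipeline" is a placeholder app name; the config key is real.
    val conf = new SparkConf()
      .setAppName("doc-pipeline")
      .set("spark.locality.wait", "3000") // milliseconds; 3s is the default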

Thanks!

On Thu, Oct 23, 2014 at 12:07 AM, Albert Vila <albert.v...@augure.com>
wrote:

> Hi
>
> I'm evaluating Spark Streaming to see if it's a fit for scaling our
> current architecture.
>
> We are currently downloading and processing 6M documents per day from
> online and social media. We have a different workflow for each type of
> document, but some of the steps are keyword extraction, language detection,
> clustering, classification, indexing, .... We are using Gearman to
> dispatch jobs to workers, and we keep some queues in a database.
> Everything is in near real time.
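
For reference, a minimal sketch (in Scala, Spark's native language) of how
those per-document steps could map onto DStream transformations.
extractKeywords, detectLanguage, and indexDocument are hypothetical
stand-ins for the real steps, and the socket source is just a placeholder:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DocPipeline {
      // Hypothetical stand-ins for the real per-document steps.
      def extractKeywords(doc: String): Seq[String] =
        doc.split("\\s+").distinct.take(10)
      def detectLanguage(doc: String): String = "en"
      def indexDocument(doc: String, kws: Seq[String], lang: String): Unit = ()

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("doc-pipeline")
        val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

        // Placeholder source; a Kafka stream would slot in here (see below).
        val docs = ssc.socketTextStream("ingest-host", 9999)

        docs
          .map(doc => (doc, extractKeywords(doc), detectLanguage(doc)))
          .foreachRDD { rdd =>
            rdd.foreachPartition { part =>
              part.foreach { case (doc, kws, lang) => indexDocument(doc, kws, lang) }
            }
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }
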
>
> I'm wondering if we could integrate Spark Streaming into the current
> workflow, and whether it's feasible. One of our main discussions is
> whether to go fully distributed or stay semi-distributed; I mean,
> distribute everything, or process some steps on the same machine
> (crawling, keyword extraction, language detection, indexing). We don't
> know which one scales better; each has its pros and cons.
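
That trade-off has a direct Spark analogue, for what it's worth: consecutive
narrow transformations (map, filter, mapPartitions) are pipelined on the
executor that holds each partition, so intermediate results never cross the
network within a stage. mapPartitions makes that grouping explicit and lets
you amortize per-partition setup. Assuming the docs stream and the
hypothetical helpers from the sketch above:

    // All three steps run back-to-back on the executor holding the
    // partition; only the resulting tuples move if a later stage shuffles.
    val enriched = docs.mapPartitions { iter =>
      iter.map { doc =>
        (doc, DocPipeline.extractKeywords(doc), DocPipeline.detectLanguage(doc))
      }
    }
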
>
> Right now we have a semi-distributed one, as we ran into network problems
> given the amount of data we were moving around. So now, all documents
> crawled on server X are later dispatched through Gearman to workers on
> that same server. What we dispatch through Gearman is only the document
> id; the document data stays in a Memcached instance on the crawling
> server, so network traffic is kept to a minimum.
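
That id-only dispatch pattern carries over fairly directly. A minimal
sketch, assuming the ids arrive as a DStream[String] named docIds (e.g.
from Kafka, sketched further down) and the bodies live in Memcached keyed
by id; the spymemcached client, host, and port are placeholders. One
caveat: Spark does not guarantee the task lands on the crawling server
itself, so a body may still cross the network once when it is fetched:

    import java.net.InetSocketAddress
    import net.spy.memcached.MemcachedClient

    val docsById = docIds.mapPartitions { ids =>
      // One client per partition: the ~50 KB bodies are pulled directly by
      // the executor processing the ids, never routed through the driver.
      val mc = new MemcachedClient(new InetSocketAddress("crawler-host", 11211))
      val fetched = ids.map(id => (id, mc.get(id).asInstanceOf[String])).toList
      mc.shutdown()
      fetched.iterator
    }
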
>
> Is it feasible to remove all the database queues and Gearman and move to
> Spark Streaming? We are also evaluating adding Kafka to the system.
> Is anyone using Spark Streaming for a system like ours?
> Should we worry about the network traffic, or is that something Spark can
> manage without problems? Each document is around 50 KB (roughly 300 GB a
> day). If we wanted to isolate some steps so they run on the same
> machine(s), or give them priority, is that something we could do with
> Spark?
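
On the Kafka point: a minimal sketch of wiring it in as the queue that
replaces Gearman and the database queues, using the receiver-based
KafkaUtils.createStream from the spark-streaming-kafka module. The
ZooKeeper quorum, consumer group, topic name, and thread count below are
all placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val docIds = KafkaUtils.createStream(
      ssc,                          // the StreamingContext from the first sketch
      "zk-host:2181",               // ZooKeeper quorum
      "doc-pipeline",               // consumer group id
      Map("crawled-doc-ids" -> 2)   // topic -> number of receiver threads
    ).map(_._2)                     // keep the message value: the document id
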
>
> Any help or comments will be appreciated. And if someone has run into a
> similar problem and has knowledge about the architectural approach, that
> would be more than welcome.
>
> Thanks
>
>
